Hi All,
I’m Lijie, and I’m currently running some experiments on MADlib. I found
that MADlib runs very slowly on some datasets, so I would like to verify my
settings. Could you help me check the following settings and code? Sorry
for the long email. I am using the latest MADlib (1.18) on PostgreSQL 12.
*(1) Could you help check whether the data format and scripts I used are
right for an n-dimensional dataset?*
I have some training datasets, and each of them has a dense feature array
(like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the
‘forest’ dataset (581K tuples) with a 54-dimensional feature array and a
class label, I first stored it into PostgreSQL using
<code>
CREATE TABLE forest (
    did serial,
    vec double precision[],
    labeli integer
);
-- COPY's text format expects tab-separated fields and {}-style array literals:
COPY forest (vec, labeli) FROM STDIN;
{0.1, 0.2, …, 1.0}	-1
{0.3, 0.1, …, 0.9}	1
…
\.
</code>
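To make sure the load itself is not the problem, I also sanity-checked the
table with plain PostgreSQL (54 is the dimensionality of this particular
dataset):

<code>
-- Count rows whose feature array is missing or not 54-dimensional;
-- I expect this to return 0.
SELECT count(*) AS bad_rows
FROM forest
WHERE vec IS NULL OR array_length(vec, 1) <> 54;
</code>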
Then, to run the Logistic Regression on this dataset, I use the
following code:
<code>
mldb=# \d forest
                          Table "public.forest"
 Column |        Type        |                      Modifiers
--------+--------------------+------------------------------------------------------
 did    | integer            | not null default nextval('forest_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |
mldb=# SELECT madlib.logregr_train(
mldb(#     'forest',              -- source table
mldb(#     'forest_logregr_out',  -- output table
mldb(#     'labeli',              -- labels
mldb(#     'vec',                 -- features
mldb(#     NULL,                  -- grouping columns
mldb(#     20,                    -- max number of iterations
mldb(#     'igd'                  -- optimizer
mldb(# );
Time: 198911.350 ms
</code>
After about 199 s, I got the following output table:
<code>
mldb=# \d forest_logregr_out
Table "public.forest_logregr_out"
Column | Type | Modifiers
--------------------------+--------------------+-----------
coef | double precision[] |
log_likelihood | double precision |
std_err | double precision[] |
z_stats | double precision[] |
p_values | double precision[] |
odds_ratios | double precision[] |
condition_no | double precision |
num_rows_processed | bigint |
num_missing_rows_skipped | bigint |
num_iterations | integer |
variance_covariance | double precision[] |
mldb=# select log_likelihood from forest_logregr_out;
log_likelihood
------------------
-426986.83683879
(1 row)
</code>
Is this procedure correct?
*(2) Training on a 2,000-dimensional dense dataset (epsilon) is very
slow:*
While training on the 2,000-dimensional dense dataset (epsilon_sample_10)
with only *10 tuples* as follows, MADlib does not finish *a single
iteration* in 5 hours. CPU usage stays at 100% throughout the execution.
The dataset is available at
https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.
<code>
mldb=# \d epsilon_sample_10
                          Table "public.epsilon_sample_10"
 Column |        Type        |                           Modifiers
--------+--------------------+-----------------------------------------------------------------
 did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |
mldb=# SELECT count(*) from epsilon_sample_10;
count
-------
10
(1 row)
Time: 1.456 ms
mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
</code>
*At this rate, it is clearly not feasible to train the full epsilon dataset
(400,000 tuples) in a reasonable time. My guess is that the problem is
related to TOAST, since the epsilon feature arrays are very wide and are
therefore compressed and stored out-of-line by TOAST. However, are there
any other possible reasons for such slow execution?*
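To test my TOAST guess on my side, I am planning to check the stored size of
the arrays and to switch the column to uncompressed out-of-line storage (this
is standard PostgreSQL, not MADlib-specific; please correct me if this test
is misguided):

<code>
-- How large is each stored vec value (possibly compressed)?
SELECT did, pg_column_size(vec) AS stored_bytes
FROM epsilon_sample_10;

-- Keep vec out-of-line but uncompressed, then rewrite existing rows so
-- they are re-stored under the new strategy, to see whether repeated
-- decompression explains the slowdown.
ALTER TABLE epsilon_sample_10 ALTER COLUMN vec SET STORAGE EXTERNAL;
UPDATE epsilon_sample_10 SET vec = vec;
</code>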
*(3) For MADlib, is the dataset table scanned once or twice in each
iteration?*
I understand that, in each iteration, MADlib needs to scan the dataset
table once to perform IGD/SGD over the whole dataset. My question is: *at
the end of each iteration*, does MADlib scan the table again to compute the
training loss/accuracy?
*(4) Is it possible to output training metrics, such as training loss and
accuracy, after each iteration?*
Currently, MADlib seems to output only the log-likelihood, and only at the
end of the SQL execution.
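The only workaround I can think of is to retrain from scratch with
increasing iteration caps and compare the reported log-likelihoods, along
these lines (the output table names are my own; total cost is roughly
quadratic in the number of points on the curve, so this is obviously not
ideal):

<code>
-- Each call retrains from scratch with a different max_num_iterations.
SELECT madlib.logregr_train('forest', 'forest_out_iter1', 'labeli', 'vec', NULL, 1, 'igd');
SELECT madlib.logregr_train('forest', 'forest_out_iter2', 'labeli', 'vec', NULL, 2, 'igd');

-- Compare the final log-likelihood of each run.
SELECT 1 AS k, log_likelihood FROM forest_out_iter1
UNION ALL
SELECT 2 AS k, log_likelihood FROM forest_out_iter2
ORDER BY k;
</code>

Is there a built-in way to get this per-iteration information instead?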
*(5) Do MADlib’s Logistic Regression and SVM support sparse datasets?*
I also have some sparse datasets stored as ‘feature_index_array,
feature_value_array, label’ tuples, such as ‘[1, 3, 5], [0.1, 0.2, 0.3],
-1’. Can I train LR and SVM on these sparse datasets in MADlib?
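If not, my fallback is to densify them myself in plain SQL and reuse the
dense code path above. A sketch, assuming a hypothetical table
forest_sparse(did, idx, val, labeli) with 1-based indices and a full
dimensionality of 54 (all of these names and numbers are my own):

<code>
-- Expand (index array, value array) pairs into a dense double precision[]
-- of length 54, filling absent positions with 0.0.
CREATE TABLE forest_sparse_dense AS
SELECT s.did,
       ARRAY(
           SELECT COALESCE(
                      (SELECT s.val[p]
                       FROM generate_series(1, array_length(s.idx, 1)) AS p
                       WHERE s.idx[p] = g.i
                       LIMIT 1),
                      0.0)
           FROM generate_series(1, 54) AS g(i)
           ORDER BY g.i
       ) AS vec,
       s.labeli
FROM forest_sparse s;
</code>

But if LR/SVM can consume a sparse representation directly, I would much
prefer that, since densifying epsilon-sized data defeats the purpose.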
Many thanks for reviewing my questions.
Best regards,
Lijie