Thank you, Aaron. I will check it.

Best,
Lijie
On Fri, Jul 16, 2021 at 2:17 AM FENG, Xixuan (Aaron) <[email protected]> wrote:

> My guess is that logregr computes a matrix X'AX, which is big when you have 2,000 features. The matrix is not needed for training the model, only for computing stderr after training. You could probably remove the matrix entirely, but from an engineering perspective that is more difficult than just changing the step size, since you would need to take care of serializing the user-defined function states…
>
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L851
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L821
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L930
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L1052
>
> On Fri, Jul 16, 2021 at 1:43 AM Lijie Xu <[email protected]> wrote:
>
>> Dear Aaron,
>>
>> Thanks for your advice. I will try it.
>>
>> In addition, after following Frank's guide, I found that MADlib LR and SVM work normally on some low-dimensional (e.g., 18-28 features) datasets, even with more than 1 million tuples. However, on a high-dimensional dataset such as epsilon, with 400,000 tuples and 2,000 features (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), MADlib SVM can finish 20 iterations in a reasonable time, but MADlib LR (with IGD) cannot finish 2 iterations in several hours. Any ideas about this problem? Thanks!
>>
>> Best,
>> Lijie
>>
>> On Thu, Jul 15, 2021 at 4:03 PM FENG, Xixuan (Aaron) <[email protected]> wrote:
>>
>>> Hi Lijie,
>>>
>>> I implemented logregr with incremental gradient descent a few years ago. Unfortunately, at that time we chose to hard-code a constant step size. But luckily you can edit the code as you need.
>>>
>>> Here are the pointers:
>>>
>>> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
>>> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918
>>>
>>> Good luck!
>>> Aaron
>>>
>>> On Thu, Jul 15, 2021 at 10:14 PM Lijie Xu <[email protected]> wrote:
>>>
>>>> Dear Frank,
>>>>
>>>> Sorry for the late reply, and thanks for your great help. I'm doing some research work on MADlib and will follow your advice to test it again. Another question: does MADlib LR support tuning the learning rate?
>>>>
>>>> In MADlib SVM, there is a 'params' argument in 'svm_classification' for tuning 'init_stepsize' and 'decay_factor', as follows:
>>>>
>>>> svm_classification(
>>>>     source_table,
>>>>     model_table,
>>>>     dependent_varname,
>>>>     independent_varname,
>>>>     kernel_func,
>>>>     kernel_params,
>>>>     grouping_col,
>>>>     params,
>>>>     verbose
>>>> )
>>>>
>>>> However, I did not see such a 'params' argument in LR:
>>>>
>>>> logregr_train(
>>>>     source_table,
>>>>     out_table,
>>>>     dependent_varname,
>>>>     independent_varname,
>>>>     grouping_cols,
>>>>     max_iter,
>>>>     optimizer,
>>>>     tolerance,
>>>>     verbose
>>>> )
>>>>
>>>> In addition, I checked the Generalized Linear Models module, and its 'optim_params' parameter seems to support tuning only 'tolerance', 'max_iter', and 'optimizer'. Is there a way to tune 'init_stepsize' and 'decay_factor' in LR? Thanks!
>>>>
>>>> Best,
>>>> Lijie
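For reference, the step-size knobs mentioned above are passed to SVM through the 'params' string. A minimal sketch against the epsilon_sample_10 table that appears later in this thread (the model table name and the parameter values are illustrative, not recommendations):

<code>
SELECT madlib.svm_classification(
    'epsilon_sample_10',   -- source_table
    'epsilon_svm_out',     -- model_table (illustrative name)
    'labeli',              -- dependent_varname
    'vec',                 -- independent_varname
    'linear',              -- kernel_func
    NULL,                  -- kernel_params
    NULL,                  -- grouping_col
    'init_stepsize=0.01, decay_factor=0.9, max_iter=20'  -- params
);
</code>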
>>>> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for the questions.
>>>>>
>>>>> (0)
>>>>> Not sure if you are using Postgres just for development or for production, but keep in mind that MADlib is designed to run on a distributed MPP database (Greenplum) with large datasets. It runs fine on Postgres, but obviously Postgres won't scale to very large datasets, or it will just be too slow.
>>>>>
>>>>> Also see the Jupyter notebooks at
>>>>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>>>>> for other examples that may be of use.
>>>>>
>>>>> (1)
>>>>> There are two problems with your dataset for logistic regression:
>>>>>
>>>>> (i)
>>>>> As per
>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>>>>> the dependent variable must be a boolean or an expression that evaluates to boolean. Your data has a dependent variable of -1, but Postgres does not evaluate -1 to FALSE, so you should change the -1 to 0; i.e., use 0 for FALSE and 1 for TRUE in Postgres:
>>>>> https://www.postgresql.org/docs/12/datatype-boolean.html
>>>>>
>>>>> (ii)
>>>>> An intercept variable is not assumed, so it is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list. See the example at
>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
>>>>>
>>>>> That is why the log_likelihood value is too big; that model is not right.
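A minimal sketch of both fixes in a single call, using a boolean expression for the label and an explicit intercept term (the 'forest' table and its columns are the ones described later in this thread; the expression-based approach follows the logistic regression docs linked above):

<code>
SELECT madlib.logregr_train(
    'forest',                                   -- source table
    'forest_logregr_out',                       -- output table
    'labeli = 1',                               -- (i) boolean expression: +1 -> TRUE, -1 -> FALSE
    'array_prepend(1::double precision, vec)',  -- (ii) explicit intercept term + features
    NULL,                                       -- grouping columns
    20,                                         -- max iterations
    'igd'                                       -- optimizer
);
</code>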
>>>>> (2)
>>>>> If you make the fixes above in (1), it should run OK. Here are my results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with 10 tuples:
>>>>>
>>>>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>>>>
>>>>> CREATE TABLE epsilon_sample_10v2 (
>>>>>     did serial,
>>>>>     vec double precision[],
>>>>>     labeli integer
>>>>> );
>>>>>
>>>>> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>>>>> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>>>>> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>>>>> etc.
>>>>>
>>>>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>>>> 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>>>>
>>>>>  logregr_train
>>>>> ---------------
>>>>>
>>>>> (1 row)
>>>>>
>>>>> Time: 317046.342 ms (05:17.046)
>>>>>
>>>>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>>>>   log_likelihood
>>>>> -------------------
>>>>>  -6.93147180559945
>>>>> (1 row)
>>>>>
>>>>> (3)
>>>>> The dataset is not scanned again at the end of every iteration to compute training loss/accuracy. It should be scanned only once per iteration, for the optimization itself.
>>>>>
>>>>> (4)
>>>>> I thought the verbose parameter should do that, but it does not seem to be working for me. I will need to look into it more.
>>>>>
>>>>> (5)
>>>>> Logistic regression and SVM do not currently support sparse matrix format (a densification workaround is sketched after this message):
>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
>>>>>
>>>>> Frank
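Following up on (5): since logregr_train and svm_classification expect a dense double precision[] column, one workaround is to densify the sparse representation in plain SQL before training. A sketch, assuming a hypothetical table sparse_data(did, idx, vals, labeli) holding 1-based index arrays and value arrays, with a fixed dimensionality of 2,000:

<code>
-- Expand each (idx, vals) pair into a full 2,000-element dense array,
-- filling positions that the sparse representation omits with 0.0.
CREATE TABLE dense_data AS
SELECT s.did,
       (SELECT array_agg(coalesce(v.val, 0.0) ORDER BY d.i)
          FROM generate_series(1, 2000) AS d(i)
          LEFT JOIN unnest(s.idx, s.vals) AS v(pos, val) ON v.pos = d.i) AS vec,
       s.labeli
FROM sparse_data AS s;
</code>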
>>>>> ------------------------------
>>>>> From: Lijie Xu <[email protected]>
>>>>> Sent: Saturday, July 3, 2021 1:21 PM
>>>>> To: [email protected] <[email protected]>
>>>>> Subject: Long execution time on MADlib
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm Lijie, and I am running some experiments on MADlib. I found that MADlib runs very slowly on some datasets, so I would like to validate my settings. Could you help me check the following settings and code? Sorry for the long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>>>>
>>>>> *(1) Could you help check whether the data format and scripts I used are right for an n-dimensional dataset?*
>>>>>
>>>>> I have some training datasets, and each of them has a dense feature array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the 'forest' dataset (581K tuples) with a 54-dimensional feature array and a class label, I first stored it in PostgreSQL using:
>>>>>
>>>>> <code>
>>>>> CREATE TABLE forest (
>>>>>     did serial,
>>>>>     vec double precision[],
>>>>>     labeli integer);
>>>>>
>>>>> COPY forest (vec, labeli) FROM STDIN;
>>>>> '[0.1, 0.2, …, 1.0], -1'
>>>>> '[0.3, 0.1, …, 0.9], 1'
>>>>> …
>>>>> </code>
>>>>>
>>>>> Then, to run logistic regression on this dataset, I use the following code:
>>>>>
>>>>> <code>
>>>>> mldb=# \d forest
>>>>>                        Table "public.forest"
>>>>>  Column |        Type        |                      Modifiers
>>>>> --------+--------------------+------------------------------------------------------
>>>>>  did    | integer            | not null default nextval('forest_did_seq'::regclass)
>>>>>  vec    | double precision[] |
>>>>>  labeli | integer            |
>>>>>
>>>>> mldb=# SELECT madlib.logregr_train(
>>>>> mldb(#     'forest',             -- source table
>>>>> mldb(#     'forest_logregr_out', -- output table
>>>>> mldb(#     'labeli',             -- labels
>>>>> mldb(#     'vec',                -- features
>>>>> mldb(#     NULL,                 -- grouping columns
>>>>> mldb(#     20,                   -- max number of iterations
>>>>> mldb(#     'igd'                 -- optimizer
>>>>> mldb(# );
>>>>>
>>>>> Time: 198911.350 ms
>>>>> </code>
>>>>>
>>>>> After about 199 s, I got the output table:
>>>>>
>>>>> <code>
>>>>> mldb=# \d forest_logregr_out
>>>>>        Table "public.forest_logregr_out"
>>>>>           Column          |        Type        | Modifiers
>>>>> --------------------------+--------------------+-----------
>>>>>  coef                     | double precision[] |
>>>>>  log_likelihood           | double precision   |
>>>>>  std_err                  | double precision[] |
>>>>>  z_stats                  | double precision[] |
>>>>>  p_values                 | double precision[] |
>>>>>  odds_ratios              | double precision[] |
>>>>>  condition_no             | double precision   |
>>>>>  num_rows_processed       | bigint             |
>>>>>  num_missing_rows_skipped | bigint             |
>>>>>  num_iterations           | integer            |
>>>>>  variance_covariance      | double precision[] |
>>>>>
>>>>> mldb=# select log_likelihood from forest_logregr_out;
>>>>>   log_likelihood
>>>>> ------------------
>>>>>  -426986.83683879
>>>>> (1 row)
>>>>> </code>
>>>>>
>>>>> Is this procedure correct?
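One note on the COPY block in (1): it is written as pseudocode. PostgreSQL's COPY text format expects brace-delimited array literals with tab-separated columns, as in Frank's epsilon_sample_10v2 example above. A shortened illustration (the values are placeholders):

<code>
COPY forest (vec, labeli) FROM STDIN;
{0.1,0.2,1.0}	-1
{0.3,0.1,0.9}	1
\.
</code>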
>>>>> *(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow.*
>>>>>
>>>>> While training on a 2,000-dimensional dense dataset (epsilon_sample_10) with only *10 tuples* as follows, MADlib does not finish in 5 hours *for only 1 iteration*. The CPU usage stays at 100% during the execution. The dataset is available at
>>>>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql
>>>>>
>>>>> <code>
>>>>> mldb=# \d epsilon_sample_10
>>>>>                   Table "public.epsilon_sample_10"
>>>>>  Column |        Type        |                            Modifiers
>>>>> --------+--------------------+-----------------------------------------------------------------
>>>>>  did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
>>>>>  vec    | double precision[] |
>>>>>  labeli | integer            |
>>>>>
>>>>> mldb=# SELECT count(*) from epsilon_sample_10;
>>>>>  count
>>>>> -------
>>>>>     10
>>>>> (1 row)
>>>>>
>>>>> Time: 1.456 ms
>>>>>
>>>>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>>>> 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>>>>> </code>
>>>>>
>>>>> *In this case, it is not possible to train the whole epsilon dataset (400,000 tuples) in a reasonable time. I guess this problem is related to TOAST, since epsilon is high-dimensional and is compressed by TOAST. However, are there any other reasons for such slow execution?*
>>>>>
>>>>> *(3) For MADlib, is the dataset table scanned once or twice in each iteration?*
>>>>>
>>>>> I know that, in each iteration, MADlib needs to scan the dataset table once to perform IGD/SGD over the whole dataset. My question is: *at the end of each iteration*, will MADlib scan the table again to compute the training loss/accuracy?
>>>>>
>>>>> *(4) Is it possible to output training metrics, such as training loss and accuracy, after each iteration?*
>>>>>
>>>>> Currently, it seems that MADlib only outputs the log-likelihood at the end of the SQL execution.
>>>>>
>>>>> *(5) Do MADlib's logistic regression and SVM support sparse datasets?*
>>>>>
>>>>> I also have some sparse datasets denoted as 'feature_index_array, feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can I train these sparse datasets with LR and SVM in MADlib?
>>>>>
>>>>> Many thanks for reviewing my questions.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Lijie
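A footnote on question (2): one way to check whether the 2,000-element vectors are in fact TOAST-compressed is to compare the stored datum size against the raw payload size (a sketch using stock PostgreSQL functions; epsilon_sample_10 is the table from the thread):

<code>
-- pg_column_size() on a stored column reports the on-disk (possibly
-- compressed/TOASTed) size of each datum; 2,000 float8 elements occupy
-- 16,000 bytes of raw payload, excluding the array header.
SELECT did,
       pg_column_size(vec) AS stored_bytes,
       2000 * 8            AS raw_payload_bytes
FROM epsilon_sample_10
LIMIT 3;
</code>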
