Dear Frank,

Sorry for the late reply and thanks for your great help. I'm doing some
research work on MADlib. I will follow your advice to test MADlib again.
Another question: does MADlib LR support tuning the learning rate?

In MADlib SVM, 'svm_classification' has a 'params' argument that can tune
'init_stepsize' and 'decay_factor', as follows.

svm_classification(
    source_table,
    model_table,
    dependent_varname,
    independent_varname,
    kernel_func,
    kernel_params,
    grouping_col,
    params,
    verbose
    )
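
For example, I can set them like this (a sketch; the table names and the
step-size/decay values are just illustrative):

SELECT madlib.svm_classification('forest', 'forest_svm_out', 'labeli', 'vec',
                                 NULL, NULL, NULL,    -- kernel_func, kernel_params, grouping_col
                                 'init_stepsize=0.05, decay_factor=0.9, max_iter=20');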

However, I did not see such a 'params' argument in LR:

logregr_train( source_table,
               out_table,
               dependent_varname,
               independent_varname,
               grouping_cols,
               max_iter,
               optimizer,
               tolerance,
               verbose
             )

In addition, I checked the Generalized Linear Models, and its 'optim_params'
parameter seems to support tuning only 'tolerance', 'max_iter', and
'optimizer'.
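
For reference, here is how I currently pass 'optim_params' to GLM (a sketch;
the table names and values are illustrative):

SELECT madlib.glm('forest', 'forest_glm_out', 'labeli', 'vec',
                  'family=binomial, link=logit',            -- family_params
                  NULL,                                     -- grouping_col
                  'max_iter=20, optimizer=irls, tolerance=1e-6');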
Is there a way to tune the 'init_stepsize' and 'decay_factor' in LR? Thanks!

Best,
Lijie

On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <[email protected]>
wrote:

> Hello,
>
> Thank you for the questions.
>
> (0)
> Not sure if you are using Postgres just for development or production, but
> keep in mind that MADlib is designed to run on a distributed MPP database
> (Greenplum) with large datasets. It runs fine on Postgres, but obviously
> Postgres won't scale to very large datasets or it will just be too slow.
>
> Also see jupyter notebooks here
>
> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
> for other examples, in case they are of use.
>
>
> (1)
> - there are 2 problems with your dataset for logistic regression:
>
> (i)
> - as per
> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>
>
> the dependent variable is a boolean or an expression that evaluates to
> boolean
> - your data has a dependent variable of -1, but Postgres does not evaluate
> -1 to FALSE, so you should change the -1 values to 0
> - i.e., use 0 for FALSE and 1 for TRUE in Postgres:
> https://www.postgresql.org/docs/12/datatype-boolean.html
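>
> For example, a minimal fix on your 'forest' table (a sketch, assuming the
> labeli column from your schema):
>
> UPDATE forest SET labeli = 0 WHERE labeli = -1;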
>
>
>
> (ii)
> - an intercept variable is not assumed, so it is common to provide an
> explicit intercept term by including a single constant 1 term in the
> independent variable list
> - see the example here
>
> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
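>
> For example, you can prepend the constant term inline rather than reloading
> the data (a sketch, assuming vec is double precision[]):
>
> SELECT madlib.logregr_train('forest', 'forest_logregr_out', 'labeli',
>     'ARRAY[1]::double precision[] || vec', NULL, 20, 'irls');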
>
>
>
> That is why the log_likelihood value is so large; that model is not right.
>
>
> (2)
> if you make the fixes above in (1), it should run OK.  Here are my results
> on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with 10
> tuples:
>
>
> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>
> CREATE TABLE epsilon_sample_10v2 (
>     did serial,
>     vec double precision[],
>     labeli integer
> );
>
> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
> etc.
>
> SELECT madlib.logregr_train('epsilon_sample_10v2',
> 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>
>  logregr_train
> ---------------
>
> (1 row)
>
> Time: 317046.342 ms (05:17.046)
>
> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>   log_likelihood
> -------------------
>  -6.93147180559945
> (1 row)
>
>
> (3)
> - the dataset is not scanned again at the end of every iteration to compute
> training loss/accuracy.  It should only be scanned once per iteration, for
> optimization
>
>
> (4)
> - I thought the verbose parameter should do that, but it does not seem to
> be working for me.  Will need to look into it more.
>
>
> (5)
> - logistic regression and SVM do not currently support a sparse matrix format:
> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
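>
> One possible workaround (a sketch; the dimension 2000 and the 0.0 default
> are illustrative) is to densify a sparse row with the svec module before
> training:
>
> SELECT madlib.svec_cast_positions_float8arr(
>     ARRAY[1,3,5], ARRAY[0.1,0.2,0.3], 2000, 0.0)::float8[] AS vec;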
>
>
> Frank
>
> ------------------------------
> *From:* Lijie Xu <[email protected]>
> *Sent:* Saturday, July 3, 2021 1:21 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Long execution time on MADlib
>
>
>
> Hi All,
>
>
>
> I’m Lijie, and I am running some experiments on MADlib. I found that
> MADlib runs very slowly on some datasets, so I would like to verify my
> settings. Could you help me check the following settings and code? Sorry
> for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>
>
>
> *(1)  **Could you help check whether the data format and scripts I used
> are right for an n-dimensional dataset?*
>
>
>
> I have some training datasets, and each of them has a dense feature array
> (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the
> ‘forest’ dataset (581K tuples) with a 54-dimensional feature array and a
> class label, I first stored it into PostgreSQL using
>
>
>
> <code>
>
>      CREATE TABLE forest (
>
>           did serial,
>
>           vec double precision[],
>
>           labeli integer);
>
>
>
>       COPY forest (vec, labeli) FROM STDIN;
>
>       ‘[0.1, 0.2, …, 1.0], -1’
>
>       ‘[0.3, 0.1, …, 0.9], 1’
>
>       …
>
> </code>
>
>
>
>
>
>         Then, to run the Logistic Regression on this dataset, I use the
> following code:
>
>
>
> <code>
>
> mldb=# \d forest
>
>                                Table "public.forest"
>
>  Column |        Type        |
> Modifiers
>
>
> --------+--------------------+------------------------------------------------------
>
>  did    | integer            | not null default
> nextval('forest_did_seq'::regclass)
>
>  vec    | double precision[] |
>
>  labeli | integer            |
>
>
>
> mldb=# SELECT madlib.logregr_train(
>
> mldb(#     'forest',                                 -- source table
>
> mldb(#     'forest_logregr_out',                     -- output table
>
> mldb(#     'labeli',                                 -- labels
>
> mldb(#     'vec',                                    -- features
>
> mldb(#     NULL,                                     -- grouping columns
>
> mldb(#     20,                                       -- max number of
> iteration
>
> mldb(#     'igd'                                     -- optimizer
>
> mldb(#     );
>
>
>
> Time: 198911.350 ms
>
> </code>
>
>
>
> After about 199s, I got the following output table:
>
> <code>
>
> mldb=# \d forest_logregr_out
>
>              Table "public.forest_logregr_out"
>
>           Column          |        Type        | Modifiers
>
> --------------------------+--------------------+-----------
>
>  coef                     | double precision[] |
>
>  log_likelihood           | double precision   |
>
>  std_err                  | double precision[] |
>
>  z_stats                  | double precision[] |
>
>  p_values                 | double precision[] |
>
>  odds_ratios              | double precision[] |
>
>  condition_no             | double precision   |
>
>  num_rows_processed       | bigint             |
>
>  num_missing_rows_skipped | bigint             |
>
>  num_iterations           | integer            |
>
>  variance_covariance      | double precision[] |
>
>
>
> mldb=# select log_likelihood from forest_logregr_out;
>
>   log_likelihood
>
> ------------------
>
>  -426986.83683879
>
> (1 row)
>
> </code>
>
>
>
> Is this procedure correct?
>
>
>
> *(2)  **Training on a 2,000-dimensional dense dataset (epsilon) is very
> slow:*
>
>
>
>            While training on a 2,000-dimensional dense dataset
> (epsilon_sample_10) with only *10 tuples* as follows, MADlib does not
> finish in 5 hours *for only 1 iteration*. The CPU usage is always 100%
> during the execution. The dataset is available at
> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql
> .
>
>
>
> <code>
>
> mldb=# \d epsilon_sample_10
>
>                                Table "public.epsilon_sample_10"
>
>  Column |        Type        |
>             Modifiers
>
>
> --------+--------------------+-----------------------------------------------------------------
>
>  did    | integer            | not null default
> nextval('epsilon_sample_10_did_seq'::regclass)
>
>  vec    | double precision[] |
>
>  labeli | integer            |
>
>
>
> mldb=# SELECT count(*) from epsilon_sample_10;
>
>  count
>
> -------
>
>     10
>
> (1 row)
>
>
>
> Time: 1.456 ms
>
>
>
> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
> 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>
> </code>
>
>
>
> *In this case, it is not possible to train the whole epsilon dataset (with
> 400,000 tuples) in a reasonable time. I guess that this problem is related
> to TOAST, since epsilon has a high dimension and it is compressed by TOAST.
> However, are there any other reasons for such slow execution?*
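>
> One way I could check this (a sketch using standard Postgres size
> functions; note that SET STORAGE affects newly written rows only):
>
> <code>
>
> SELECT pg_size_pretty(pg_relation_size('epsilon_sample_10'))  AS heap_size,
>        pg_size_pretty(pg_total_relation_size('epsilon_sample_10')
>                       - pg_relation_size('epsilon_sample_10')) AS toast_and_indexes;
>
> -- store vec out-of-line but uncompressed
> ALTER TABLE epsilon_sample_10 ALTER COLUMN vec SET STORAGE EXTERNAL;
>
> </code>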
>
>
>
> *(3)  **For MADlib, is the dataset table scanned once or twice in each
> iteration?*
>
> I know that, in each iteration, MADlib needs to scan the dataset table
> once to perform IGD/SGD on the whole dataset. My question is whether, *at
> the end of each iteration*, MADlib will scan the table again to compute
> the training loss/accuracy.
>
>
>
> *(4)  **Is it possible to output the training metrics, such as training
> loss and accuracy after each iteration?*
>
> Currently, it seems that MADlib only outputs the log-likelihood at the end
> of the SQL execution.
>
>
>
> *(5)  **Do MADlib’s Logistic Regression and SVM support sparse datasets?*
>
> I also have some sparse datasets denoted as ‘feature_index_vec_array,
> feature_value_array, label’, such as ‘[1, 3, 5], [0.1, 0.2, 0.3], -1’. Can
> I train these sparse datasets on MADlib using LR and SVM?
>
>
>
> Many thanks for reviewing my questions.
>
>
>
>
>
> Best regards,
>
>
>
> Lijie
>
