Thank you, Aaron. I will check it.

Best,
Lijie
On Fri, Jul 16, 2021 at 2:17 AM FENG, Xixuan (Aaron) <[email protected]> wrote:

> My guess is that logregr computes a matrix X'AX, which is big when you have 2,000 features. The matrix is not needed for training the model, only for computing stderr after training. You could probably remove the matrix entirely, but from an engineering perspective that is more difficult than just changing the step size, since you would need to take care of serializing the user-defined function states…
>
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L851
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L821
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L930
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L1052
>
> On Fri, Jul 16, 2021 at 1:43 AM Lijie Xu <[email protected]> wrote:
>
>> Dear Aaron,
>>
>> Thanks for your advice. I will try it.
>>
>> In addition, after following Frank's guide, I found that MADlib LR and SVM work normally on some low-dimensional (e.g., 18-28 features) datasets, even with more than 1 million tuples. However, on a high-dimensional dataset such as epsilon, with 400,000 tuples and 2,000 features (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), MADlib SVM can finish 20 iterations in a reasonable time, but MADlib LR (with IGD) cannot finish 2 iterations in several hours. Any ideas about this problem? Thanks!
>>
>> Best,
>> Lijie
>>
>> On Thu, Jul 15, 2021 at 4:03 PM FENG, Xixuan (Aaron) <[email protected]> wrote:
>>
>>> Hi Lijie,
>>>
>>> I implemented logregr with incremental gradient descent a few years ago. Unfortunately, at that time we chose to hard-code a constant step size. But luckily you can edit the code as you need.
>>>
>>> Here are the pointers:
>>>
>>> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
>>> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918
>>>
>>> Good luck!
>>> Aaron
>>>
>>> On Thu, Jul 15, 2021 at 10:14 PM Lijie Xu <[email protected]> wrote:
>>>
>>>> Dear Frank,
>>>>
>>>> Sorry for the late reply, and thanks for your great help. I'm doing some research work on MADlib and will follow your advice to test it again. Another question: does MADlib LR support tuning the learning rate?
>>>>
>>>> In MADlib SVM, there is a 'params' argument in 'svm_classification' for tuning 'init_stepsize' and 'decay_factor', as follows:
>>>>
>>>> svm_classification(
>>>>     source_table,
>>>>     model_table,
>>>>     dependent_varname,
>>>>     independent_varname,
>>>>     kernel_func,
>>>>     kernel_params,
>>>>     grouping_col,
>>>>     params,
>>>>     verbose
>>>> )
>>>>
>>>> However, I did not see such a 'params' argument in LR:
>>>>
>>>> logregr_train(
>>>>     source_table,
>>>>     out_table,
>>>>     dependent_varname,
>>>>     independent_varname,
>>>>     grouping_cols,
>>>>     max_iter,
>>>>     optimizer,
>>>>     tolerance,
>>>>     verbose
>>>> )
>>>>
>>>> In addition, I checked the Generalized Linear Models module, and its 'optim_params' parameter seems to support tuning only 'tolerance', 'max_iter', and 'optimizer'. Is there a way to tune 'init_stepsize' and 'decay_factor' in LR? Thanks!
>>>>
>>>> Best,
>>>> Lijie
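For reference, the step-size knobs mentioned above are passed to SVM through the 'params' string. A minimal sketch against the epsilon_sample_10 table that appears later in this thread (the model table name and the parameter values are illustrative, not recommendations):

<code>
SELECT madlib.svm_classification(
    'epsilon_sample_10',   -- source_table
    'epsilon_svm_out',     -- model_table (illustrative name)
    'labeli',              -- dependent_varname
    'vec',                 -- independent_varname
    'linear',              -- kernel_func
    NULL,                  -- kernel_params
    NULL,                  -- grouping_col
    'init_stepsize=0.01, decay_factor=0.9, max_iter=20'  -- params
);
</code>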
>>>> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for the questions.
>>>>>
>>>>> (0)
>>>>> Not sure if you are using Postgres just for development or for production, but keep in mind that MADlib is designed to run on a distributed MPP database (Greenplum) with large datasets. It runs fine on Postgres, but obviously Postgres won't scale to very large datasets, or it will just be too slow.
>>>>>
>>>>> Also see the Jupyter notebooks at
>>>>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>>>>> for other examples that may be of use.
>>>>>
>>>>> (1)
>>>>> There are two problems with your dataset for logistic regression:
>>>>>
>>>>> (i)
>>>>> As per
>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>>>>> the dependent variable must be a boolean or an expression that evaluates to boolean. Your data has a dependent variable of -1, but Postgres does not evaluate -1 to FALSE, so you should change the -1 to 0; i.e., use 0 for FALSE and 1 for TRUE in Postgres:
>>>>> https://www.postgresql.org/docs/12/datatype-boolean.html
>>>>>
>>>>> (ii)
>>>>> An intercept variable is not assumed, so it is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list. See the example at
>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
>>>>>
>>>>> That is why the log_likelihood value is too big; that model is not right.
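A minimal sketch of both fixes in a single call, using a boolean expression for the label and an explicit intercept term (the 'forest' table and its columns are the ones described later in this thread; the expression-based approach follows the logistic regression docs linked above):

<code>
SELECT madlib.logregr_train(
    'forest',                                   -- source table
    'forest_logregr_out',                       -- output table
    'labeli = 1',                               -- (i) boolean expression: +1 -> TRUE, -1 -> FALSE
    'array_prepend(1::double precision, vec)',  -- (ii) explicit intercept term + features
    NULL,                                       -- grouping columns
    20,                                         -- max iterations
    'igd'                                       -- optimizer
);
</code>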
>>>>> (2)
>>>>> If you make the fixes above in (1), it should run OK. Here are my results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with 10 tuples:
>>>>>
>>>>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>>>>
>>>>> CREATE TABLE epsilon_sample_10v2 (
>>>>>     did serial,
>>>>>     vec double precision[],
>>>>>     labeli integer
>>>>> );
>>>>>
>>>>> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>>>>> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>>>>> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>>>>> etc.
>>>>>
>>>>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>>>> 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>>>>
>>>>>  logregr_train
>>>>> ---------------
>>>>>
>>>>> (1 row)
>>>>>
>>>>> Time: 317046.342 ms (05:17.046)
>>>>>
>>>>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>>>>   log_likelihood
>>>>> -------------------
>>>>>  -6.93147180559945
>>>>> (1 row)
>>>>>
>>>>> (3)
>>>>> The dataset is not scanned again at the end of every iteration to compute training loss/accuracy. It should be scanned only once per iteration, for the optimization itself.
>>>>>
>>>>> (4)
>>>>> I thought the verbose parameter should do that, but it does not seem to be working for me. I will need to look into it more.
>>>>>
>>>>> (5)
>>>>> Logistic regression and SVM do not currently support sparse matrix format (a densification workaround is sketched after this message):
>>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
>>>>>
>>>>> Frank
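Following up on (5): since logregr_train and svm_classification expect a dense double precision[] column, one workaround is to densify the sparse representation in plain SQL before training. A sketch, assuming a hypothetical table sparse_data(did, idx, vals, labeli) holding 1-based index arrays and value arrays, with a fixed dimensionality of 2,000:

<code>
-- Expand each (idx, vals) pair into a full 2,000-element dense array,
-- filling positions that the sparse representation omits with 0.0.
CREATE TABLE dense_data AS
SELECT s.did,
       (SELECT array_agg(coalesce(v.val, 0.0) ORDER BY d.i)
          FROM generate_series(1, 2000) AS d(i)
          LEFT JOIN unnest(s.idx, s.vals) AS v(pos, val) ON v.pos = d.i) AS vec,
       s.labeli
FROM sparse_data AS s;
</code>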
>>>>> ------------------------------
>>>>> From: Lijie Xu <[email protected]>
>>>>> Sent: Saturday, July 3, 2021 1:21 PM
>>>>> To: [email protected] <[email protected]>
>>>>> Subject: Long execution time on MADlib
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm Lijie, and I am running some experiments on MADlib. I found that MADlib runs very slowly on some datasets, so I would like to validate my settings. Could you help me check the following settings and code? Sorry for the long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>>>>
>>>>> *(1) Could you help check whether the data format and scripts I used are right for an n-dimensional dataset?*
>>>>>
>>>>> I have some training datasets, and each of them has a dense feature array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the 'forest' dataset (581K tuples) with a 54-dimensional feature array and a class label, I first stored it in PostgreSQL using:
>>>>>
>>>>> <code>
>>>>> CREATE TABLE forest (
>>>>>     did serial,
>>>>>     vec double precision[],
>>>>>     labeli integer);
>>>>>
>>>>> COPY forest (vec, labeli) FROM STDIN;
>>>>> '[0.1, 0.2, …, 1.0], -1'
>>>>> '[0.3, 0.1, …, 0.9], 1'
>>>>> …
>>>>> </code>
>>>>>
>>>>> Then, to run logistic regression on this dataset, I use the following code:
>>>>>
>>>>> <code>
>>>>> mldb=# \d forest
>>>>>                        Table "public.forest"
>>>>>  Column |        Type        |                      Modifiers
>>>>> --------+--------------------+------------------------------------------------------
>>>>>  did    | integer            | not null default nextval('forest_did_seq'::regclass)
>>>>>  vec    | double precision[] |
>>>>>  labeli | integer            |
>>>>>
>>>>> mldb=# SELECT madlib.logregr_train(
>>>>> mldb(#     'forest',             -- source table
>>>>> mldb(#     'forest_logregr_out', -- output table
>>>>> mldb(#     'labeli',             -- labels
>>>>> mldb(#     'vec',                -- features
>>>>> mldb(#     NULL,                 -- grouping columns
>>>>> mldb(#     20,                   -- max number of iterations
>>>>> mldb(#     'igd'                 -- optimizer
>>>>> mldb(# );
>>>>>
>>>>> Time: 198911.350 ms
>>>>> </code>
>>>>>
>>>>> After about 199 s, I got the output table:
>>>>>
>>>>> <code>
>>>>> mldb=# \d forest_logregr_out
>>>>>        Table "public.forest_logregr_out"
>>>>>           Column          |        Type        | Modifiers
>>>>> --------------------------+--------------------+-----------
>>>>>  coef                     | double precision[] |
>>>>>  log_likelihood           | double precision   |
>>>>>  std_err                  | double precision[] |
>>>>>  z_stats                  | double precision[] |
>>>>>  p_values                 | double precision[] |
>>>>>  odds_ratios              | double precision[] |
>>>>>  condition_no             | double precision   |
>>>>>  num_rows_processed       | bigint             |
>>>>>  num_missing_rows_skipped | bigint             |
>>>>>  num_iterations           | integer            |
>>>>>  variance_covariance      | double precision[] |
>>>>>
>>>>> mldb=# select log_likelihood from forest_logregr_out;
>>>>>   log_likelihood
>>>>> ------------------
>>>>>  -426986.83683879
>>>>> (1 row)
>>>>> </code>
>>>>>
>>>>> Is this procedure correct?
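One note on the COPY block in (1): it is written as pseudocode. PostgreSQL's COPY text format expects brace-delimited array literals with tab-separated columns, as in Frank's epsilon_sample_10v2 example above. A shortened illustration (the values are placeholders):

<code>
COPY forest (vec, labeli) FROM STDIN;
{0.1,0.2,1.0}	-1
{0.3,0.1,0.9}	1
\.
</code>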
>>>>> *(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow.*
>>>>>
>>>>> While training on a 2,000-dimensional dense dataset (epsilon_sample_10) with only *10 tuples* as follows, MADlib does not finish in 5 hours *for only 1 iteration*. The CPU usage stays at 100% during the execution. The dataset is available at
>>>>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql
>>>>>
>>>>> <code>
>>>>> mldb=# \d epsilon_sample_10
>>>>>                   Table "public.epsilon_sample_10"
>>>>>  Column |        Type        |                            Modifiers
>>>>> --------+--------------------+-----------------------------------------------------------------
>>>>>  did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
>>>>>  vec    | double precision[] |
>>>>>  labeli | integer            |
>>>>>
>>>>> mldb=# SELECT count(*) from epsilon_sample_10;
>>>>>  count
>>>>> -------
>>>>>     10
>>>>> (1 row)
>>>>>
>>>>> Time: 1.456 ms
>>>>>
>>>>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>>>> 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>>>>> </code>
>>>>>
>>>>> *In this case, it is not possible to train the whole epsilon dataset (400,000 tuples) in a reasonable time. I guess this problem is related to TOAST, since epsilon is high-dimensional and is compressed by TOAST. However, are there any other reasons for such slow execution?*
>>>>>
>>>>> *(3) For MADlib, is the dataset table scanned once or twice in each iteration?*
>>>>>
>>>>> I know that, in each iteration, MADlib needs to scan the dataset table once to perform IGD/SGD over the whole dataset. My question is: *at the end of each iteration*, will MADlib scan the table again to compute the training loss/accuracy?
>>>>>
>>>>> *(4) Is it possible to output training metrics, such as training loss and accuracy, after each iteration?*
>>>>>
>>>>> Currently, it seems that MADlib only outputs the log-likelihood at the end of the SQL execution.
>>>>>
>>>>> *(5) Do MADlib's logistic regression and SVM support sparse datasets?*
>>>>>
>>>>> I also have some sparse datasets denoted as 'feature_index_array, feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can I train these sparse datasets with LR and SVM in MADlib?
>>>>>
>>>>> Many thanks for reviewing my questions.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Lijie
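A footnote on question (2): one way to check whether the 2,000-element vectors are in fact TOAST-compressed is to compare the stored datum size against the raw payload size (a sketch using stock PostgreSQL functions; epsilon_sample_10 is the table from the thread):

<code>
-- pg_column_size() on a stored column reports the on-disk (possibly
-- compressed/TOASTed) size of each datum; 2,000 float8 elements occupy
-- 16,000 bytes of raw payload, excluding the array header.
SELECT did,
       pg_column_size(vec) AS stored_bytes,
       2000 * 8            AS raw_payload_bytes
FROM epsilon_sample_10
LIMIT 3;
</code>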
