(1) If you want to play around with different data sizes, I recommend using our data generator for linear regression: https://github.com/apache/incubator-systemml/blob/master/scripts/datagen/genRandData4LinearRegression.dml
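As a rough aid for picking the dimensions to generate, here is a back-of-the-envelope sketch in Python (it is not part of SystemML). It assumes a dense matrix stored as 8-byte doubles, so the footprint is roughly rows x columns x 8 bytes; the actual on-disk size depends on the output format and on sparsity. The function name and the 1,000-feature example are hypothetical.

```python
GB = 1024 ** 3  # bytes per gibibyte


def dense_rows_for_size(target_bytes: int, num_features: int, bytes_per_cell: int = 8) -> int:
    """Return how many rows a dense rows x num_features matrix can have
    without exceeding target_bytes, assuming bytes_per_cell bytes per value."""
    if num_features <= 0 or bytes_per_cell <= 0:
        raise ValueError("num_features and bytes_per_cell must be positive")
    return target_bytes // (num_features * bytes_per_cell)


if __name__ == "__main__":
    # Hypothetical target: about 30 GB of dense feature data with 1,000 columns.
    rows = dense_rows_for_size(30 * GB, num_features=1000)
    print(f"rows needed for ~30 GB with 1000 features: {rows}")
```

The resulting row count and the chosen number of features are then what you would pass to the generator as the data dimensions.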
(2) Well, SystemML does not expose these physical data properties to the user (on purpose, to ensure data independence). However, if you are curious: SystemML does a coalesce on checkpoints to reduce the number of partitions to <data size> / <hdfs block size>, but only if this does not reduce the effective degree of parallelism. For example, if you have only 8GB of data, a 128MB HDFS block size, 100 cores, and currently 90 partitions, we would not reduce this to 8GB/128MB=64 partitions because 64<100.

(3) The whole point of running stepwise linear regression is feature selection, so you'll get the selected features and the estimated model parameters of these features, as well as some information on the selection process. You can evaluate this model on a hold-out test set or run some form of cross validation. However, keep in mind that for accuracy experiments you should be very careful with random data.

Regards,
Matthias

From: Wenjie Zhuang <ka...@vt.edu>
To: Matthias Boehm/Almaden/IBM@IBMUS
Cc: dev@systemml.incubator.apache.org
Date: 04/03/2016 06:29 AM
Subject: Re: Guides about running SystemML by spark cluster

Thanks a lot. I also have some other questions. Could you please help me figure them out?

1. If I want the input size to be 30G, how can I set it? I guess I should change parameters X, Y and B, but I'm not sure which script I can use.
2. Do you know how to control the number of partitions when I run StepLinearRegDS.dml on Spark? Is there a configuration file where I can set the partition number?
3. What should the correct result be after running StepLinearRegDS.dml? When the program ends, what do we get?

Thanks & Have a nice day!

On April 3, 2016, 1:08 AM, "Matthias Boehm" <mbo...@us.ibm.com> wrote:

Thanks again for catching https://issues.apache.org/jira/browse/SYSTEMML-609. Yes, the change is in SystemML head now, so please rebuild SystemML or use one of our nightly builds (https://sparktc.ibmcloud.com/repo/latest/). Thanks.

For running SystemML on Spark, you have multiple options (http://apache.github.io/incubator-systemml/#running-systemml): either use MLContext or spark-submit. Since our documentation does not show many examples for spark-submit yet, here is a typical command line invocation:

    ../spark/bin/spark-submit \
      --class org.apache.sysml.api.DMLScript \
      --master yarn-client \
      --num-executors 10 \
      --driver-memory 20g \
      --executor-memory 60g \
      --executor-cores 24 \
      --queue default \
      --conf spark.driver.maxResultSize=0 \
      ./SystemML.jar \
      -f test.dml -stats -exec hybrid_spark -nvargs ...

Everything else is similar to the Hadoop invocation. We also provide a script that simplifies this configuration: https://github.com/apache/incubator-systemml/blob/master/scripts/sparkDML.sh. Keep in mind that if you want to run in yarn-cluster, you should put the DML script and potentially SystemML-config into HDFS too.

Regards,
Matthias

From: Wenjie Zhuang <ka...@vt.edu>
To: dev@systemml.incubator.apache.org
Cc: Matthias Boehm/Almaden/IBM@IBMUS
Date: 04/02/2016 07:50 PM
Subject: Re: Guides about running SystemML by spark cluster

Hi, I try to run StepLinearRegDS.dml by spark yarn mode today. And I get the following result. Is it correct? Thanks.

    BEGIN STEPWISE LINEAR REGRESSION SCRIPT
    Reading X and Y...
    Best AIC without any features: 4123.134539784949
    Best AIC 4068.2916533784332 achieved with feature: 22
    Running linear regression with selected features...
    Computing the statistics...
    Writing the output matrix...

On Sat, Apr 2, 2016 at 8:37 AM, Wenjie Zhuang <ka...@vt.edu> wrote:

Hi, I am now trying to run experiments about SystemML on spark cluster. Could you please share some guides about how to run StepLinearRegDS.dml by spark cluster?
The official guide I found is mostly about Hadoop. Thanks & Have a good weekend!