Re: Guides about running SystemML by spark cluster

Matthias Boehm Sun, 03 Apr 2016 22:36:58 -0700

(1) If you want to experiment with different data sizes, I would recommend
using our data generator for linear regression:
https://github.com/apache/incubator-systemml/blob/master/scripts/datagen/genRandData4LinearRegression.dml
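
To pick generator dimensions for a target input size such as 30 GB, a rough
back-of-the-envelope sketch follows. It assumes a dense matrix stored as
8-byte doubles and ignores any file-format overhead; the helper name and the
choice of 1000 features are hypothetical, not taken from the generator script.

```python
# Rough sizing sketch: how many rows give a dense matrix of a target size?
# Assumption: each cell is an 8-byte double and format overhead is ignored.

BYTES_PER_DOUBLE = 8
GB = 1024 ** 3


def rows_for_dense_size(target_bytes, num_features):
    """Return the number of rows whose dense size fits within target_bytes."""
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    return target_bytes // (BYTES_PER_DOUBLE * num_features)


if __name__ == "__main__":
    # Hypothetical choice: 1000 features for a 30 GB feature matrix X.
    rows = rows_for_dense_size(30 * GB, 1000)
    print("rows:", rows, "cols:", 1000)
```

The resulting row and column counts would then be passed to the generator as
its sample and feature dimensions. A sparse matrix of the same dimensions
would occupy less space.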


(2) Well, SystemML does not expose these physical data properties to the
user (on purpose, to ensure data independence). However, if you are curious,
SystemML does a coalesce on checkpoints to reduce the number of partitions
to <data size> / <HDFS block size>, but only if this does not reduce the
effective degree of parallelism (e.g., if you have only 8GB data, 128MB
HDFS block size, 100 cores, and currently 90 partitions, we would not
reduce this to 8GB/128MB=64 partitions because 64<100).
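
The rule in (2) can be read as the following sketch. The function name and
the exact comparisons are assumptions based on the description above; this is
not SystemML's actual code.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def partitions_after_checkpoint(data_bytes, block_bytes, cores, current_partitions):
    """Sketch of the coalesce rule described above (assumed reading).

    The target is <data size> / <HDFS block size>, rounded up. The coalesce
    is applied only if it actually lowers the partition count and the target
    still provides at least one partition per core.
    """
    target = math.ceil(data_bytes / block_bytes)
    if target < current_partitions and target >= cores:
        return target
    return current_partitions


if __name__ == "__main__":
    # Example from the text: 8GB data, 128MB blocks, 100 cores, 90 partitions.
    # The target is 64, which is below 100 cores, so nothing is coalesced.
    print(partitions_after_checkpoint(8 * GB, 128 * MB, 100, 90))
```

With a larger input, for instance 80GB under the same settings and 2000
current partitions, the target of 640 partitions still exceeds the 100 cores,
so under this reading the coalesce would be applied.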

(3) The whole point of running stepwise linear regression is feature
selection, so you'll get the selected features and the estimated model
parameters of these features, as well as some information on the selection
process. You can evaluate this model on a hold-out test set or run some
form of cross-validation. However, keep in mind that for accuracy
experiments, you might want to be very careful with random data.

Regards,
Matthias



From:   Wenjie Zhuang <[email protected]>
To:     Matthias Boehm/Almaden/IBM@IBMUS
Cc:     [email protected]
Date:   04/03/2016 06:29 AM
Subject:        Re: Guides about running SystemML by spark cluster



Thanks a lot. I also have some other questions. Could you please help me
figure them out?


1. If I want the input size to be 30 GB, how can I set it? I guess I should
change the parameters X, Y, and B, but I'm not sure which script to use.


2. Do you know how to control the number of partitions when I run
StepLinearRegDS.dml on Spark? Is there a configuration file where I can set
the number of partitions?


3. What should the correct result be after running StepLinearRegDS.dml?
When the program ends, what output do we get?


Thanks & Have a nice day!


On April 3, 2016, 1:08 AM, "Matthias Boehm" <[email protected]> wrote:
  Thanks again for catching
  https://issues.apache.org/jira/browse/SYSTEMML-609. Yes, the change is in
  SystemML head now, so please rebuild SystemML or use one of our nightly
  builds (https://sparktc.ibmcloud.com/repo/latest/). Thanks.

  For running SystemML on Spark, you have multiple options
  (http://apache.github.io/incubator-systemml/#running-systemml): either use
  MLContext or spark-submit. Since our documentation does not show many
  examples for spark-submit yet, here is a typical command line invocation:


  ../spark/bin/spark-submit \
  --class org.apache.sysml.api.DMLScript \
  --master yarn-client \
  --num-executors 10 \
  --driver-memory 20g \
  --executor-memory 60g \
  --executor-cores 24 \
  --queue default \
  --conf spark.driver.maxResultSize=0 \
  ./SystemML.jar \
  -f test.dml -stats -exec hybrid_spark -nvargs ...

  Everything else is similar to the hadoop invocation. We also provide a
  script that simplifies this configuration:
  https://github.com/apache/incubator-systemml/blob/master/scripts/sparkDML.sh
  Keep in mind that if you want to run in yarn-cluster mode, you should put
  the DML script and potentially SystemML-config into HDFS too.

  Regards,
  Matthias



  From: Wenjie Zhuang <[email protected]>
  To: [email protected]
  Cc: Matthias Boehm/Almaden/IBM@IBMUS
  Date: 04/02/2016 07:50 PM
  Subject: Re: Guides about running SystemML by spark cluster



  Hi,

  I tried to run StepLinearRegDS.dml in Spark YARN mode today and got the
  following result. Is it correct?

  Thanks.

  BEGIN STEPWISE LINEAR REGRESSION SCRIPT
  Reading X and Y...
  Best AIC without any features: 4123.134539784949
  Best AIC 4068.2916533784332 achieved with feature: 22
  Running linear regression with selected features...
  Computing the statistics...
  Writing the output matrix...



  On Sat, Apr 2, 2016 at 8:37 AM, Wenjie Zhuang <[email protected]> wrote:
        Hi,

        I am now trying to run experiments with SystemML on a Spark cluster.
        Could you please share some guides on how to run
        StepLinearRegDS.dml on a Spark cluster? The official guide I found is
        mostly about Hadoop.

        Thanks & Have a good weekend!
