Thanks for your patience with regard to those issues.

(1) Issues: could you please provide the causing exception (likely at the end of the stack trace, or the first exception you see)?

(2) Parameters: we support two different ways of passing input arguments: named parameters (to be used with -nvargs) and positional parameters (to be used with -args). So if you invoke the script with "-args foo bar", $1 and $2 refer to foo and bar, respectively.
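For example, the hadoop-style invocations below sketch both styles (a hedged sketch: the script names follow this thread, but all argument values and paths are hypothetical placeholders):

    # Positional parameters: values are bound to $1, $2, ... in order.
    hadoop jar SystemML.jar -f test.dml -args foo bar

    # Named parameters: values are bound by name.
    hadoop jar SystemML.jar -f test.dml -nvargs X=/tmp/X.csv Y=/tmp/Y.csv

    # For genRandData4LinearRegression.dml, $5-$7 are simply the 5th-7th values
    # after -args; <arg1>..<arg4> stand in for the script's earlier positional
    # parameters (see the comment header of the script itself):
    hadoop jar SystemML.jar -f genRandData4LinearRegression.dml \
        -args <arg1> <arg2> <arg3> <arg4> /tmp/weights /tmp/data /tmp/labels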
Regards,
Matthias

From: Wenjie Zhuang <[email protected]>
To: Matthias Boehm/Almaden/IBM@IBMUS
Cc: [email protected]
Date: 04/05/2016 09:13 AM
Subject: Re: Guides about running SystemML by spark cluster

Hi, Matthias,

I ran genLinearRegressionData.dml and it reports the following error:

Caused by: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:804)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:804)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1658)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1581)
    at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1731)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1730)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:147)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:902)
    at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:872)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils.binaryBlockToCsv(RDDConverterUtils.java:158)
    at org.apache.sysml.runtime.instructions.spark.WriteSPInstruction.processInstruction(WriteSPInstruction.java:187)
    at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:308)
    ... 15 more
And I also ran genRandData4LinearRegression.dml. Could you please tell us how to pass parameters when we use that script, especially these:

# $5 is location to store generated weights
# $6 is location to store generated data
# $7 is location to store generated labels

Besides, after I execute ./bin/systemml ./scripts/datagen/genLinearRegressionData.dml, it seems the data is generated, but it also reports the error below:

Failed to run SystemML. Exit code: 1
java -Xmx8g -Xms4g -Xmn1g

Thanks & have a good day!

Sincerely

On Mon, Apr 4, 2016 at 1:24 PM, Matthias Boehm <[email protected]> wrote:

There are no practically relevant size restrictions. Also, if there are issues, please share some more information on them. Thanks.

Regards,
Matthias

From: Wenjie Zhuang <[email protected]>
To: Matthias Boehm/Almaden/IBM@IBMUS
Cc: [email protected]
Date: 04/04/2016 04:37 AM
Subject: Re: Guides about running SystemML by spark cluster

Hi, Matthias,

Thanks again. I used genLinearRegressionData.dml yesterday. However, when I set the number of samples to 60G, it reports an error. Do you know the maximum input size that SystemML allows?

Besides, I also tried to run DML in standalone mode. But when I use ./runStandaloneSystemML.sh, it shows the error: "Could not find or load main class org.apache.sysml.api.DMLScript". I downloaded SystemML from GitHub and rebuilt it with mvn after your update.
https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/StepLinearRegDS.dml

Have a good week!

On Sun, Apr 3, 2016 at 9:28 AM, Wenjie Zhuang <[email protected]> wrote:

Thanks a lot. I also have some other questions. Could you please help me figure them out?

1. If I want the input size to be 30G, how can I set it? I guess I should change the parameters X, Y and B. But I'm not sure which script I can use.
2. Do you know how to control the partition number when I run StepLinearRegDS.dml on Spark? Is there a configuration file where I can set the partition number?
3. What should the correct result be after running StepLinearRegDS.dml? When the program ends, what can we get?

Thanks & have a nice day!

On Apr 3, 2016, at 1:08 AM, "Matthias Boehm" <[email protected]> wrote:

Thanks again for catching https://issues.apache.org/jira/browse/SYSTEMML-609; yes, the change is in SystemML head now, so please rebuild SystemML or use one of our nightly builds (https://sparktc.ibmcloud.com/repo/latest/). Thanks.

For running SystemML on Spark, you have multiple options (http://apache.github.io/incubator-systemml/#running-systemml): either use MLContext or spark-submit. Since our documentation does not show many examples for spark-submit yet, here is a typical command-line invocation:

../spark/bin/spark-submit \
    --class org.apache.sysml.api.DMLScript \
    --master yarn-client \
    --num-executors 10 \
    --driver-memory 20g \
    --executor-memory 60g \
    --executor-cores 24 \
    --queue default \
    --conf spark.driver.maxResultSize=0 \
    ./SystemML.jar \
    -f test.dml -stats -exec hybrid_spark -nvargs ...

Everything else is similar to the hadoop invocation. We also provide a script that simplifies this configuration: https://github.com/apache/incubator-systemml/blob/master/scripts/sparkDML.sh. Keep in mind that if you want to run in yarn-cluster mode, you should put the DML script, and potentially the SystemML-config file, into HDFS too.
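For instance, staging those files could look like this (a minimal sketch using standard HDFS shell commands; the destination directory and file names are hypothetical placeholders):

    # Create a working directory in HDFS and copy the script and config there.
    hadoop fs -mkdir -p /user/me/systemml
    hadoop fs -put test.dml SystemML-config.xml /user/me/systemml/

With that in place, the -f option of spark-submit would reference the HDFS path of the script instead of a local one.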
Regards,
Matthias

From: Wenjie Zhuang <[email protected]>
To: [email protected]
Cc: Matthias Boehm/Almaden/IBM@IBMUS
Date: 04/02/2016 07:50 PM
Subject: Re: Guides about running SystemML by spark cluster

Hi,

I tried to run StepLinearRegDS.dml in spark yarn mode today, and I get the following result. Is it correct? Thanks.

BEGIN STEPWISE LINEAR REGRESSION SCRIPT
Reading X and Y...
Best AIC without any features: 4123.134539784949
Best AIC 4068.2916533784332 achieved with feature: 22
Running linear regression with selected features...
Computing the statistics...
Writing the output matrix...

On Sat, Apr 2, 2016 at 8:37 AM, Wenjie Zhuang <[email protected]> wrote:

Hi,

I am now trying to run experiments with SystemML on a Spark cluster. Could you please share some guides about how to run StepLinearRegDS.dml on a Spark cluster? The official guide I found is mostly about Hadoop.

Thanks & have a good weekend!
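For reference, putting together the spark-submit template and the parameter conventions from earlier in this thread, a hypothetical end-to-end invocation for this script might look as follows (X, Y, and B are the input/output names mentioned in the thread; the script's own comment header documents the full set of named arguments; all paths and resource settings below are placeholders):

    # Hedged sketch: submit StepLinearRegDS.dml in hybrid Spark mode,
    # binding inputs X and Y and the output coefficients B by name.
    ../spark/bin/spark-submit \
        --class org.apache.sysml.api.DMLScript \
        --master yarn-client \
        --driver-memory 20g \
        --executor-memory 60g \
        ./SystemML.jar \
        -f ./scripts/algorithms/StepLinearRegDS.dml -stats -exec hybrid_spark \
        -nvargs X=/tmp/X.csv Y=/tmp/y.csv B=/tmp/betas.csv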
