[ https://issues.apache.org/jira/browse/SYSTEMML-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871170#comment-15871170 ]
Niketan Pansare edited comment on SYSTEMML-1238 at 2/17/17 5:36 AM:
--------------------------------------------------------------------

I am able to reproduce this behavior (not yet sure it is a bug) from the command line as well. Here is the output of GLM-predict (after running LinearRegDS):

{code}
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}

Here is the output of the Python mllearn API:

{code}
>>> import numpy as np
>>> from pyspark.context import SparkContext
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import HashingTF, Tokenizer
>>> from pyspark.sql import SparkSession
>>> from sklearn import datasets, metrics, neighbors
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from systemml.mllearn import LinearRegression, LogisticRegression, NaiveBayes, SVM
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> sparkSession = SparkSession.builder.getOrCreate()
>>> regr = LinearRegression(sparkSession, solver="direct-solve")
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Welcome to Apache SystemML!
17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered persistent write of variable 'X' (line 87).
17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered persistent write of variable 'y' (line 88).
BEGIN LINEAR REGRESSION SCRIPT
Reading X and Y...
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y,153.36255924170615
STDEV_TOT_Y,77.21853383600028
AVG_RES_Y,4.8020565933360324E-14
STDEV_RES_Y,67.06389890324985
DISPERSION,4497.566536105316
PLAIN_R2,0.24750834362605834
ADJUSTED_R2,0.24571669682516795
PLAIN_R2_NOBIAS,0.24750834362605834
ADJUSTED_R2_NOBIAS,0.24571669682516795
Writing the output matrix...
END LINEAR REGRESSION SCRIPT
lr
>>> regr.predict(diabetes_X_test)
17/02/16 22:39:35 WARN Expression: WARNING: null -- line 149, column 4 -- Read input file does not exist on FS (local mode):
17/02/16 22:39:35 WARN Expression: Metadata file: .mtd not provided
array([[ 188.84521284],
       [ 134.98127765],
       [ 158.20701117],
       [ 134.4871131 ],
       [ 137.45210036],
       [ 155.73618846],
       [ 193.78685827],
       [ 137.94626491],
       [ 127.07464496],
       [ 138.93459399],
       [ 178.46775744],
       [ 125.59215133],
       [ 153.75953028],
       [ 142.39374579],
       [ 119.16801227],
       [ 162.16032752],
       [ 145.8528976 ],
       [ 145.8528976 ],
       [ 171.05528929],
       [ 119.66217681]])
{code}
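For reference, the expected coefficients and predictions for this split can be computed with plain scikit-learn (a minimal sketch, separate from the session above; the expected beta values come from the issue description below):

{code}
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Same single-feature (bmi) split as in the session above.
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train = diabetes.target[:-20]

# Ordinary least squares, i.e. the normal-equation solution.
regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.intercept_)      # expected ~152.919
print(regr.coef_)           # expected ~938.237
print(regr.predict(X_test)) # reference predictions to compare against
{code}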
To reproduce the command-line output, first dump the data to CSV:

{code}
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# tofile with sep="\n" writes one value per line, i.e. a single-column CSV.
diabetes_X_test.tofile('X_test.csv', sep="\n")
diabetes_X.tofile('X.csv', sep="\n")
diabetes.target.tofile('y.csv', sep="\n")
{code}

Then execute the following commands (you may have to edit the DML scripts to add the input format, or create metadata files):

{code}
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
{code}
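If you go the metadata route, here is a minimal sketch of an {{X.csv.mtd}} file (the row/column counts are assumptions based on this particular dump: the full diabetes feature matrix written to {{X.csv}} has 442 rows and 1 column):

{code}
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 442,
    "cols": 1,
    "format": "csv",
    "header": false,
    "sep": ","
}
{code}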
I also tested using SystemML 0.12.0 and got the same command-line predictions:

{code}
$ ~/spark-1.6.1-bin-hadoop2.6/bin/spark-submit systemml-0.12.0-incubating.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
$ ~/spark-1.6.1-bin-hadoop2.6/bin/spark-submit systemml-0.12.0-incubating.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}

And here is the output of the 0.12.0 mllearn API:

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> import numpy as np
>>> from sklearn import datasets
>>> from systemml.mllearn import LinearRegression
>>> from pyspark.sql import SQLContext
>>> # Load the diabetes dataset
... diabetes = datasets.load_diabetes()
>>> # Use only one feature
... diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> # Split the data into training/testing sets
... diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> # Split the targets into training/testing sets
... diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> # Create linear regression object
... regr = LinearRegression(sqlCtx, solver='direct-solve')
17/02/16 23:34:34 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-d7505265-08a5-41f5-804c-484e1de7881e/httpd-c0b01ae9-c212-4373-ab09-cc0390bcd1dd
17/02/16 23:34:34 INFO spark.HttpServer: Starting HTTP Server
17/02/16 23:34:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/02/16 23:34:34 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:46590
17/02/16 23:34:34 INFO util.Utils: Successfully started service 'HTTP file server' on port 46590.
17/02/16 23:34:34 INFO spark.SparkContext: Added JAR /home/biuser/anaconda2/lib/python2.7/site-packages/systemml/systemml-java/systemml-0.12.0-incubating.jar at http://localhost:46590/jars/systemml-0.12.0-incubating.jar with timestamp 1487309674061
>>> # Train the model using the training sets
... regr.fit(diabetes_X_train, diabetes_y_train)
Welcome to Apache SystemML!
BEGIN LINEAR REGRESSION SCRIPT
Reading X and Y...
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y,153.36255924170615
STDEV_TOT_Y,77.21853383600028
AVG_RES_Y,3.633533705616816E-14
STDEV_RES_Y,63.038506337610244
DISPERSION,3973.853281276927
PLAIN_R2,0.3351312506863875
ADJUSTED_R2,0.33354822985468835
PLAIN_R2_NOBIAS,0.3351312506863875
ADJUSTED_R2_NOBIAS,0.33354822985468835
Writing the output matrix...
END LINEAR REGRESSION SCRIPT
lr
>>> regr.predict(diabetes_X_test)
array([[ 225.97316413],
       [ 115.7476731 ],
       [ 163.27609584],
       [ 114.73643007],
       [ 120.80388829],
       [ 158.21988065],
       [ 236.08559449],
       [ 121.81513133],
       [  99.56778451],
       [ 123.8376174 ],
       [ 204.73706035],
       [  96.5340554 ],
       [ 154.17490851],
       [ 130.91631866],
       [  83.38789592],
       [ 171.36604013],
       [ 137.99501992],
       [ 137.99501992],
       [ 189.5684148 ],
       [  84.39913896]])
{code}
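As a quick consistency check (a sketch; the beta values are the ones quoted in the issue description below), each mllearn prediction vector is reproduced by one of the two beta estimates:

{code}
import numpy as np
from sklearn import datasets

# First held-out value of the bmi feature used throughout this report.
diabetes = datasets.load_diabetes()
x0 = diabetes.data[-20, 2]

# Betas quoted in the issue description below.
beta_012 = (152.919, 938.237)  # SystemML 0.12 (normal-equation / sklearn result)
beta_013 = (153.146, 458.489)  # SystemML 0.13 with Spark 2.1.0

# Should print ~225.97, the first 0.12.0 mllearn prediction above.
print(beta_012[0] + beta_012[1] * x0)
# Should print ~188.85, the first Spark 2.1.0 mllearn prediction above.
print(beta_013[0] + beta_013[1] * x0)
{code}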
> Python test failing for LinearRegCG
> -----------------------------------
>
>                 Key: SYSTEMML-1238
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1238
>             Project: SystemML
>          Issue Type: Bug
>          Components: Algorithms, APIs
>    Affects Versions: SystemML 0.13
>            Reporter: Imran Younus
>            Assignee: Niketan Pansare
>         Attachments: python_LinearReg_test_spark.1.6.log, python_LinearReg_test_spark.2.1.log
>
> [~deron] discovered that one of the Python tests ({{test_mllearn_df.py}}) with Spark 2.1.0 was failing because the test score from linear regression was very low ({{~ 0.24}}). I did some investigation, and it turns out that the model parameters computed by the DML script are incorrect. In SystemML 0.12, the values of the betas from the linear regression model are {{\[152.919, 938.237\]}}, which is what we expect from the normal equation (I also tested this with sklearn). But the values of the betas from SystemML 0.13 (with Spark 2.1.0) come out to be {{\[153.146, 458.489\]}}. These are not correct, and therefore the test score is much lower than expected. The data going into the DML script is correct; I printed out the values of {{X}} and {{Y}} in DML and did not see any issue there.
> Attached are the log files for the two tests (SystemML 0.12 and 0.13), run with the explain flag.
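For reference, the expected betas in the description can be checked directly against the normal equation (a minimal NumPy sketch, kept Python 2.7 compatible to match the sessions above):

{code}
import numpy as np
from sklearn import datasets

# Training rows of the single bmi feature, as in the failing test.
diabetes = datasets.load_diabetes()
X = diabetes.data[:-20, 2:3]
y = diabetes.target[:-20]

# Normal equation with an explicit intercept column:
#   beta = (A^T A)^{-1} A^T y
A = np.hstack([np.ones((X.shape[0], 1)), X])
beta = np.linalg.solve(A.T.dot(A), A.T.dot(y))
print(beta)  # expected ~[152.919, 938.237]
{code}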