[ https://issues.apache.org/jira/browse/SYSTEMML-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871170#comment-15871170 ]

Niketan Pansare edited comment on SYSTEMML-1238 at 2/17/17 5:36 AM:
--------------------------------------------------------------------

I am able to reproduce this behavior (not yet sure it is a bug) from the command line as well.
Here is the output of GLM-predict (after running LinearRegDS):
{code}
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}

Here is the output of Python mllearn:
{code}
>>> import numpy as np
>>> from pyspark.context import SparkContext
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import HashingTF, Tokenizer
>>> from pyspark.sql import SparkSession
>>> from sklearn import datasets, metrics, neighbors
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> from systemml.mllearn import LinearRegression, LogisticRegression, NaiveBayes, SVM
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> sparkSession = SparkSession.builder.getOrCreate()
>>> regr = LinearRegression(sparkSession, solver="direct-solve")
>>> regr.fit(diabetes_X_train, diabetes_y_train)

Welcome to Apache SystemML!

17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered persistent write of variable 'X' (line 87).
17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered persistent write of variable 'y' (line 88).
BEGIN LINEAR REGRESSION SCRIPT
Reading X and Y...
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y,153.36255924170615
STDEV_TOT_Y,77.21853383600028
AVG_RES_Y,4.8020565933360324E-14
STDEV_RES_Y,67.06389890324985
DISPERSION,4497.566536105316
PLAIN_R2,0.24750834362605834
ADJUSTED_R2,0.24571669682516795
PLAIN_R2_NOBIAS,0.24750834362605834
ADJUSTED_R2_NOBIAS,0.24571669682516795
Writing the output matrix...
END LINEAR REGRESSION SCRIPT
lr
>>> regr.predict(diabetes_X_test)
17/02/16 22:39:35 WARN Expression: WARNING: null -- line 149, column 4 -- Read input file does not exist on FS (local mode):
17/02/16 22:39:35 WARN Expression: Metadata file:  .mtd not provided
array([[ 188.84521284],
       [ 134.98127765],
       [ 158.20701117],
       [ 134.4871131 ],
       [ 137.45210036],
       [ 155.73618846],
       [ 193.78685827],
       [ 137.94626491],
       [ 127.07464496],
       [ 138.93459399],
       [ 178.46775744],
       [ 125.59215133],
       [ 153.75953028],
       [ 142.39374579],
       [ 119.16801227],
       [ 162.16032752],
       [ 145.8528976 ],
       [ 145.8528976 ],
       [ 171.05528929],
       [ 119.66217681]])
{code}
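
As a quick sanity check (not part of the original runs), the same split can be fit with scikit-learn; per the issue description below, the expected betas are roughly {{\[152.919, 938.237\]}}. A minimal sketch:
{code}
import numpy as np
from sklearn import datasets, linear_model

# Identical train/test split as above.
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

ref = linear_model.LinearRegression()
ref.fit(X_train, y_train)
# Expect an intercept of ~152.919 and a slope of ~938.237.
print(ref.intercept_, ref.coef_)
print(ref.score(X_test, y_test))  # test R^2 of the reference model
{code}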

To reproduce the command-line output, please dump the test data to CSV:
{code}
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
diabetes_X_test.tofile('X_test.csv', sep="\n")
diabetes_X.tofile('X.csv', sep="\n")
diabetes.target.tofile('y.csv', sep="\n")
{code}

Then execute the following commands (you may have to edit the DML script to add the format, or create metadata files; see the sketch after the commands):
{code}
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
{code}
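
If metadata files are needed, a minimal {{X.mtd}} along these lines should work (a sketch assuming the 442x1 {{X.csv}} written by the dump script above; adjust {{rows}} and {{cols}} for {{X_test.csv}} and {{y.csv}}):
{code}
{"data_type": "matrix", "value_type": "double", "rows": 442, "cols": 1, "format": "csv", "header": false, "sep": ","}
{code}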

I also tested using SystemML 0.12.0 and got the same predictions:
{code}
$ ~/spark-1.6.1-bin-hadoop2.6/bin/spark-submit systemml-0.12.0-incubating.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
$ ~/spark-1.6.1-bin-hadoop2.6/bin/spark-submit systemml-0.12.0-incubating.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}

And here is the output of mllearn on SystemML 0.12.0:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> import numpy as np
>>> from sklearn import datasets
>>> from systemml.mllearn import LinearRegression
>>> from pyspark.sql import SQLContext
>>> # Load the diabetes dataset
... diabetes = datasets.load_diabetes()
>>> # Use only one feature
... diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> # Split the data into training/testing sets
... diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> # Split the targets into training/testing sets
... diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> # Create linear regression object
... regr = LinearRegression(sqlCtx, solver='direct-solve')
17/02/16 23:34:34 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-d7505265-08a5-41f5-804c-484e1de7881e/httpd-c0b01ae9-c212-4373-ab09-cc0390bcd1dd
17/02/16 23:34:34 INFO spark.HttpServer: Starting HTTP Server
17/02/16 23:34:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/02/16 23:34:34 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:46590
17/02/16 23:34:34 INFO util.Utils: Successfully started service 'HTTP file server' on port 46590.
17/02/16 23:34:34 INFO spark.SparkContext: Added JAR /home/biuser/anaconda2/lib/python2.7/site-packages/systemml/systemml-java/systemml-0.12.0-incubating.jar at http://localhost:46590/jars/systemml-0.12.0-incubating.jar with timestamp 1487309674061
>>> # Train the model using the training sets
... regr.fit(diabetes_X_train, diabetes_y_train)

Welcome to Apache SystemML!

BEGIN LINEAR REGRESSION SCRIPT
Reading X and Y...
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y,153.36255924170615
STDEV_TOT_Y,77.21853383600028
AVG_RES_Y,3.633533705616816E-14
STDEV_RES_Y,63.038506337610244
DISPERSION,3973.853281276927
PLAIN_R2,0.3351312506863875
ADJUSTED_R2,0.33354822985468835
PLAIN_R2_NOBIAS,0.3351312506863875
ADJUSTED_R2_NOBIAS,0.33354822985468835
Writing the output matrix...
END LINEAR REGRESSION SCRIPT
lr
>>> regr.predict(diabetes_X_test)
array([[ 225.97316413],
       [ 115.7476731 ],
       [ 163.27609584],
       [ 114.73643007],
       [ 120.80388829],
       [ 158.21988065],
       [ 236.08559449],
       [ 121.81513133],
       [  99.56778451],
       [ 123.8376174 ],
       [ 204.73706035],
       [  96.5340554 ],
       [ 154.17490851],
       [ 130.91631866],
       [  83.38789592],
       [ 171.36604013],
       [ 137.99501992],
       [ 137.99501992],
       [ 189.5684148 ],
       [  84.39913896]])
{code}
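
To see how much the two versions differ on the held-out targets, the test predictions can be reconstructed from the betas quoted in the issue description below and scored (a sketch using the reported values {{\[152.919, 938.237\]}} for 0.12 and {{\[153.146, 458.489\]}} for 0.13; the latter corresponds to the low ~0.24 test score reported):
{code}
import numpy as np
from sklearn import datasets
from sklearn.metrics import r2_score

# Same held-out split as above.
diabetes = datasets.load_diabetes()
X_test = diabetes.data[:, np.newaxis, 2][-20:, 0]
y_test = diabetes.target[-20:]

# (intercept, slope) pairs as reported in the issue description.
for label, (icpt, beta) in [('0.12', (152.919, 938.237)),
                            ('0.13', (153.146, 458.489))]:
    preds = icpt + beta * X_test
    print(label, r2_score(y_test, preds))
{code}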



> Python test failing for LinearRegCG
> -----------------------------------
>
>                 Key: SYSTEMML-1238
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1238
>             Project: SystemML
>          Issue Type: Bug
>          Components: Algorithms, APIs
>    Affects Versions: SystemML 0.13
>            Reporter: Imran Younus
>            Assignee: Niketan Pansare
>         Attachments: python_LinearReg_test_spark.1.6.log, 
> python_LinearReg_test_spark.2.1.log
>
>
> [~deron] discovered that one of the Python tests ({{test_mllearn_df.py}}) 
> with Spark 2.1.0 was failing because the test score from linear regression 
> was very low ({{~ 0.24}}). I did some investigation and it turns out that 
> the model parameters computed by the DML script are incorrect. In SystemML 
> 0.12, the values of the betas from the linear regression model are 
> {{\[152.919, 938.237\]}}. This is what we expect from the normal equation 
> (I also tested this with sklearn). But the values of the betas from 
> SystemML 0.13 (with Spark 2.1.0) come out to be {{\[153.146, 458.489\]}}. 
> These are not correct and therefore the test score is much lower than 
> expected. The data going into the DML script is correct. I printed out the 
> values of {{X}} and {{Y}} in DML and didn't see any issue there.
> Attached are the log files for the two different tests (SystemML 0.12 and 
> 0.13) with the explain flag.


