RE: Spark 1.0 and Logistic Regression Python Example
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade numpy/Python for a permanent fix. My current versions of Python and numpy are 2.6 and 4.1.9 respectively.

Thanks,
Sam

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, July 01, 2014 12:14 AM
To: user@spark.apache.org
Subject: Re: Spark 1.0 and Logistic Regression Python Example

You were using an old version of numpy, 1.4? I think this is fixed in the latest master. Try to replace vec.dot(target) with numpy.dot(vec, target), or use the latest master.

-Xiangrui

On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote:
> [original post snipped; see below]
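The workaround Xiangrui suggests can be checked outside Spark. A minimal sketch (the vector values here are made up) showing why the module-level numpy.dot is the portable choice:

```python
import numpy as np

# Hypothetical vectors standing in for the model coefficients and a feature row.
vec = np.array([1.0, 2.0, 3.0])
target = np.array([4.0, 5.0, 6.0])

# numpy.dot(vec, target) exists in every NumPy release, while the
# ndarray.dot *method* is missing in old releases such as 1.4.x --
# there, vec.dot(target) raises the AttributeError from the traceback.
margin = np.dot(vec, target)  # 1*4 + 2*5 + 3*6 = 32.0
```

This is why upgrading NumPy also fixes the problem: on newer releases the two spellings are equivalent.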
Spark 1.0 and Logistic Regression Python Example
Hi,

I modified the example code for logistic regression to compute the error in classification. Please see below. However, the code fails when it makes a call to:

    labelsAndPreds.filter(lambda (v, p): v != p).count()

with the following error message (something related to numpy or the dot product):

    File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py", line 65, in predict
        margin = _dot(x, self._coeff) + self._intercept
    File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py", line 443, in _dot
        return vec.dot(target)
    AttributeError: 'numpy.ndarray' object has no attribute 'dot'

FYI, I am running the code using spark-submit, i.e.:

    ./bin/spark-submit examples/src/main/python/mllib/logistic_regression2.py

The code is posted below if it will be useful in any way:

    from math import exp
    import sys
    import time

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint
    from numpy import array

    # Load and parse the data
    def parsePoint(line):
        values = [float(x) for x in line.split(',')]
        if values[0] == -1:    # Convert -1 labels to 0 for MLlib
            values[0] = 0
        return LabeledPoint(values[0], values[1:])

    sc = SparkContext(appName="PythonLR")

    # start timing
    start = time.time()
    #start = time.clock()

    data = sc.textFile("sWAMSpark_train.csv")
    parsedData = data.map(parsePoint)

    # Build the model
    model = LogisticRegressionWithSGD.train(parsedData)

    # load test data
    testdata = sc.textFile("sWSpark_test.csv")
    parsedTestData = testdata.map(parsePoint)

    # Evaluate the model on test data
    labelsAndPreds = parsedTestData.map(lambda p: (p.label, model.predict(p.features)))
    # fraction of test points mispredicted (divide by the test-set count)
    trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedTestData.count())
    print("Training Error = " + str(trainErr))

    end = time.time()
    print("Time is = " + str(end - start))
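The label handling in parsePoint can be exercised without a SparkContext. A minimal sketch with made-up CSV rows, returning plain tuples instead of LabeledPoint so it runs without pyspark:

```python
# Made-up rows in the "label,feature1,feature2" shape the post parses.
lines = ["-1,0.5,1.2", "1,2.0,0.3", "-1,0.1,0.9"]

def parse_point(line):
    values = [float(x) for x in line.split(',')]
    if values[0] == -1:  # MLlib's binary classifiers expect labels in {0, 1}
        values[0] = 0
    return values[0], values[1:]

labels = [parse_point(line)[0] for line in lines]
# -1 labels are remapped to 0; labels == [0, 1.0, 0]
```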
Re: Spark 1.0 and Logistic Regression Python Example
You were using an old version of numpy, 1.4? I think this is fixed in the latest master. Try to replace vec.dot(target) with numpy.dot(vec, target), or use the latest master.

-Xiangrui

On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote:
> [original post snipped; see above]
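The error-rate expression at the end of the posted code maps directly onto plain Python. A small sketch with invented (true label, prediction) pairs standing in for the labelsAndPreds RDD:

```python
# Invented (true label, prediction) pairs mimicking labelsAndPreds.
labels_and_preds = [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)]

# Same computation as the RDD version:
#   filter(lambda (v, p): v != p).count() / float(count())
err = sum(1 for v, p in labels_and_preds if v != p) / float(len(labels_and_preds))
# 2 of the 4 pairs disagree, so err == 0.5
```

Note the float(...) cast matters on Python 2, where / between two ints is integer division.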