RE: Spark 1.0 and Logistic Regression Python Example

2014-07-01 Thread Sam Jacobs
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade
numpy/python for a permanent fix. My current versions of python and numpy
are 2.6 and 4.1.9, respectively.

Thanks,

Sam  

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Tuesday, July 01, 2014 12:14 AM
To: user@spark.apache.org
Subject: Re: Spark 1.0 and Logistic Regression Python Example

You were using an old version of numpy, 1.4? I think this is fixed in the 
latest master. Try to replace vec.dot(target) by numpy.dot(vec, target), or use 
the latest master. -Xiangrui
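
Concretely, the suggested change amounts to rewriting _dot in
pyspark/mllib/_common.py along these lines (a minimal sketch covering only
the dense ndarray case; the real helper presumably handles other vector
types as well):

import numpy

def _dot(vec, target):
    # numpy.dot works as a plain function even on old NumPy releases;
    # vec.dot(target) relies on the ndarray.dot method, which NumPy 1.4
    # does not have yet.
    return numpy.dot(vec, target)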


Spark 1.0 and Logistic Regression Python Example

2014-06-30 Thread Sam Jacobs
Hi,


I modified the example code for logistic regression to compute the
classification error; please see below. However, the code fails when it makes
a call to:


labelsAndPreds.filter(lambda (v, p): v != p).count()


with the error message (something related to numpy or dot product):


  File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py", line 65, in predict
    margin = _dot(x, self._coeff) + self._intercept
  File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py", line 443, in _dot
    return vec.dot(target)
AttributeError: 'numpy.ndarray' object has no attribute 'dot'
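
For what it is worth, the failure reproduces without Spark (a minimal
sketch; the ndarray.dot method only appeared around NumPy 1.5, so NumPy 1.4
raises the same AttributeError):

import numpy

vec = numpy.array([1.0, 2.0, 3.0])
target = numpy.array([4.0, 5.0, 6.0])

# Method form: AttributeError on NumPy 1.4, where ndarray has no .dot method.
# vec.dot(target)

# Function form: works on old and new NumPy alike.
print(numpy.dot(vec, target))   # prints 32.0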


FYI, I am running the code using spark-submit, i.e.


./bin/spark-submit examples/src/main/python/mllib/logistic_regression2.py



The code is posted below in case it is useful:


from math import exp

import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array


# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    if values[0] == -1:   # Convert -1 labels to 0 for MLlib
        values[0] = 0
    return LabeledPoint(values[0], values[1:])

sc = SparkContext(appName="PythonLR")

# start timing
start = time.time()
#start = time.clock()

# Load training data
data = sc.textFile("sWAMSpark_train.csv")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

# Load test data
testdata = sc.textFile("sWSpark_test.csv")
parsedTestData = testdata.map(parsePoint)

# Evaluate the model on test data (the error rate is computed over the
# test set, so divide by the test count, not the training count)
labelsAndPreds = parsedTestData.map(lambda p: (p.label, model.predict(p.features)))
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedTestData.count())
print("Test Error = " + str(testErr))
end = time.time()
print("Time is = " + str(end - start))
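
For reference, parsePoint expects comma-separated rows with the label in the
first column; the rows below are made-up values for illustration only:

# Hypothetical input rows; -1 labels are remapped to 0 for MLlib:
parsePoint("-1,0.25,1.30,0.75")   # -> LabeledPoint(0, [0.25, 1.30, 0.75])
parsePoint("1,0.80,0.15,0.40")    # -> LabeledPoint(1.0, [0.80, 0.15, 0.40])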









Re: Spark 1.0 and Logistic Regression Python Example

2014-06-30 Thread Xiangrui Meng
You were using an old version of numpy, 1.4? I think this is fixed in
the latest master. Try to replace vec.dot(target) by numpy.dot(vec,
target), or use the latest master. -Xiangrui
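
If editing the Spark installation is not an option, the same fix can be
applied at runtime before training. This is an untested sketch: it assumes
dense features only, and that classification.py imports _dot by name (as the
traceback suggests):

import numpy
from pyspark.mllib import classification, _common

def _dot_compat(vec, target):
    # The plain numpy.dot function also exists on NumPy 1.4,
    # which lacks the ndarray.dot method.
    return numpy.dot(vec, target)

# Rebind the helper wherever the original name is visible.
_common._dot = _dot_compat
classification._dot = _dot_compat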
