Thanks Xiangrui, your suggestion fixed the problem. I'll see whether I can upgrade numpy/Python for a permanent fix. My current versions of Python and numpy are 2.6 and 4.1.9, respectively.
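For the record, the change was just swapping the ndarray method call for the module-level function. Here is a minimal standalone sketch of the pattern (the vectors are made-up stand-ins for what _dot() receives in pyspark/mllib/_common.py):

import numpy

# Made-up stand-ins for the feature and coefficient vectors.
vec = numpy.array([1.0, 2.0, 3.0])
target = numpy.array([0.5, -0.2, 0.1])

# Fails on our numpy with AttributeError, as in the traceback below:
# margin = vec.dot(target)

# Works on old and new numpy alike:
margin = numpy.dot(vec, target)
print(margin)  # 0.4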
Thanks,
Sam

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, July 01, 2014 12:14 AM
To: user@spark.apache.org
Subject: Re: Spark 1.0 and Logistic Regression Python Example

You were using an old version of numpy, 1.4? I think this is fixed in the
latest master. Try replacing vec.dot(target) with numpy.dot(vec, target),
or use the latest master.

-Xiangrui

On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs <sam.jac...@us.abb.com> wrote:
> Hi,
>
> I modified the example code for logistic regression to compute the
> classification error. Please see below. However, the code fails when it
> reaches this call:
>
> labelsAndPreds.filter(lambda (v, p): v != p).count()
>
> with the following error message (something related to numpy or the dot
> product):
>
>   File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py",
>     line 65, in predict
>       margin = _dot(x, self._coeff) + self._intercept
>   File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py",
>     line 443, in _dot
>       return vec.dot(target)
> AttributeError: 'numpy.ndarray' object has no attribute 'dot'
>
> FYI, I am running the code with spark-submit, i.e.
>
> ./bin/spark-submit examples/src/main/python/mllib/logistic_regression2.py
>
> The code is posted below in case it is useful:
>
> import time
>
> from pyspark import SparkContext
> from pyspark.mllib.classification import LogisticRegressionWithSGD
> from pyspark.mllib.regression import LabeledPoint
>
>
> # Load and parse the data
> def parsePoint(line):
>     values = [float(x) for x in line.split(',')]
>     if values[0] == -1:  # Convert -1 labels to 0 for MLlib
>         values[0] = 0
>     return LabeledPoint(values[0], values[1:])
>
>
> sc = SparkContext(appName="PythonLR")
>
> # Start timing
> start = time.time()
>
> # Load training data and build the model
> data = sc.textFile("sWAMSpark_train.csv")
> parsedData = data.map(parsePoint)
> model = LogisticRegressionWithSGD.train(parsedData)
>
> # Load test data
> testdata = sc.textFile("sWSpark_test.csv")
> parsedTestData = testdata.map(parsePoint)
>
> # Evaluate the model on the test data; divide by the test count,
> # not the training count
> labelsAndPreds = parsedTestData.map(lambda p: (p.label,
>                                                model.predict(p.features)))
> testErr = (labelsAndPreds.filter(lambda (v, p): v != p).count()
>            / float(parsedTestData.count()))
> print("Test Error = " + str(testErr))
>
> end = time.time()
> print("Time is = " + str(end - start))
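For anyone else who lands on this thread: before patching anything, a quick way to check whether your installed numpy exposes the ndarray .dot() method is the snippet below. My understanding is that the method arrived around numpy 1.5, so treat that cutoff as approximate:

import numpy

print(numpy.__version__)

vec = numpy.array([1.0, 2.0])
target = numpy.array([3.0, 4.0])

if hasattr(vec, 'dot'):
    # ndarray.dot() exists on newer numpy releases
    print(vec.dot(target))
else:
    # Older releases (e.g. the 1.4 line discussed above) only
    # provide the module-level function
    print(numpy.dot(vec, target))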