Re: Python Logistic Regression error

2014-11-24 Thread Xiangrui Meng
The data is in LIBSVM format. So this line won't work:

values = [float(s) for s in line.split(' ')]
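
In LIBSVM format each line looks like "label index1:value1 index2:value2 ..." (feature indices are one-based), so a token such as the 1:0.4551273600657362 in your traceback is an index:value pair, which float() cannot parse.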

Please use the util function in MLUtils to load it as an RDD of LabeledPoint.

http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point

from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
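
For completeness, here is a minimal sketch of the whole flow (the app name, file path, and iteration count below are placeholders; adjust them for your setup):

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="LogisticRegressionLibSVM")
# loadLibSVMFile parses "label index:value ..." lines into an RDD of LabeledPoint
points = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# train on the LabeledPoint RDD directly; no manual parsePoint is needed
model = LogisticRegressionWithSGD.train(points, iterations=10)
print(model.weights)
sc.stop()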

-Xiangrui




Python Logistic Regression error

2014-11-23 Thread Venkat, Ankam
Can you please suggest sample data for running logistic_regression.py?

I am trying to use a sample data file at  
https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt

I am running this on CDH5.2 Quickstart VM.

[cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3

But I am getting the error below:

14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0
14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
Traceback (most recent call last):
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 50, in <module>
    model = LogisticRegressionWithSGD.train(points, iterations)
  File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 110, in train
    initialWeights)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 430, in _regression_train_wrapper
    initial_weights = _get_initial_weights(initial_weights, data)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 415, in _get_initial_weights
    initial_weights = _convert_vector(data.first().features)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1127, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1109, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1105, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
    values = [float(s) for s in line.split(' ')]
ValueError: invalid literal for float(): 1:0.4551273600657362

Regards,
Venkat