The data is in LIBSVM format, so this line won't work:

    values = [float(s) for s in line.split(' ')]
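For reference, each LIBSVM line is "label index1:value1 index2:value2 ...", which is why the naive split fails. A minimal sketch (the feature token is the one from the traceback below; the label is illustrative):

    # One LIBSVM-formatted row: a label followed by index:value pairs.
    line = "0 1:0.4551273600657362"
    # The split yields the token '1:0.4551273600657362', and the embedded
    # colon makes float() raise "ValueError: invalid literal for float()".
    values = [float(s) for s in line.split(' ')]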
Please use the util function in MLUtils to load it as an RDD of LabeledPoint:
http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point

    from pyspark.mllib.util import MLUtils
    examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

-Xiangrui

On Sun, Nov 23, 2014 at 11:38 AM, Venkat, Ankam
<ankam.ven...@centurylink.com> wrote:
> Can you please suggest sample data for running the logistic_regression.py?
>
> I am trying to use a sample data file at
> https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt
>
> I am running this on CDH5.2 Quickstart VM.
>
> [cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3
>
> But, getting below error.
>
> 14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
> 14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0
> 14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
> Traceback (most recent call last):
>   File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 50, in <module>
>     model = LogisticRegressionWithSGD.train(points, iterations)
>   File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 110, in train
>     initialWeights)
>   File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 430, in _regression_train_wrapper
>     initial_weights = _get_initial_weights(initial_weights, data)
>   File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 415, in _get_initial_weights
>     initial_weights = _convert_vector(data.first().features)
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1127, in first
>     rs = self.take(1)
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1109, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/usr/lib/spark/python/pyspark/context.py", line 770, in runJob
>     it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/worker.py", line 79, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
>     for obj in iterator:
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 185, in _batched
>     for item in iterator:
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1105, in takeUpToNumLeft
>     yield next(iterator)
>   File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
>     values = [float(s) for s in line.split(' ')]
> ValueError: invalid literal for float(): 1:0.4551273600657362
>
> Regards,
> Venkat
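Putting the loadLibSVMFile suggestion together with the failing script, a minimal end-to-end sketch might look like the following (the file path is the example data shipped with Spark; the iteration count mirrors the original spark-submit invocation; the exact script layout is illustrative, not the shipped example):

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="PythonLR")

    # loadLibSVMFile returns an RDD[LabeledPoint]; no manual parsing needed.
    points = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    # Same call the example script makes, with iterations=3 as in the
    # original command line.
    model = LogisticRegressionWithSGD.train(points, iterations=3)
    print(model.weights)

    sc.stop()

Note also that logistic regression expects binary 0/1 labels, so sample_libsvm_data.txt is the right sample file here; sample_linear_regression_data.txt has real-valued labels meant for linear regression.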