The data is in LIBSVM format, where each line is "<label> <index1>:<value1> <index2>:<value2> ...", so this line won't work:

values = [float(s) for s in line.split(' ')]
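
For example, the token shown at the bottom of your traceback already breaks that parser (a minimal repro; the "1 " label in front is a hypothetical addition):

line = "1 1:0.4551273600657362"   # a LIBSVM line: label, then index:value pairs
values = [float(s) for s in line.split(' ')]
# float("1:0.4551273600657362") raises:
# ValueError: invalid literal for float(): 1:0.4551273600657362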

Please use the loadLibSVMFile utility in MLUtils to load the data as an RDD of LabeledPoint:

http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point

from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
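
Putting it together, a minimal sketch (assuming the Spark 1.x MLlib Python API on your VM; the app name, path, and iteration count below are placeholders, not part of the original example):

from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="LogisticRegressionSample")  # hypothetical app name

# loadLibSVMFile parses each "label index:value ..." line into a LabeledPoint
examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

# Train logistic regression with SGD on the parsed LabeledPoints
model = LogisticRegressionWithSGD.train(examples, iterations=100)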

-Xiangrui

On Sun, Nov 23, 2014 at 11:38 AM, Venkat, Ankam
<ankam.ven...@centurylink.com> wrote:
> Can you please suggest sample data for running logistic_regression.py?
>
> I am trying to use the sample data file at
> https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt
>
> I am running this on the CDH 5.2 QuickStart VM.
> [cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3
>
> But I am getting the error below:
>
> 14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
> 14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0
> 14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
>
> Traceback (most recent call last):
>   File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 50, in <module>
>     model = LogisticRegressionWithSGD.train(points, iterations)
>   File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 110, in train
>     initialWeights)
>   File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 430, in _regression_train_wrapper
>     initial_weights = _get_initial_weights(initial_weights, data)
>   File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 415, in _get_initial_weights
>     initial_weights = _convert_vector(data.first().features)
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1127, in first
>     rs = self.take(1)
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1109, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/usr/lib/spark/python/pyspark/context.py", line 770, in runJob
>     it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/worker.py", line 79, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
>     for obj in iterator:
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 185, in _batched
>     for item in iterator:
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1105, in takeUpToNumLeft
>     yield next(iterator)
>   File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
>     values = [float(s) for s in line.split(' ')]
> ValueError: invalid literal for float(): 1:0.4551273600657362
>
> Regards,
> Venkat
