[ 
https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6362:
-----------------------------------
    Component/s: PySpark

> Broken pipe error when training a RandomForest on a union of two RDDs
> ---------------------------------------------------------------------
>
>                 Key: SPARK-6362
>                 URL: https://issues.apache.org/jira/browse/SPARK-6362
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 1.2.0
>         Environment: Kubuntu 14.04, local driver
>            Reporter: Pavel Laskov
>            Priority: Minor
>
> Training a RandomForest classifier on a dataset obtained as a union of two 
> RDDs throws a broken pipe error:
> Traceback (most recent call last):
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in 
> manager
>     code = worker(sock)
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in 
> worker
>     outfile.flush()
> IOError: [Errno 32] Broken pipe
> Despite an error the job runs to completion. 
> The following code reproduces the error:
> from pyspark.context import SparkContext
> from pyspark.mllib.rand import RandomRDDs
> from pyspark.mllib.tree import RandomForest
> from pyspark.mllib.linalg import DenseVector
> from pyspark.mllib.regression import LabeledPoint
> import random
> if __name__ == "__main__":
>     sc = SparkContext(appName="Union bug test")
>     data1 = RandomRDDs.normalVectorRDD(sc,numRows=10000,numCols=200)
>     data1 = data1.map(lambda x: LabeledPoint(random.randint(0,1),\
>                                              DenseVector(x)))
>     data2 = RandomRDDs.normalVectorRDD(sc,numRows=10000,numCols=200)
>     data2 = data2.map(lambda x: LabeledPoint(random.randint(0,1),\
>                                             DenseVector(x)))
>     training_data = data1.union(data2)
>     #training_data = training_data.repartition(2)
>     model = RandomForest.trainClassifier(training_data, numClasses=2,
>                                          categoricalFeaturesInfo={},
>                                          numTrees=50, maxDepth=30)
> Interestingly, re-partitioning the data after the union operation rectifies 
> the problem (uncomment the line before training in the code above). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to