[ https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell updated SPARK-6362: ----------------------------------- Component/s: PySpark > Broken pipe error when training a RandomForest on a union of two RDDs > --------------------------------------------------------------------- > > Key: SPARK-6362 > URL: https://issues.apache.org/jira/browse/SPARK-6362 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark > Affects Versions: 1.2.0 > Environment: Kubuntu 14.04, local driver > Reporter: Pavel Laskov > Priority: Minor > > Training a RandomForest classifier on a dataset obtained as a union of two > RDDs throws a broken pipe error: > Traceback (most recent call last): > File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in > manager > code = worker(sock) > File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in > worker > outfile.flush() > IOError: [Errno 32] Broken pipe > Despite an error the job runs to completion. > The following code reproduces the error: > from pyspark.context import SparkContext > from pyspark.mllib.rand import RandomRDDs > from pyspark.mllib.tree import RandomForest > from pyspark.mllib.linalg import DenseVector > from pyspark.mllib.regression import LabeledPoint > import random > if __name__ == "__main__": > sc = SparkContext(appName="Union bug test") > data1 = RandomRDDs.normalVectorRDD(sc,numRows=10000,numCols=200) > data1 = data1.map(lambda x: LabeledPoint(random.randint(0,1),\ > DenseVector(x))) > data2 = RandomRDDs.normalVectorRDD(sc,numRows=10000,numCols=200) > data2 = data2.map(lambda x: LabeledPoint(random.randint(0,1),\ > DenseVector(x))) > training_data = data1.union(data2) > #training_data = training_data.repartition(2) > model = RandomForest.trainClassifier(training_data, numClasses=2, > categoricalFeaturesInfo={}, > numTrees=50, maxDepth=30) > Interestingly, re-partitioning the data after the union operation rectifies > the problem (uncomment the line before training in the code above). -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org