Hi, I am using Spark 1.4.0, Python, and decision trees (MLlib) to perform machine learning classification.
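For context, tree_model and test_data below come from the usual MLlib decision-tree workflow, roughly as sketched here (the data path, split ratio, and tree parameters are placeholders, not my exact setup):

    from pyspark.mllib.tree import DecisionTree
    from pyspark.mllib.util import MLUtils

    # Placeholder data loading and split; my real data differs
    data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    train_data, test_data = data.randomSplit([0.7, 0.3])

    # Placeholder parameters for the classifier
    tree_model = DecisionTree.trainClassifier(train_data, numClasses=2,
                                              categoricalFeaturesInfo={},
                                              impurity='gini', maxDepth=5)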
I evaluate the model by generating predictions and zipping them with the test data, as follows:

    predictions = tree_model.predict(test_data.map(lambda a: a.features))
    labels = test_data.map(lambda a: a.label).zip(predictions)
    correct = 100 * (labels.filter(lambda (v, p): v == p).count() / float(test_data.count()))

I always get this error in the zipping phase:

    Can not deserialize RDD with different number of items in pair: (3, 2)

To avoid zipping, I tried to do it in a different way, as follows:

    labels = test_data.map(lambda a: (a.label, tree_model.predict(a.features)))
    correct = 100 * (labels.filter(lambda (v, p): v == p).count() / float(test_data.count()))

However, this always fails with:

    in __getnewargs__(self)
        250         # This method is called when attempting to pickle SparkContext, which is always an error:
        251         raise Exception(
    --> 252             "It appears that you are attempting to reference SparkContext from a broadcast "
        253             "variable, action, or transforamtion. SparkContext can only be used on the driver, "
        254             "not in code that it run on workers. For more information, see SPARK-5063."

    Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Is the DecisionTreeModel part of the SparkContext?! I found that, using Scala, the second approach works with no problem.

So, how can I solve these two problems?

Thanks and Regards,
Hisham
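P.S. In case it helps frame the question: the only workaround I can think of for the first error is to replace zip() with an explicit index-based join, roughly as sketched below. This is untested on my side and only sidesteps the mismatch rather than explaining it, so I would still like to understand the root cause.

    # Untested sketch: align labels and predictions by index instead of zip()
    indexed_labels = test_data.map(lambda a: a.label).zipWithIndex().map(lambda x: (x[1], x[0]))
    predictions = tree_model.predict(test_data.map(lambda a: a.features))
    indexed_preds = predictions.zipWithIndex().map(lambda x: (x[1], x[0]))
    labels = indexed_labels.join(indexed_preds).values()
    correct = 100 * (labels.filter(lambda (v, p): v == p).count() / float(test_data.count()))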