pickling error with PySpark and Elasticsearch-py analyzer
Reposting my question from SO: http://stackoverflow.com/questions/32161865/elasticsearch-analyze-not-compatible-with-spark-in-python

I'm using the elasticsearch-py client within PySpark (Python 3) and I'm running into a problem using the analyze() function of ES in conjunction with an RDD. Each record in my RDD is a string of text, and I'm trying to analyze it to get the token information, but I get an error when I use it inside a map function in Spark.

For example, this works perfectly fine:

>> from elasticsearch import Elasticsearch
>> es = Elasticsearch()
>> t = 'the quick brown fox'
>> es.indices.analyze(text=t)['tokens'][0]
{'end_offset': 3, 'position': 1, 'start_offset': 0, 'token': 'the', 'type': '<ALPHANUM>'}

However, when I try this:

>> trdd = sc.parallelize(['the quick brown fox'])
>> trdd.map(lambda x: es.indices.analyze(text=x)['tokens'][0]).collect()

I get a really long error message related to pickling. Here is the end of it:

pickle.PicklingError: Could not pickle object as excessively deep recursion required.

I'm not sure what the error means. Am I doing something wrong? Is there a way to map the ES analyze function onto records of an RDD?
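
A minimal sketch, not from the original post, of one common workaround for this kind of pickling failure: create the Elasticsearch client on the workers (for example with mapPartitions) instead of capturing the driver-side es object in the map closure, so the unpicklable client never has to be serialized. The function name analyze_partition is illustrative:

    from elasticsearch import Elasticsearch

    def analyze_partition(records):
        es = Elasticsearch()  # created on the worker, never pickled
        for text in records:
            yield es.indices.analyze(text=text)['tokens'][0]

    trdd = sc.parallelize(['the quick brown fox'])
    tokens = trdd.mapPartitions(analyze_partition).collect()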
Random Forest and StringIndexer in pyspark ML Pipeline
Hi,

If I understand the RandomForest model in the ML Pipeline implementation (the ml package) correctly, I have to first run my outcome label variable through StringIndexer, even if my labels are numeric. StringIndexer then converts the labels into numeric indices based on the frequency of each label.

This can create situations where, if I'm classifying binary outcomes and my original labels are simply 0 and 1, StringIndexer may actually flip my labels, so that 0s become 1s and 1s become 0s, if my original 1s were more frequent. This transformation then carries over into the predictions. The old mllib implementation of RF does not require the labels to be changed, and I could use 0/1 labels without worrying about them being transformed.

I was wondering:

1. Why is this the default behavior for the Pipeline RF? It seems like it could cause a lot of confusion in cases like the one I outlined above.
2. Is there a way to avoid this, either by controlling how the indices are created in StringIndexer or by bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels were mapped onto the indices, so that I can convert the predictions back to the original labels rather than the transformed ones (see the sketch below)? I suppose I could do this by counting the original labels and mapping by frequency, but it seems like there should be a more straightforward way to get it out of the StringIndexer.

Thanks!
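
A minimal sketch, not from the original post, of how the label mapping can be inspected and reversed, assuming a Spark version where StringIndexerModel exposes labels and pyspark.ml.feature.IndexToString is available (newer than the 1.4 era of this thread); the DataFrames df and predictions and the column names are illustrative:

    from pyspark.ml.feature import StringIndexer, IndexToString

    # fit the indexer; labels are ordered by descending frequency,
    # so indexer_model.labels[i] is the original label mapped to index i
    indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
    indexer_model = indexer.fit(df)
    print(indexer_model.labels)

    # map the model's numeric predictions back to the original label values
    converter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                              labels=indexer_model.labels)
    predictions_with_labels = converter.transform(predictions)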
Extremely poor predictive performance with RF in mllib
Hi,

This might be a long shot, but has anybody run into very poor predictive performance using RandomForest with MLlib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then the hashing trick for feature selection, using 10,000 features
- Run RF with 100 trees and maxDepth of 4, then predict using the features from all of the 1s observations.

So in theory, I should get predictions close to 12289 1s (especially if the model overfits). But I'm getting exactly zero 1s, which sounds ludicrous to me and makes me suspect something is wrong with my code or that I'm missing something. I notice similar behavior (although not as extreme) if I play around with the settings. But I'm getting normal behavior with other classifiers, so I don't think my setup is the problem. For example:

>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()
9077

>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()
10287.0

>>> rf = RandomForest.trainClassifier(lp, numClasses=2,
...                                   categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()
0.0

This code was all run back to back, so I didn't change anything in between. Does anybody have a possible explanation for this?

Thanks!
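
A minimal sketch, not the poster's actual code, of the featurization described above (whitespace tokenization plus the hashing trick with 10,000 features) using pyspark.mllib; variable names such as tweets are illustrative:

    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    htf = HashingTF(numFeatures=10000)

    # tweets: RDD of (label, text) pairs, e.g. (1.0, 'some tweet text')
    lp = tweets.map(lambda x: LabeledPoint(x[0], htf.transform(x[1].split(' '))))

    # features of the observations labeled 1, used as the prediction input above
    predict_feat = lp.filter(lambda p: p.label == 1.0).map(lambda p: p.features)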