Pickling error with PySpark and Elasticsearch-py analyzer

2015-08-22 Thread pkphlam
Reposting my question from SO:
http://stackoverflow.com/questions/32161865/elasticsearch-analyze-not-compatible-with-spark-in-python

I'm using the elasticsearch-py client within PySpark (Python 3) and I'm
running into a problem with the analyze() function when I call it on an RDD.
In particular, each record in my RDD is a string of text and I'm trying to
analyze it to get out the token information, but I'm getting an error when
I try to use it inside a map function in Spark.

For example, this works perfectly fine:

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch()
>>> t = 'the quick brown fox'
>>> es.indices.analyze(text=t)['tokens'][0]

{'end_offset': 3,
'position': 1,
'start_offset': 0,
'token': 'the',
'type': ''}

However, when I try this:

>>> trdd = sc.parallelize(['the quick brown fox'])
>>> trdd.map(lambda x: es.indices.analyze(text=x)['tokens'][0]).collect()

I get a really long error message related to pickling (here's the end of
it):

PicklingError: Could not pickle object as excessively deep recursion required.


I'm not sure what the error means. Am I doing something wrong? Is there a
way to map the ES analyze function onto records of an RDD?
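
One workaround I've been considering (just an untested sketch, not something
I've verified) is to construct the Elasticsearch client inside the function
that runs on the executors, e.g. with mapPartitions, so the driver-side client
never has to be pickled into the closure:

from elasticsearch import Elasticsearch

def analyze_partition(texts):
    # Build the client on the executor instead of capturing the
    # driver-side client in the closure (which is what gets pickled).
    es = Elasticsearch()  # assumes the executors can reach the ES host
    for t in texts:
        yield es.indices.analyze(text=t)['tokens'][0]

trdd = sc.parallelize(['the quick brown fox'])
result = trdd.mapPartitions(analyze_partition).collect()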






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pickling-error-with-PySpark-and-Elasticsearch-py-analyzer-tp24402.html



Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi,

If I understand the RandomForest model in the ML Pipeline implementation (the
ml package) correctly, I have to first run my outcome label variable through a
StringIndexer, even if my labels are already numeric. The StringIndexer then
converts the labels into numeric indices based on the frequency of each label.

This can create situations where, if I'm classifying binary outcomes and my
original labels are simply 0 and 1, the StringIndexer may actually flip my
labels, so that 0s become 1s and 1s become 0s, if my original 1s were more
frequent. This transformation would then carry over to the predictions. In the
old mllib implementation, the RF does not require the labels to be changed, and
I could use 0/1 labels without worrying about them being transformed.

I was wondering:
1. Why is this the default implementation for the Pipeline RF? This seems
like it could cause a lot of confusion in cases like the one I outlined
above.
2. Is there a way to avoid this by either controlling how the indices are
created in StringIndexer or bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels
were mapped onto the indices so that I can map the predictions back to the
original labels rather than the transformed ones? I suppose I could do this by
counting the original labels and mapping by frequency, but it seems like there
should be a more straightforward way to get it out of the StringIndexer (a
rough sketch of what I have in mind is below).
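
For question 3, what I have in mind is roughly something like the following
(just a sketch; df and predictions are placeholder DataFrames, the column
names are made up, and I'm not sure the labels attribute and IndexToString
are available in the Spark version I'm running):

from pyspark.ml.feature import StringIndexer, IndexToString

# Fit the indexer; the fitted model should expose the index -> original
# label mapping (the most frequent label gets index 0.0).
indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
indexer_model = indexer.fit(df)
print(indexer_model.labels)  # e.g. ['1', '0'] if 1 was the more frequent label

# IndexToString would reverse the mapping on the prediction column so the
# output is back in the original label space.
converter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                          labels=indexer_model.labels)
predictions_with_labels = converter.transform(predictions)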

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html



Extremely poor predictive performance with RF in mllib

2015-08-02 Thread pkphlam
Hi,

This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with MLlib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 Tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then hashing trick for feature selection using
10,000 features
- Run RF with 100 trees and maxDepth of 4 and then predict using the features
from all the 1s observations (a rough sketch of this setup follows below).
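
Roughly, the setup looks like this (simplified sketch; tweets stands in for my
actual RDD of (label, text) pairs, which I haven't reproduced here):

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

htf = HashingTF(numFeatures=10000)

def to_labeled_point(label, text):
    # Whitespace tokenization followed by the hashing trick.
    return LabeledPoint(label, htf.transform(text.split()))

lp = tweets.map(lambda pair: to_labeled_point(pair[0], pair[1]))

rf = RandomForest.trainClassifier(lp, numClasses=2, categoricalFeaturesInfo={},
                                  numTrees=100, maxDepth=4, seed=422)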

So in theory, I should get predictions of close to 12289 1s (especially if
the model overfits). But I'm getting exactly 0 1s, which sounds ludicrous to
me and makes me suspect something is wrong with my code or I'm missing
something. I notice similar behavior (although not as extreme) if I play
around with the settings. But I'm getting normal behavior with other
classifiers, so I don't think it's my setup that's the problem.

For example:

>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()
9077

>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()
10287.0

>>> rf = RandomForest.trainClassifier(lp, numClasses=2,
...                                   categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()
0.0

This code was all run back to back so I didn't change anything in between.
Does anybody have a possible explanation for this?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html