Nearest neighbors in Spark with Annoy

2016-02-03 Thread apu mishra . rr
As MLlib doesn't have nearest-neighbors functionality, I'm trying to use Annoy for approximate nearest neighbors. I tried broadcasting the Annoy object and passing it to the workers; however, it does not work as expected. Below is complete code for reproducibility.
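A commonly suggested workaround, assuming the failure comes from the Annoy index not surviving serialization when it is broadcast, is to save the index to a file, make that file available on every worker, and load it once per partition. The sketch below is not the poster's code: `load_annoy`, `make_partition_query`, the `(key, vector)` row layout, and the "angular" metric are all hypothetical choices made for illustration. The loader is passed in as a parameter so the pattern can be tried without Spark or Annoy installed.

```python
def load_annoy(index_path, dim):
    """Load a saved Annoy index from disk (intended to run on a worker)."""
    from annoy import AnnoyIndex  # imported lazily, only where it is needed
    index = AnnoyIndex(dim, "angular")  # metric is an assumption
    index.load(index_path)
    return index


def make_partition_query(index_path, dim, k, load_index=load_annoy):
    """Build a function that can be handed to rdd.mapPartitions."""
    def query_partition(rows):
        # Load the index once per partition rather than once per row.
        index = load_index(index_path, dim)
        for key, vector in rows:
            yield key, index.get_nns_by_vector(vector, k)
    return query_partition
```

For this to work in PySpark, `index_path` must be valid on every worker, for example a file distributed with `sc.addFile` and resolved on the worker with `SparkFiles.get`.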

Getting the size of a broadcast variable

2016-02-01 Thread apu mishra . rr
How can I determine the size (in bytes) of a broadcast variable? Do I need to use the .dump method and then look at the size of the result, or is there an easier way? Using PySpark with Spark 1.6. Thanks! Apu
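One rough way to estimate this without going through Spark, assuming PySpark serializes broadcast values with pickle (which, to my knowledge, is what its `Broadcast.dump` method does in the 1.x line), is to pickle the value on the driver and count the bytes. The helper name `pickled_size_bytes` and the sample data are hypothetical, and the number Spark actually ships may differ, for instance if broadcast compression is enabled.

```python
import pickle


def pickled_size_bytes(value, protocol=2):
    """Return the number of bytes produced by pickling `value`."""
    return len(pickle.dumps(value, protocol))


# Stand-in for whatever would be passed to sc.broadcast(...)
sample = list(range(1000))
size = pickled_size_bytes(sample)
print(size)
```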

StackOverflowError when writing dataframe to table

2015-12-09 Thread apu mishra . rr
The command mydataframe.write.saveAsTable(name="tablename") sometimes results in java.lang.StackOverflowError (see below for fuller error message). This is after I am able to successfully run cache() and show() methods on mydataframe. The issue is not deterministic, i.e. the same code

Where does mllib's .save method save a model to?

2015-11-02 Thread apu mishra . rr
I want to save an MLlib model to disk, and am trying the model.save operation as described in http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples: model.save(sc, "myModelPath"). But after running it, I am unable to find any newly created file or dir by the name
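A relative path such as "myModelPath" is, as far as I know, resolved against the default filesystem of the Spark/Hadoop configuration, which on a cluster may be HDFS rather than the local disk. A minimal sketch of making the destination explicit is to build an absolute URI with a scheme before saving. The helper `explicit_file_uri` is a hypothetical name, and the `model.save` call appears only as a comment because it needs a live SparkContext.

```python
from pathlib import Path


def explicit_file_uri(relative_path):
    """Turn a relative path into an absolute file:// URI."""
    return Path(relative_path).resolve().as_uri()


target = explicit_file_uri("myModelPath")
print(target)
# Hypothetical use with the call from the post: model.save(sc, target)
```

On a multi-node cluster a file:// URI refers to each node's own local disk, so an hdfs:// URI is usually the better choice there.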