PySpark: use one column to index another (udf of two columns?)

2017-02-14 Thread apu
Let's say I have a Spark (PySpark) dataframe with the following schema:

root
 |-- myarray: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- myindices: array (nullable = true)
 |    |-- element: integer (containsNull = true)

It looks like:
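One way to index one array column by another is a UDF over both columns. A minimal sketch, assuming the schema above and a dataframe named df:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # UDF over both columns: pick the elements of `myarray`
    # at the positions listed in `myindices`
    def pick(arr, idxs):
        if arr is None or idxs is None:
            return None
        return [arr[i] for i in idxs]

    pick_udf = udf(pick, ArrayType(StringType()))

    result = df.withColumn("picked", pick_udf("myarray", "myindices"))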

Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread apu
Is there a way to either (a) add Hive support to an existing SparkSession, or (b) configure PySpark so that the SparkSession it creates at startup has Hive support enabled? Thanks! Apu
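For (b), a minimal sketch using the Spark 2.x builder API; enableHiveSupport() has to be set before the session is first created:

    from pyspark.sql import SparkSession

    # Build a session with Hive support enabled; getOrCreate() returns
    # any existing session unchanged, so this must run before one exists
    spark = SparkSession.builder \
        .appName("hive-enabled") \
        .enableHiveSupport() \
        .getOrCreate()

As far as I know, (a) is not possible: the catalog implementation is fixed when the session is created.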

Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread apu
1.6)?* Thanks!! Apu
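For context, a minimal sketch of explicit caching when a DataFrame feeds multiple actions (df is hypothetical):

    # Persist a dataframe that is reused across actions, so the
    # underlying computation runs once rather than once per action
    df.cache()
    total = df.count()     # first action materializes the cache
    sample = df.take(10)   # subsequent actions read from the cache
    df.unpersist()         # release cached blocks when done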

Re: How can I use pyspark.ml.evaluation.BinaryClassificationEvaluator with point predictions instead of confidence intervals?

2016-06-24 Thread apu
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(vectorizedpredictions)

On Fri, Jun 24, 2016 at 10:42 AM, apu <apumishra...@gmail.com> wrote:
> pyspark.ml.evaluation.BinaryClassificationEvaluator expects
> predictions in the form of vectors (apparently de
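Note that the evaluator's rawPredictionCol can also point at a plain double column of scores rather than a vector column. A sketch, with column names as assumptions:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # `predictions` is assumed to carry a double `score` column
    # (e.g. probability of label 1) and a double `label` column
    evaluator = BinaryClassificationEvaluator(rawPredictionCol="score",
                                              labelCol="label")
    auc = evaluator.evaluate(predictions)  # areaUnderROC by default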

How can I use pyspark.ml.evaluation.BinaryClassificationEvaluator with point predictions instead of confidence intervals?

2016-06-24 Thread apu
BinaryClassificationMetrics.) Thanks! Apu
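The RDD-based pyspark.mllib.evaluation.BinaryClassificationMetrics mentioned above takes plain (score, label) pairs, which sidesteps the vector requirement. A sketch, column names assumed:

    from pyspark.mllib.evaluation import BinaryClassificationMetrics

    # Build an RDD of (score, label) pairs from a dataframe assumed
    # to carry double columns `score` and `label`
    score_and_labels = predictions.rdd.map(lambda r: (r.score, r.label))
    metrics = BinaryClassificationMetrics(score_and_labels)
    print(metrics.areaUnderROC)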

Nearest neighbors in Spark with Annoy

2016-02-03 Thread apu mishra . rr
As mllib doesn't have nearest-neighbors functionality, I'm trying to use Annoy for Approximate Nearest Neighbors. I try to broadcast the Annoy object and pass it to workers; however, it does not operate as expected. Below is complete code for reproducibility.
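A common workaround (an assumption, not taken from this thread): Annoy indexes don't pickle, so rather than broadcasting the object, save the index to a file, ship it with SparkContext.addFile, and have workers load their local copy:

    from annoy import AnnoyIndex
    from pyspark import SparkFiles

    DIM = 40  # vector dimensionality (assumed)

    index = AnnoyIndex(DIM, 'angular')
    # ... index.add_item(i, vec) for each vector, then:
    index.build(10)          # 10 trees
    index.save('index.ann')

    sc.addFile('index.ann')  # distribute the file to every executor

    def nearest_partition(vecs, n=5):
        # Load the index once per partition, not once per record
        local = AnnoyIndex(DIM, 'angular')
        local.load(SparkFiles.get('index.ann'))
        for v in vecs:
            yield local.get_nns_by_vector(v, n)

    neighbors = vectors_rdd.mapPartitions(nearest_partition)  # vectors_rdd assumed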

Getting the size of a broadcast variable

2016-02-01 Thread apu mishra . rr
How can I determine the size (in bytes) of a broadcast variable? Do I need to use the .dump method and then look at the size of the result, or is there an easier way? Using PySpark with Spark 1.6. Thanks! Apu
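One rough approach (a sketch, not an official API): measure the pickled size of the value on the driver:

    import pickle

    data = list(range(1000000))
    bcast = sc.broadcast(data)

    # Approximate serialized size in bytes; the actual on-the-wire
    # size may differ since Spark compresses broadcast blocks
    size_bytes = len(pickle.dumps(bcast.value))
    print(size_bytes)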

StackOverflowError when writing dataframe to table

2015-12-09 Thread apu mishra . rr
The same code sometimes works fine, sometimes not. I am running PySpark with:

    spark-submit --master local[*] --driver-memory 24g --executor-memory 24g

Any help understanding this issue would be appreciated! Thanks, Apu

Fuller error message: Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
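Two mitigations often suggested for driver-side StackOverflowError on long lineages (assumptions, not taken from this thread): raise the JVM thread stack size, or break the lineage with a checkpoint before writing:

    # Option 1: a larger driver stack via spark-submit, e.g.
    #   spark-submit --conf spark.driver.extraJavaOptions=-Xss16m ...

    # Option 2: truncate the lineage before writing (df assumed)
    sc.setCheckpointDir('/tmp/spark-checkpoints')
    rdd = df.rdd
    rdd.checkpoint()
    rdd.count()  # forces the checkpoint to materialize
    df2 = sqlContext.createDataFrame(rdd, df.schema)  # lineage restarts here
    df2.write.saveAsTable('mytable')  # table name assumed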

Where does mllib's .save method save a model to?

2015-11-02 Thread apu mishra . rr
I cannot find anything by the name "myModelPath" in any obvious places. Any ideas where it might lie? Thanks, -Apu

To reproduce:
# In PySpark, create ALS or other mllib model, then model.save(sc, "myModelPath")
# In Unix environ
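A likely explanation (an assumption consistent with mllib's save API): a bare path resolves against Hadoop's default filesystem, so on a machine configured for HDFS the model lands in HDFS rather than the local working directory. Being explicit removes the ambiguity:

    # Explicit URI schemes make the destination unambiguous
    model.save(sc, 'file:///tmp/myModelPath')        # local filesystem
    # model.save(sc, 'hdfs:///user/apu/myModelPath') # HDFS

    # then inspect from the shell:
    #   ls /tmp/myModelPath                  (local)
    #   hdfs dfs -ls /user/apu/myModelPath   (HDFS)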