spark.pyspark.python is ignored?

2017-06-29 Thread Jason White
According to the documentation, `spark.pyspark.python` configures which Python executable is run on the workers. It seems to be ignored in my simple test case. I'm running a pip-installed PySpark 2.1.1, completely stock. The only customization at this point is my Hadoop configuration directory.
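A minimal sketch of the kind of test being described, with a hypothetical interpreter path: set `spark.pyspark.python` on the `SparkConf` before the `SparkContext` (and its JVM) is created, then ask the workers which executable they actually run. If the setting appears ignored, it is also worth checking whether `PYSPARK_PYTHON` is already set in the environment, since the two interact.

```python
import sys
from pyspark import SparkConf, SparkContext

# Hypothetical worker interpreter path -- substitute one that exists
# on every worker node.
conf = (SparkConf()
        .setAppName("pyspark-python-test")
        .set("spark.pyspark.python", "/usr/bin/python3"))

sc = SparkContext(conf=conf)

# Ask the workers which executable they are actually running; if the
# setting took effect, this should print the path configured above.
print(sc.parallelize(range(2)).map(lambda _: sys.executable).distinct().collect())
```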

Case class with POJO - encoder issues

2017-02-11 Thread Jason White
I'd like to create a Dataset using some classes from GeoTools to do some geospatial analysis. In particular, I'm trying to use Spark to distribute the work based on ID and label fields that I extract from the polygon data. My simplified case class looks like this: implicit val geometryEncoder:

Efficient join multiple times

2016-01-08 Thread Jason White
I'm trying to join a constant large-ish RDD to each RDD in a DStream, and I'm trying to keep the join as efficient as possible so each batch finishes within the batch window. I'm using PySpark on 1.6. I've tried the trick of keying the large RDD into (k, v) pairs and using
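A minimal sketch of the keyed-join trick being described, with hypothetical source paths and partition counts: pre-partition and cache the static RDD once, then partition each batch the same way inside `transform()` so the per-batch join can reuse the cached side's layout instead of reshuffling it every batch window.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-static-join")
ssc = StreamingContext(sc, batchDuration=10)

NUM_PARTITIONS = 32  # assumption: tune to the cluster

# Key, partition, and cache the constant RDD once, up front.
static_rdd = (sc.textFile("hdfs:///path/to/reference.csv")   # hypothetical path
                .map(lambda line: (line.split(",")[0], line))
                .partitionBy(NUM_PARTITIONS)
                .cache())

lines = ssc.socketTextStream("localhost", 9999)              # hypothetical source
keyed = lines.map(lambda line: (line.split(",")[0], line))

# Co-partition each batch RDD with the cached side before joining.
joined = keyed.transform(
    lambda rdd: rdd.partitionBy(NUM_PARTITIONS).join(static_rdd))
joined.pprint()

ssc.start()
ssc.awaitTermination()
```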

Re: PySpark + Streaming + DataFrames

2015-11-02 Thread Jason White
o, for other folks who may read this, could reply back with the trusted conversion that worked for you (for a clear solution)? TD. On Mon, Oct 19, 2015 at 3:08 PM, Jason White <jason.wh...@shopify.com> wrote: Ah, that makes sense then, thanks

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Jason White
2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) wrote: RDD and DF are not compatible data types, so you cannot return a DF when you have to return an RDD. What you can do instead is return the underlying RDD of the dataframe via dataframe.rdd(). On Fri, Oct 16, 2015 at 12:07 PM, Jason Wh
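In PySpark terms, TD's suggestion looks roughly like the sketch below; names such as `json_stream` and the driver-side `sqlContext` are assumptions, not the original code. `transform()` must hand back an RDD, so return the DataFrame's underlying RDD of `Row`s via the `.rdd` property.

```python
# A sketch, not the original code: build the DataFrame per batch, but
# return its underlying RDD of Rows so transform() gets the type it expects.
def to_rows(rdd):
    if rdd.isEmpty():
        return rdd
    df = sqlContext.read.json(rdd)  # Spark 1.5-era API: RDD of JSON strings -> DataFrame
    return df.rdd                   # .rdd is a property in PySpark, not a method

row_stream = json_stream.transform(to_rows)
```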

PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
I'm trying to create a DStream of DataFrames using PySpark. I receive data from Kafka in the form of JSON strings, and I'm parsing these RDDs of strings into DataFrames. My code is: I get the following error at pyspark/streaming/util.py, line 64: I've verified that the sqlContext is properly
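A hedged sketch of the setup being described, with assumed topic and broker names (the original code and error are truncated above): consume JSON strings from Kafka and convert each batch with `foreachRDD`, which, unlike `transform()`, doesn't have to return anything.

```python
from pyspark.streaming.kafka import KafkaUtils

# Hypothetical topic name and broker list.
kafka_stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"})

json_strings = kafka_stream.map(lambda kv: kv[1])  # messages arrive as (key, value)

def process(time, rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.read.json(rdd)  # parse the batch of JSON strings
    df.show()

json_strings.foreachRDD(process)
```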

Re: PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
Hi Ken, thanks for replying. Unless I'm misunderstanding something, I don't believe that's correct. DStream.transform() accepts a single argument, func. func should be a function that accepts a single RDD and returns a single RDD. That's what transform_to_df does, except the RDD it returns is a

PySpark Checkpoints with Broadcast Variables

2015-09-29 Thread Jason White
I'm having trouble loading a streaming job from a checkpoint when a broadcast variable is defined. I've seen the solution by TD in Scala (https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to get/create an accumulator, but I can't seem to get it to work in PySpark with a
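A rough Python translation of the singleton pattern from SPARK-5206, as a sketch under assumptions (the loader, the lookup data, and the stream are all hypothetical): lazily re-create the broadcast variable on first use inside `foreachRDD`, so that after a checkpoint restore it gets rebuilt rather than recovered, stale, from a checkpointed closure.

```python
class BroadcastWrapper(object):
    """Process-level singleton holding the broadcast variable."""
    _instance = None

    @classmethod
    def get(cls, sc):
        if cls._instance is None:
            data = load_reference_data()      # hypothetical loader returning a dict
            cls._instance = sc.broadcast(data)
        return cls._instance

def enrich(time, rdd):
    # Re-obtain (or lazily re-create) the broadcast on every batch,
    # instead of capturing it in a checkpointed closure.
    bv = BroadcastWrapper.get(rdd.context)
    print(rdd.map(lambda x: (x, bv.value.get(x))).count())

stream.foreachRDD(enrich)
```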
