Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
That was indeed the case; using UTF8Deserializer makes everything work correctly. Thanks for the tips! On Thu, Jun 30, 2016 at 3:32 PM, Pedro Rodriguez wrote: > Quick update, I was able to get most of the plumbing to work thanks to the code Holden posted and browsing
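
A minimal sketch of the fix described here, with a made-up Scala helper standing in for the real one: build the Python RDD wrapper with UTF8Deserializer so it matches the RDD[String] coming back from the JVM instead of expecting pickled Python objects.

    from pyspark import SparkContext
    from pyspark.rdd import RDD
    from pyspark.serializers import UTF8Deserializer

    sc = SparkContext(master="local[*]", appName="utf8-deserializer-sketch")

    # Hypothetical Scala helper returning a JavaRDD[String]
    jrdd = sc._jvm.com.example.S3Lister.listKeys(sc._jsc.sc(), "s3://bucket/prefix")

    # Tell PySpark the JVM is handing back UTF-8 strings, not pickled Python objects
    keys = RDD(jrdd, sc, UTF8Deserializer())
    keys.first()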

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Quick update, I was able to get most of the plumbing to work thanks to the code Holden posted and browsing more source code. I am running into this error, which makes me think that maybe I shouldn't be leaving the default Python RDD serializer/pickler in place and should do something else
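
For context, a sketch of the plumbing that produces this kind of error, assuming the Scala side hands back a JavaRDD[String] (the class and method names are made up): the Python RDD wrapper defaults to the pickle serializer, so it tries to unpickle raw UTF-8 bytes coming from the JVM.

    from pyspark import SparkContext
    from pyspark.rdd import RDD

    sc = SparkContext(master="local[*]", appName="serializer-sketch")

    # Hypothetical Scala helper returning a JavaRDD[String] of S3 keys
    jrdd = sc._jvm.com.example.S3Lister.listKeys(sc._jsc.sc(), "s3://bucket/prefix")

    # The default deserializer assumes pickled Python objects, so this fails on plain strings
    keys = RDD(jrdd, sc)
    keys.first()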

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Thanks Jeff and Holden, A little more context here probably helps. I am working on implementing the idea from this article to make reads from S3 faster: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 (although my name is Pedro, I am not the author of the article). The

Re: Call Scala API from PySpark

2016-06-30 Thread Holden Karau
So I'm a little biased - I think the best bridge between the two is using DataFrames. I've got some examples in my talk and on the high performance spark GitHub https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py calls
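
A minimal sketch of the DataFrame bridge Holden points at (the linked simple_perf_test.py has the real examples): DataFrames share one JVM representation on both sides, so the hand-off is just the underlying Java object. The Scala object and method here (com.example.Enrich.addColumns) are hypothetical.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, DataFrame

    sc = SparkContext(master="local[*]", appName="dataframe-bridge-sketch")
    sqlContext = SQLContext(sc)

    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Pass the underlying Java DataFrame (df._jdf) into Scala ...
    result_jdf = sc._jvm.com.example.Enrich.addColumns(df._jdf)

    # ... and wrap the returned Java DataFrame back into a Python DataFrame
    result = DataFrame(result_jdf, sqlContext)
    result.show()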

Re: Call Scala API from PySpark

2016-06-30 Thread Jeff Zhang
Hi Pedro, Your use case is interesting. I think launching the Java gateway is the same as for the native SparkContext; the only difference is creating your custom SparkContext instead of the native one. You might also need to wrap it using Java.
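
A rough sketch of what Jeff describes, assuming the custom Scala context exposes the plain SparkContext underneath it. PySpark's SparkContext constructor accepts a pre-built gateway and JavaSparkContext (the gateway and jsc arguments), so the custom object can be created on the JVM first; the Scala-side class and method names here are hypothetical.

    from pyspark import SparkContext
    from pyspark.java_gateway import launch_gateway

    # Launch the same Py4J gateway PySpark would launch for itself
    gateway = launch_gateway()
    jvm = gateway.jvm

    # Build the JVM-side conf and the custom context first (class name hypothetical)
    jconf = jvm.org.apache.spark.SparkConf(True) \
        .setMaster("local[*]").setAppName("custom-context-sketch")
    custom_ctx = jvm.com.example.CustomSparkContext(jconf)

    # Wrap whatever plain SparkContext the custom class exposes (method name hypothetical)
    jsc = jvm.org.apache.spark.api.java.JavaSparkContext(custom_ctx.sparkContext())

    # Hand the pre-built gateway and JavaSparkContext to PySpark; master/appName here
    # only satisfy PySpark's own config checks
    sc = SparkContext(master="local[*]", appName="custom-context-sketch",
                      gateway=gateway, jsc=jsc)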

Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Hi All, I have written a Scala package which essentially wraps the SparkContext in a custom class that adds some functionality specific to our internal use case. I am trying to figure out the best way to call this from PySpark. I would like to do this similarly to how Spark itself calls the
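
A minimal sketch of the simplest route discussed elsewhere in the thread: reuse the Py4J gateway that PySpark already maintains and reach the custom Scala class through sc._jvm. The package and class names (com.example.CustomContext) are hypothetical placeholders for the internal wrapper.

    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="scala-bridge-sketch")

    # sc._jvm is the Py4J view of the JVM; sc._jsc.sc() is the underlying Scala SparkContext
    custom_ctx = sc._jvm.com.example.CustomContext(sc._jsc.sc())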