What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
I have a job that is running into intermittent errors with [SparkDriver] java.lang.OutOfMemoryError: Java heap space. Before I was getting this error I was getting errors saying the result size exceeded spark.driver.maxResultSize. This does not make any sense to me, as there are no actions in
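Context for the question: spark.driver.maxResultSize caps the total serialized size of results that a Spark action ships back to the driver. reduceByKey itself is a transformation that runs on the executors, so it is usually the action that follows it (collect, take, countByKey, and so on) that trips the limit, and in some 1.x releases task results in general counted toward it. A minimal local sketch of what reduceByKey computes, in plain Python (the function name here is illustrative, not Spark API):

```python
from functools import reduce
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Local analogue of RDD.reduceByKey: combine all values per key with func."""
    acc = defaultdict(list)
    for k, v in pairs:
        acc[k].append(v)
    return {k: reduce(func, vs) for k, vs in acc.items()}

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # {'a': 4, 'b': 2}
```

In Spark this combining happens per partition on the executors; only whatever the subsequent action returns to the driver is measured against maxResultSize.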

java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
Hi, I have a problem trying to get a fairly simple app working which makes use of native avro libraries. The app runs fine on my local machine and in yarn-cluster mode, but when I try to run it on EMR yarn-client mode I get the error below. I'm aware this is a version problem, as EMR runs an
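A runtime NoSuchMethodError under yarn-client mode almost always means the cluster's bundled copy of a library (here Avro, which EMR's Hadoop distribution ships) is shadowing the version the app was built against. One common mitigation, hedged because these flags were experimental and their names varied across Spark 1.x releases, is to ask Spark to prefer the user's jars; the jar name below is a placeholder:

```shell
# Prefer the application's own jars over the cluster's bundled versions.
# (Experimental in Spark 1.x; check the flag names for your exact release.)
spark-submit \
  --master yarn-client \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  my-app.jar
```

The other common fix is shading: relocate the conflicting packages (e.g. org.apache.avro) with the Maven Shade plugin or sbt-assembly so the clash cannot occur at all.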

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
> Aniket
>
> On Wed, Sep 9, 2015, 2:11 PM Tom Seddon <mr.tom.sed...@gmail.com> wrote:
>> Hi,
>>
>> I have a problem trying to get a fairly simple app working which makes
>> use of native avro libraries. The app runs fine on my local machine and in

Re: SparkSQL DF.explode with Nulls

2015-06-05 Thread Tom Seddon
at 12:05 PM Tom Seddon mr.tom.sed...@gmail.com wrote: Hi, I've worked out how to use explode on my input avro dataset with the following structure:

root
 |-- pageViewId: string (nullable = false)
 |-- components: array (nullable = true)
 |    |-- element: struct (containsNull = false

SparkSQL DF.explode with Nulls

2015-06-04 Thread Tom Seddon
Hi, I've worked out how to use explode on my input avro dataset with the following structure:

root
 |-- pageViewId: string (nullable = false)
 |-- components: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |--
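The null issue in a nutshell: when components is null for a row, explode has nothing to iterate over, and depending on the Spark 1.x version this either silently drops the row or fails, so nulls typically need handling before (or during) the explode. A plain-Python sketch of the semantics (function and field names are illustrative, not Spark API) that skips null arrays:

```python
def explode_rows(rows, field):
    """Local sketch of DataFrame.explode: one output row per array element.
    Rows where the array is None (null) produce zero output rows."""
    out = []
    for row in rows:
        values = row.get(field)
        if values is None:
            continue  # null array -> no output rows, instead of an error
        for v in values:
            flat = dict(row)
            flat[field] = v
            out.append(flat)
    return out

rows = [
    {"pageViewId": "p1", "components": [{"name": "hdr"}, {"name": "ftr"}]},
    {"pageViewId": "p2", "components": None},  # null array
]
print(explode_rows(rows, "components"))  # two rows for p1, none for p2
```

In actual Spark the equivalent guard is to filter out or coalesce null arrays (e.g. replace null with an empty array) before calling explode.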

PySpark saveAsTextFile gzip

2015-01-15 Thread Tom Seddon
Hi, I've searched but can't seem to find a PySpark example. How do I write compressed text file output to S3 using PySpark saveAsTextFile? Thanks, Tom
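For reference, PySpark's saveAsTextFile accepts a Hadoop compression codec class name. A hedged sketch (the wrapper function is illustrative, but compressionCodecClass is the documented parameter in Spark 1.x PySpark):

```python
# Hadoop codec class for gzip output; BZip2Codec/SnappyCodec work the same way.
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"

def save_gzipped(rdd, path, codec=GZIP_CODEC):
    """Write an RDD as gzip-compressed text part files.

    `path` can be a local path, HDFS path, or an s3:// / s3n:// URL.
    Illustrative helper: it just forwards to PySpark's
    RDD.saveAsTextFile(path, compressionCodecClass=...).
    """
    rdd.saveAsTextFile(path, compressionCodecClass=codec)
```

Each output partition then becomes a `part-NNNNN.gz` file under the target path.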

Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to just read the same cached RDD twice? I have my first RDD which is just a json
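The usual PySpark pattern here is to cache the parsed RDD once and run one filter-plus-save per event type; a true single-pass split requires a custom Hadoop output format (e.g. MultipleTextOutputFormat via saveAsHadoopFile). A small local sketch of the routing logic (field and function names are illustrative):

```python
from collections import defaultdict

def split_by_event_type(records):
    """One pass over the input, routing each record to a per-type bucket.

    In PySpark the cached-RDD equivalent is:
        parsed.cache()
        for etype in types:
            parsed.filter(lambda r: r["event_type"] == etype).saveAsTextFile(...)
    which re-traverses the cached data once per type rather than re-reading it.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec.get("event_type", "unknown")].append(rec)
    return dict(buckets)

events = [{"event_type": "click", "id": 1},
          {"event_type": "view", "id": 2},
          {"event_type": "click", "id": 3}]
print(sorted(split_by_event_type(events)))  # ['click', 'view']
```

Whether filter-per-type or a custom output format wins depends on how many event types there are and whether the cached RDD fits in memory.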

Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-11-11 Thread Tom Seddon
Hi, Just wondering if anyone has any advice about this issue, as I am experiencing the same thing. I'm working with multiple broadcast variables in PySpark, most of which are small, but one of around 4.5GB, using 10 workers at 31GB memory each and driver with same spec. It's not running out of

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-11-11 Thread Tom Seddon
Yes please, can you share? I am getting this error after expanding my application to include a large broadcast variable. It would be good to know if it can be fixed with configuration. On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com wrote: Can you list what your fix was so