Re: Optimisation advice for Avro-Parquet merge job

2015-06-12 Thread James Aley
Hey Kiran, Thanks very much for the response. I left for vacation before I could try this out, but I'll experiment once I get back and let you know how it goes. Thanks! James. On 8 June 2015 at 12:34, kiran lonikar loni...@gmail.com wrote: It turns out my assumption on load and unionAll …

Re: Optimisation advice for Avro-Parquet merge job

2015-06-08 Thread kiran lonikar
James, As far as I can see, there are three distinct parts to your program: the for loop, the synchronized block, and the final outputFrame.save statement. Can you do a separate timing measurement by putting a simple System.currentTimeMillis() around each of these blocks, to know how much time they are taking, and then …
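For illustration, a minimal sketch of the timing measurement suggested here, assuming a Spark 1.3/1.4-era standalone Java driver using the spark-avro package; the paths and names (inputPaths, outputFrame, MergeTiming) are placeholders rather than James's actual code, and the synchronized block from the real program is omitted:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class MergeTiming {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "avro-parquet-merge-timing");
        SQLContext sqlContext = new SQLContext(sc.sc());

        // Hypothetical input: one Avro directory per command-line argument (at least one expected).
        String[] inputPaths = args;
        DataFrame outputFrame = null;

        long t0 = System.currentTimeMillis();
        for (String path : inputPaths) {                      // the "for loop" part
            DataFrame df = sqlContext.load(path, "com.databricks.spark.avro");
            outputFrame = (outputFrame == null) ? df : outputFrame.unionAll(df);
        }
        long t1 = System.currentTimeMillis();

        outputFrame.save("/tmp/merged.parquet", "parquet");   // the final save part
        long t2 = System.currentTimeMillis();

        System.out.println("load/unionAll loop took " + (t1 - t0) + " ms");
        System.out.println("final save took " + (t2 - t1) + " ms");
        sc.stop();
    }
}

As the other message in this thread points out, load and unionAll are lazy transformations, so the first timing mostly reflects query planning; nearly all of the real work shows up under the save.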

Re: Optimisation advice for Avro-Parquet merge job

2015-06-08 Thread kiran lonikar
It turns out my assumption about load and unionAll being blocking is not correct. They are transformations. So instead of running only the load and unionAll in the run() methods, I think you will have to save the intermediate dfInput[i] to temp (parquet) files (possibly to an in-memory DFS like …
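A rough sketch of what that suggestion could look like, again assuming the Spark 1.3-era Java DataFrame API and the spark-avro package; the class and variable names (SaveIntermediate, dfInput, inputPath, tempPath) are invented for illustration:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

class SaveIntermediate implements Runnable {
    private final SQLContext sqlContext;
    private final String inputPath;   // one of the small Avro input directories
    private final String tempPath;    // temporary Parquet location for this slice

    SaveIntermediate(SQLContext sqlContext, String inputPath, String tempPath) {
        this.sqlContext = sqlContext;
        this.inputPath = inputPath;
        this.tempPath = tempPath;
    }

    @Override
    public void run() {
        // load() is lazy; it is the save() that actually triggers the Spark job for this slice,
        // so running several of these Runnables concurrently does real work in parallel.
        DataFrame dfInput = sqlContext.load(inputPath, "com.databricks.spark.avro");
        dfInput.save(tempPath, "parquet");
    }
}

Each such Runnable could be submitted to an ExecutorService; the driver would then read the temporary Parquet files back, unionAll them, and perform the final save.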

Re: Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread James Aley
Thanks for the confirmation! We're quite new to Spark, so a little reassurance is a good thing to have sometimes :-) The thing that's concerning me at the moment is that my job doesn't seem to run any faster with more compute resources added to the cluster, and this is proving a little tricky to …

Re: Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread Eugen Cepoi
Hi, On 2015-06-04 at 15:29 GMT+02:00, James Aley james.a...@swiftkey.com wrote: Hi, We have a load of Avro data coming into our data systems in the form of relatively small files, which we're merging into larger Parquet files with Spark. I've been following the docs and the approach I'm taking seemed …

Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread James Aley
Hi, We have a load of Avro data coming into our data systems in the form of relatively small files, which we're merging into larger Parquet files with Spark. I've been following the docs and the approach I'm taking seemed fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's …
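The message is truncated in this archive view, but for context, a minimal sketch of the straightforward approach it describes (read many small Avro files, write them out as one larger Parquet dataset), assuming Spark 1.3/1.4 with the spark-avro package; the paths and the repartition factor are placeholders, and the truncated text does not show how the original job actually controlled output file sizes:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class AvroToParquetMerge {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "avro-parquet-merge");
        SQLContext sqlContext = new SQLContext(sc.sc());

        // Placeholder paths: a directory of small Avro files in, one Parquet dataset out.
        DataFrame input = sqlContext.load("/data/incoming/avro", "com.databricks.spark.avro");

        // repartition() is just one way to end up with fewer, larger output files.
        input.repartition(16).save("/data/merged/parquet", "parquet");

        sc.stop();
    }
}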