Hey Kiran,
Thanks very much for the response. I left for vacation before I could try
this out, but I'll experiment once I get back and let you know how it goes.
Thanks!
James.
On 8 June 2015 at 12:34, kiran lonikar loni...@gmail.com wrote:
James,
As I can see, there are three distinct parts to your program:
- for loop
- synchronized block
- final outputFrame.save statement
Can you do a separate timing measurement by putting a simple
System.currentTimeMillis() around these blocks, to see how much time each is
taking, and then
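A minimal sketch of that measurement in plain Java (no Spark needed; the timeMillis helper and the block names are illustrative, not from James's actual program):

```java
// Sketch of the timing measurement Kiran suggests: wrap each of the three
// blocks (for loop, synchronized block, outputFrame.save) in a helper that
// reports elapsed wall-clock time.
public class Timing {
    // Runs the given block and returns the elapsed wall-clock time in ms.
    static long timeMillis(Runnable block) {
        long start = System.currentTimeMillis();
        block.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long forLoopMs = timeMillis(() -> {
            // ... the for loop building dfInput[i] would go here ...
        });
        System.out.println("for loop took " + forLoopMs + " ms");
        // Repeat for the synchronized block and the final save statement.
    }
}
```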
It turns out my assumption about load and unionAll being blocking is not
correct. They are transformations. So instead of running only the load and
unionAll in the run() methods, I think you will have to save the intermediate
dfInput[i] to temp (parquet) files (possibly to an in-memory DFS like
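For context on why "they are transformations" matters: Spark transformations only describe work, and nothing executes until an action forces it. The same pattern shows up with intermediate operations on a Java stream, so here is a plain-Java analogy (no Spark; the names are illustrative):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Like load/unionAll in Spark, an intermediate map() only *describes*
        // work; the lambda has not run yet.
        Stream<Integer> pipeline = List.of(1, 2, 3).stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; });

        System.out.println("after map: " + calls.get());   // prints 0

        // Only a terminal operation (analogous to a Spark action such as
        // save) actually executes the pipeline.
        pipeline.collect(Collectors.toList());
        System.out.println("after collect: " + calls.get());   // prints 3
    }
}
```

This is why timing just the code that calls load and unionAll understates the real cost: the work is deferred until an action (such as the save) triggers it.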
Thanks for the confirmation! We're quite new to Spark, so a little
reassurance is a good thing to have sometimes :-)
The thing that's concerning me at the moment is that my job doesn't seem to
run any faster with more compute resources added to the cluster, and this
is proving a little tricky to
Hi
2015-06-04 15:29 GMT+02:00 James Aley james.a...@swiftkey.com:
Hi,
We have a load of Avro data coming into our data systems in the form of
relatively small files, which we're merging into larger Parquet files with
Spark. I've been following the docs and the approach I'm taking seemed
fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's