Re: Process large JSON file without causing OOM

2017-11-21 Thread Alec Swan
Pinging back to see if anybody could provide me with some pointers on how to stream/batch JSON-to-ORC conversion in Spark SQL, or why I get an OOM with such a small memory footprint? Thanks, Alec On Wed, Nov 15, 2017 at 11:03 AM, Alec Swan wrote: > Thanks Steve and Vadim
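
One way to batch the conversion, sketched below on the assumption that the input arrives as a directory of JSON files (paths, app name and settings are placeholders, not taken from the thread), is to convert one file at a time so only a single file's worth of data is in flight:

    // Rough sketch: convert one input file at a time instead of pointing
    // the reader at the whole directory at once. Paths are hypothetical.
    import java.io.File;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class BatchedJsonToOrc {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("json-to-orc-batched")
                    .master("local[*]")
                    .getOrCreate();

            // Assumes the input directory exists and contains only JSON files.
            File[] inputs = new File("/data/json-input").listFiles();
            for (File in : inputs) {
                Dataset<Row> df = spark.read().json(in.getAbsolutePath());
                df.write()
                  .mode(SaveMode.Append)
                  .option("compression", "zlib")
                  .orc("/data/orc-output");
            }
            spark.stop();
        }
    }

Appending per file keeps the peak footprint closer to the largest single input rather than the whole dataset.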

Re: Process large JSON file without causing OOM

2017-11-15 Thread Alec Swan
Thanks Steve and Vadim for the feedback. @Steve, are you suggesting creating a custom receiver and somehow piping it through Spark Streaming/Spark SQL? Or are you suggesting creating smaller datasets from the stream and using my original code to process smaller datasets? It'd be very helpful for
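
One possible reading of the Spark Streaming suggestion is a file-based Structured Streaming job that picks up JSON files from a landing directory and writes ORC incrementally. A rough sketch, with made-up paths and schema (streaming file sources need an explicit schema):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.types.StructType;

    public class StreamingJsonToOrc {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("streaming-json-to-orc")
                    .master("local[*]")
                    .getOrCreate();

            // Placeholder schema; the real one would mirror the JSON records.
            StructType schema = new StructType()
                    .add("id", "string")
                    .add("payload", "string");

            Dataset<Row> json = spark.readStream()
                    .schema(schema)
                    .json("/data/json-input");           // hypothetical landing dir

            StreamingQuery query = json.writeStream()
                    .format("orc")
                    .option("path", "/data/orc-output")              // hypothetical output
                    .option("checkpointLocation", "/data/checkpoints")
                    .start();

            query.awaitTermination();
        }
    }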

Re: Process large JSON file without causing OOM

2017-11-15 Thread Vadim Semenov
There's a lot of off-heap memory involved in decompressing Snappy and compressing ZLib. Since you're running with `local[*]`, you process multiple tasks simultaneously, so they all might consume memory. I don't think that increasing heap will help, since it looks like you're hitting system memory
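
A minimal sketch of capping concurrency so fewer Snappy/ZLib native buffers are live at once; the thread count and partition setting below are illustrative, not values from the original setup:

    import org.apache.spark.sql.SparkSession;

    public class LowConcurrencySession {
        public static SparkSession build() {
            return SparkSession.builder()
                    .appName("json-to-orc-low-concurrency")
                    .master("local[2]")                          // at most 2 concurrent tasks
                    .config("spark.sql.shuffle.partitions", "8") // smaller shuffles in local mode
                    .getOrCreate();
        }
    }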

Re: Process large JSON file without causing OOM

2017-11-15 Thread Steve Loughran
On 14 Nov 2017, at 15:32, Alec Swan wrote: But I wonder if there is a way to stream/batch the content of the JSON file in order to convert it to ORC piecemeal and avoid reading the whole JSON file in memory in the first place? That is what
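
The original reader settings aren't shown in the thread, but one common way a whole JSON file ends up in memory is reading it as a single multi-line document; with line-delimited JSON (one object per line) Spark can parse record by record. A sketch with a placeholder path:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PiecemealRead {
        public static Dataset<Row> read(SparkSession spark) {
            // multiLine=true forces Spark to materialize each file as one record;
            // the default (one JSON object per line) lets it parse line by line.
            return spark.read()
                    .option("multiLine", "false")
                    .json("/data/events.json");   // placeholder path
        }
    }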

Re: Process large JSON file without causing OOM

2017-11-14 Thread Alec Swan
Thanks all. I am not submitting a spark job explicitly. Instead, I am using the Spark library functionality embedded in my web service as shown in the code I included in the previous email. So, effectively Spark SQL runs in the web service's JVM. Therefore, the --driver-memory option would not (and
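
Since the driver here is the web service's own JVM, the heap that matters is whatever -Xmx the service was started with; setting spark.driver.memory after that JVM is already running has no effect in local mode. A small sanity check, with a placeholder launch command in the comment:

    // Rough sketch: verify how much heap the embedding JVM actually has.
    // The heap itself is set when the service starts, e.g. (placeholder):
    //   java -Xmx10g -jar my-web-service.jar
    public class HeapCheck {
        public static void main(String[] args) {
            long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            System.out.println("Driver (service JVM) max heap: " + maxHeapMb + " MB");
        }
    }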

Re: Process large JSON file without causing OOM

2017-11-13 Thread Sonal Goyal
If you are running Spark with local[*] as master, there will be a single process whose memory will be controlled by the --driver-memory command line option to spark-submit. Check http://spark.apache.org/docs/latest/configuration.html: spark.driver.memory (default 1g) is the amount of memory to use for the driver
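
If the conversion were submitted as a separate Spark application rather than embedded in the service, the driver-memory setting does take effect because it is applied before the driver JVM starts. A sketch using SparkLauncher, with placeholder jar and class names:

    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchConversion {
        public static void main(String[] args) throws Exception {
            Process spark = new SparkLauncher()
                    .setAppResource("/opt/jobs/json-to-orc.jar")   // placeholder jar
                    .setMainClass("com.example.JsonToOrcJob")      // placeholder class
                    .setMaster("local[*]")
                    .setConf(SparkLauncher.DRIVER_MEMORY, "4g")    // same as --driver-memory 4g
                    .launch();
            spark.waitFor();
        }
    }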

Re: Process large JSON file without causing OOM

2017-11-13 Thread vaquar khan
https://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory Regards, Vaquar khan On Mon, Nov 13, 2017 at 6:22 PM, Alec Swan wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format. Effectively, my Java

Re: Process large JSON file without causing OOM

2017-11-13 Thread Alec Swan
Hi Joel, Here are the relevant snippets of my code and an OOM error thrown in frameWriter.save(..). Surprisingly, the heap dump is pretty small (~60MB) even though I am running with -Xmx10G and 4G of executor and driver memory, as shown below. SparkConf sparkConf = new SparkConf()
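
The actual snippet is truncated above; purely as an illustration of the general shape such code takes (paths, app name and option values below are placeholders, not the original code):

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.DataFrameWriter;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class JsonToOrcService {
        public void convert(String inputPath, String outputPath) {
            SparkConf sparkConf = new SparkConf()
                    .setAppName("json-to-orc")
                    .setMaster("local[*]")
                    .set("spark.driver.memory", "4g")     // little effect once this JVM is up
                    .set("spark.executor.memory", "4g");  // in local mode executors share this JVM

            SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();

            Dataset<Row> df = spark.read().json(inputPath);
            DataFrameWriter<Row> frameWriter = df.write()
                    .mode(SaveMode.Overwrite)
                    .format("orc")
                    .option("compression", "zlib");
            frameWriter.save(outputPath);   // the thread reports the OOM from a save(..) call like this
        }
    }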

Re: Process large JSON file without causing OOM

2017-11-13 Thread Joel D
Have you tried increasing driver and executor memory (and GC overhead settings, if required)? Your code snippet and stack trace will be helpful. On Mon, Nov 13, 2017 at 7:23 PM Alec Swan wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format.

Process large JSON file without causing OOM

2017-11-13 Thread Alec Swan
Hello, I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB format. Effectively, my Java service starts up an embedded Spark cluster (master=local[*]) and uses Spark SQL to convert JSON to ORC. However, I keep getting OOM errors with large (~1GB) files. I've tried different ways