Pinging back to see if anybody could provide me with some pointers on how to stream/batch the JSON-to-ORC conversion in Spark SQL, or explain why I get an OOM when the heap dump shows such a small memory footprint?
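In case a concrete sketch helps frame the question: the fallback I keep coming back to is pre-splitting the (already decompressed) newline-delimited JSON into smaller files and converting each one separately. A stdlib-only sketch of that splitter (file names and chunk size are hypothetical, and it assumes one JSON record per line):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsonSplitter {

    // Split a newline-delimited JSON file into chunk files of at most
    // maxLines records each, so each chunk can be converted to ORC on its own.
    public static int split(Path source, Path outDir, int maxLines) throws IOException {
        Files.createDirectories(outDir);
        int chunk = 0;
        try (BufferedReader reader = Files.newBufferedReader(source)) {
            String line = reader.readLine();
            while (line != null) {
                Path out = outDir.resolve("chunk-" + chunk + ".json");
                try (BufferedWriter writer = Files.newBufferedWriter(out)) {
                    int written = 0;
                    while (line != null && written < maxLines) {
                        writer.write(line);
                        writer.newLine();
                        written++;
                        line = reader.readLine();
                    }
                }
                chunk++;
            }
        }
        return chunk; // number of chunk files produced
    }

    public static void main(String[] args) throws IOException {
        // Tiny self-contained demo: 10 records, 4 per chunk -> 3 files.
        Path src = Files.createTempFile("events", ".json");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append("{\"id\":").append(i).append("}\n");
        }
        Files.write(src, sb.toString().getBytes());
        Path dir = Files.createTempDirectory("chunks");
        System.out.println(split(src, dir, 4)); // prints 3
    }
}
```

Each chunk would then go through the same `spark.read().json(...).write().orc(...)` path as before, just on a bounded amount of data at a time.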
Thanks, Alec

On Wed, Nov 15, 2017 at 11:03 AM, Alec Swan <alecs...@gmail.com> wrote:

> Thanks Steve and Vadim for the feedback.
>
> @Steve, are you suggesting creating a custom receiver and somehow piping
> it through Spark Streaming/Spark SQL? Or are you suggesting creating
> smaller datasets from the stream and using my original code to process
> smaller datasets? It'd be very helpful for a novice, like myself, if you
> could provide code samples or links to docs/articles.
>
> @Vadim, I ran my test with local[1] and got OOM in the same place. What
> puzzles me is that when I inspect the heap dump with VisualVM (see below)
> it says that the heap is pretty small, ~35MB. I am running my test with
> "-Xmx10G -Dspark.executor.memory=6g -Dspark.driver.memory=6g" JVM opts and
> I can see them reflected in Spark UI. Am I missing some memory settings?
>
> Date taken: Wed Nov 15 10:46:06 MST 2017
> File: /tmp/java_pid69786.hprof
> File size: 59.5 MB
>
> Total bytes: 39,728,337
> Total classes: 15,749
> Total instances: 437,979
> Classloaders: 123
> GC roots: 2,831
> Number of objects pending for finalization: 5,198
>
> Thanks,
>
> Alec
>
> On Wed, Nov 15, 2017 at 11:15 AM, Vadim Semenov
> <vadim.seme...@datadoghq.com> wrote:
>
>> There's a lot of off-heap memory involved in decompressing Snappy,
>> compressing ZLib.
>>
>> Since you're running using `local[*]`, you process multiple tasks
>> simultaneously, so they all might consume memory.
>>
>> I don't think that increasing heap will help, since it looks like you're
>> hitting system memory limits.
>>
>> I'd suggest trying to run with `local[2]` and checking what's the memory
>> usage of the jvm process.
>>
>> On Mon, Nov 13, 2017 at 7:22 PM, Alec Swan <alecs...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB
>>> format. Effectively, my Java service starts up an embedded Spark cluster
>>> (master=local[*]) and uses Spark SQL to convert JSON to ORC.
>>> However, I keep getting OOM errors with large (~1GB) files.
>>>
>>> I've tried different ways to reduce memory usage, e.g. by partitioning
>>> data with dataSet.partitionBy("customer").save(filePath), or capping
>>> memory usage by setting spark.executor.memory=1G, but to no avail.
>>>
>>> I am wondering if there is a way to avoid OOM besides splitting the
>>> source JSON file into multiple smaller ones and processing the small
>>> ones individually? Does Spark SQL have to read the JSON/Snappy
>>> (row-based) file in its entirety before converting it to ORC (columnar)?
>>> If so, would it make sense to create a custom receiver that reads the
>>> Snappy file and use Spark Streaming for ORC conversion?
>>>
>>> Thanks,
>>>
>>> Alec
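One thing worth noting on the memory settings above: with master=local[*] there are no separate executor JVMs, the whole pipeline runs inside the one driver JVM, so `spark.executor.memory` has (as far as I understand) no effect, and the usable heap is simply whatever `-Xmx` the service JVM was started with. The Snappy/ZLib buffers Vadim mentioned live off-heap on top of that, which would explain a small heap dump alongside an OOM. A hypothetical spark-submit equivalent of the intended settings would look like:

```shell
# Sketch only; jar name and app class are placeholders.
# In local mode only driver memory (the JVM's -Xmx) matters;
# spark.executor.memory is effectively ignored.
spark-submit \
  --master local[2] \
  --driver-memory 6g \
  --class com.example.JsonToOrcConverter \
  json-to-orc.jar
```

When Spark is embedded in a service rather than launched via spark-submit, `--driver-memory` cannot be applied after the JVM is already running, so `-Xmx` on the service process is the setting that actually governs the heap.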