Pinging back to see if anybody could provide me with some pointers on how to stream/batch the JSON-to-ORC conversion in Spark SQL, or explain why I get an OOM when the heap dump shows such a small memory footprint?
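In case a concrete sketch helps frame the question: the fallback I keep coming back to is pre-splitting the (already decompressed) newline-delimited JSON into smaller files and converting each one separately. A stdlib-only sketch of that splitter (file names and chunk size are hypothetical, and it assumes one JSON record per line):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsonSplitter {

    // Split a newline-delimited JSON file into chunk files of at most
    // maxLines records each, so each chunk can be converted to ORC on its own.
    public static int split(Path source, Path outDir, int maxLines) throws IOException {
        Files.createDirectories(outDir);
        int chunk = 0;
        try (BufferedReader reader = Files.newBufferedReader(source)) {
            String line = reader.readLine();
            while (line != null) {
                Path out = outDir.resolve("chunk-" + chunk + ".json");
                try (BufferedWriter writer = Files.newBufferedWriter(out)) {
                    int written = 0;
                    while (line != null && written < maxLines) {
                        writer.write(line);
                        writer.newLine();
                        written++;
                        line = reader.readLine();
                    }
                }
                chunk++;
            }
        }
        return chunk; // number of chunk files produced
    }

    public static void main(String[] args) throws IOException {
        // Tiny self-contained demo: 10 records, 4 per chunk -> 3 files.
        Path src = Files.createTempFile("events", ".json");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append("{\"id\":").append(i).append("}\n");
        }
        Files.write(src, sb.toString().getBytes());
        Path dir = Files.createTempDirectory("chunks");
        System.out.println(split(src, dir, 4)); // prints 3
    }
}
```

Each chunk would then go through the same `spark.read().json(...).write().orc(...)` path as before, just on a bounded amount of data at a time.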
Thanks, Alec

On Wed, Nov 15, 2017 at 11:03 AM, Alec Swan <alecs...@gmail.com> wrote:

> Thanks Steve and Vadim for the feedback.
>
> @Steve, are you suggesting creating a custom receiver and somehow piping
> it through Spark Streaming/Spark SQL? Or are you suggesting creating
> smaller datasets from the stream and using my original code to process
> smaller datasets? It'd be very helpful for a novice, like myself, if you
> could provide code samples or links to docs/articles.
>
> @Vadim, I ran my test with local[1] and got OOM in the same place. What
> puzzles me is that when I inspect the heap dump with VisualVM (see below)
> it says that the heap is pretty small, ~35MB. I am running my test with
> "-Xmx10G -Dspark.executor.memory=6g -Dspark.driver.memory=6g" JVM opts and
> I can see them reflected in Spark UI. Am I missing some memory settings?
>
> Date taken: Wed Nov 15 10:46:06 MST 2017
> File: /tmp/java_pid69786.hprof
> File size: 59.5 MB
>
> Total bytes: 39,728,337
> Total classes: 15,749
> Total instances: 437,979
> Classloaders: 123
> GC roots: 2,831
> Number of objects pending for finalization: 5,198
>
> Thanks,
>
> Alec
>
> On Wed, Nov 15, 2017 at 11:15 AM, Vadim Semenov
> <vadim.seme...@datadoghq.com> wrote:
>
>> There's a lot of off-heap memory involved in decompressing Snappy,
>> compressing ZLib.
>>
>> Since you're running using `local[*]`, you process multiple tasks
>> simultaneously, so they all might consume memory.
>>
>> I don't think that increasing heap will help, since it looks like you're
>> hitting system memory limits.
>>
>> I'd suggest trying to run with `local[2]` and checking what's the memory
>> usage of the jvm process.
>>
>> On Mon, Nov 13, 2017 at 7:22 PM, Alec Swan <alecs...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB
>>> format. Effectively, my Java service starts up an embedded Spark cluster
>>> (master=local[*]) and uses Spark SQL to convert JSON to ORC.
>>> However, I keep getting OOM errors with large (~1GB) files.
>>>
>>> I've tried different ways to reduce memory usage, e.g. by partitioning
>>> data with dataSet.partitionBy("customer").save(filePath), or capping
>>> memory usage by setting spark.executor.memory=1G, but to no avail.
>>>
>>> I am wondering if there is a way to avoid OOM besides splitting the
>>> source JSON file into multiple smaller ones and processing the small
>>> ones individually? Does Spark SQL have to read the JSON/Snappy
>>> (row-based) file in its entirety before converting it to ORC (columnar)?
>>> If so, would it make sense to create a custom receiver that reads the
>>> Snappy file and use Spark Streaming for ORC conversion?
>>>
>>> Thanks,
>>>
>>> Alec
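One thing worth noting on the memory settings above: with master=local[*] there are no separate executor JVMs, the whole pipeline runs inside the one driver JVM, so `spark.executor.memory` has (as far as I understand) no effect, and the usable heap is simply whatever `-Xmx` the service JVM was started with. The Snappy/ZLib buffers Vadim mentioned live off-heap on top of that, which would explain a small heap dump alongside an OOM. A hypothetical spark-submit equivalent of the intended settings would look like:

```shell
# Sketch only; jar name and app class are placeholders.
# In local mode only driver memory (the JVM's -Xmx) matters;
# spark.executor.memory is effectively ignored.
spark-submit \
  --master local[2] \
  --driver-memory 6g \
  --class com.example.JsonToOrcConverter \
  json-to-orc.jar
```

When Spark is embedded in a service rather than launched via spark-submit, `--driver-memory` cannot be applied after the JVM is already running, so `-Xmx` on the service process is the setting that actually governs the heap.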