Hello there, I am trying to write a Spark program that loads multiple JSON files (with undefined schemas) into a DataFrame and then writes it out to a Parquet file. When doing so, I run into a number of garbage collection issues because my JVM keeps running out of heap space. The heap space I have on both the driver and the executors seems reasonable given the size of the files I am reading in. Could anyone give me some insight into whether this could be caused by the schema inference that the Parquet conversion is doing (am I trying to do something it is not designed to do), or whether I should be looking elsewhere for performance issues? If it is due to schema inference, is there anything I should be looking at that might alleviate the pressure on the garbage collector and the heap?
<code snippet>
import org.apache.hadoop.fs.Path

// Read every JSON file under the directory as text, let Spark infer the schema, then write Parquet
val fileRdd = sc.textFile(new Path("hdfsDirectory/*").toString)
val df = sqlContext.read.json(fileRdd)
df.write.parquet("hdfsDirectory/file.parquet")
</code snippet>

Each JSON file contains a single object, and the schema can vary from file to file. The files are roughly 3-5 KB each. The JVM heap for the driver and each executor is around 1 GB, and the number of files I am trying to read runs to several thousand. One approach I had considered was merging these files into a single JSON document, loading the merged document into a DataFrame, and then writing out the Parquet file, but I am not sure that would be of any benefit. Any advice or guidance that anyone could provide would be greatly appreciated.
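In case it helps frame what I am asking: is supplying an explicit schema up front (or lowering the sampling ratio used for inference) the kind of thing I should be trying? A rough sketch of what I have in mind is below; the field names are made up, since my real documents vary in structure.

<code snippet>
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Hypothetical schema -- my actual documents have different (and varying) fields
val explicitSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("payload", StringType, nullable = true),
  StructField("timestamp", LongType, nullable = true)
))

val fileRdd = sc.textFile(new Path("hdfsDirectory/*").toString)

// Option A: skip inference entirely by providing the schema to the reader
val dfWithSchema = sqlContext.read.schema(explicitSchema).json(fileRdd)

// Option B: keep inference, but only sample a fraction of the records to infer from
val dfSampled = sqlContext.read.option("samplingRatio", "0.1").json(fileRdd)
</code snippet>

I am not sure either of these is the right direction given that my schemas differ between files, which is partly why I am asking.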