If you are running on a 64-bit JVM with less than 32G of heap, you might want to enable -XX:+UseCompressedOops [1]. And if your DataFrame is somehow allocating a single array with more than 2^31-1 elements, you might have to rethink your options.
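For example, you could pass the flag through the executor Java options when building the context. A rough sketch in PySpark (assuming a 1.x-style HiveContext as in your snippet; the driver flag has to go on the spark-submit command line instead, since the driver JVM is already running by the time the conf is read):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    # Enable compressed oops on the executor JVMs (only useful for heaps < 32G).
    conf = SparkConf().set("spark.executor.extraJavaOptions",
                           "-XX:+UseCompressedOops")
    sc = SparkContext(conf=conf)
    hive_context = HiveContext(sc)

    # For the driver, pass the flag at submit time instead, e.g.
    #   spark-submit --driver-java-options "-XX:+UseCompressedOops" ...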
[1] https://spark.apache.org/docs/latest/tuning.html

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak <bkpat...@mtu.edu> wrote:
> Hi,
>
> I am reading a parquet file of around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration:
>
> driver: 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data both with a DataFrame read and with the hive context
> using hive SQL, but in both cases it throws the error below with no further
> description.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
> What could be wrong here? I think increasing memory alone will not help
> in this case since it has reached the array size limit.
>
> Thanks,
> Bijay

--
Cheers,
Praj