Thanks for the suggestions and links. The problem arises when I use the DataFrame API to write, but it works fine when doing an insert overwrite into the Hive table.
# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name))

# Doesn't work, throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt', 'c_dt').parquet("/path/to/file/")

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar <p...@infynyxx.com> wrote:

> If you are running on a 64-bit JVM with less than 32G heap, you might want
> to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
> generating more than 2^31-1 arrays, you might have to rethink your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak <bkpat...@mtu.edu> wrote:
>
>> Hi,
>>
>> I am reading a Parquet file of around 50+ GB which has 4013 partitions and
>> 240 columns. Below is my configuration:
>>
>> driver: 20G memory with 4 cores
>> executors: 45 executors with 15G memory and 4 cores
>>
>> I tried to read the data both with a DataFrame read and with a Hive context
>> using Hive SQL, but in both cases it throws the error below with no further
>> description.
>>
>> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partion/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> # Executing /bin/sh -c "kill -9 16953"...
>>
>> What could be wrong here? I think increasing memory alone will not help,
>> since it has already hit the array size limit.
>>
>> Thanks,
>> Bijay
>
>
> --
> --
> Cheers,
> Praj
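
For reference, one way to pass the -XX:+UseCompressedOops flag suggested above to both the driver and the executors is at spark-submit time; this is only a sketch, and the job script name is a placeholder:

spark-submit \
  --driver-java-options "-XX:+UseCompressedOops" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
  your_job.py   # placeholder for the actual PySpark script

Note that on 64-bit JVMs with heaps under 32G (the 20G driver and 15G executors here qualify), compressed oops is typically already enabled by default on recent JDKs, so the flag may not change anything.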
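
On the failing DataFrame write itself, one thing sometimes tried for partitioned Parquet writes (not suggested in the thread, just a hedged, untested sketch) is repartitioning on the partition columns before the partitioned write, so each task writes to fewer (e_dt, c_dt) partitions at once; repartitioning by columns requires Spark 1.6+:

# Untested sketch: shuffle rows so each task handles only a few
# (e_dt, c_dt) combinations, reducing the number of open Parquet
# writers and their buffers per task.
df.repartition('e_dt', 'c_dt') \
    .write.mode('overwrite') \
    .partitionBy('e_dt', 'c_dt') \
    .parquet("/path/to/file/")

Whether this helps depends on how many distinct (e_dt, c_dt) values the data contains and how skewed they are.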