SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
Hi,

I am reading a Parquet table of around 50+ GB which has 4013 partitions with 240 columns. Below is my configuration:

driver: 20 GB memory with 4 cores
executors: 45 executors with 15 GB memory and 4 cores each

I tried to read the data both with the DataFrame reader and with a Hive context running Hive SQL, but in both cases it throws the error below with no further description:

hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
sqlcontext.read.parquet("/path/to/partion/")

#
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 16953"...

What could be wrong here? I don't think increasing memory alone will help in this case, since it has already hit the array size limit.

Thanks,
Bijay
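P.S. In case the submit settings matter: the configuration above corresponds roughly to a spark-submit invocation along these lines (the script name is just a placeholder, and --num-executors assumes YARN):

spark-submit \
  --driver-memory 20g \
  --driver-cores 4 \
  --num-executors 45 \
  --executor-memory 15g \
  --executor-cores 4 \
  my_parquet_job.py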
Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
Have you seen this thread?
http://search-hadoop.com/m/q3RTtyXr2N13hf9O&subj=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit
Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
If you are running on a 64-bit JVM with less than 32 GB of heap, you might want to enable -XX:+UseCompressedOops [1]. And if your DataFrame is somehow producing a single array with more than 2^31-1 elements, you might have to rethink your options.

[1] https://spark.apache.org/docs/latest/tuning.html

--
Cheers,
Praj
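P.S. If you want to try that flag, one way to pass it to the executors and the driver at submit time is along these lines (script name is a placeholder):

spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
  --driver-java-options "-XX:+UseCompressedOops" \
  my_parquet_job.py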
Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
Thanks for the suggestions and links.

The problem arises when I use the DataFrame API to write, but it works fine when doing an insert overwrite into the Hive table.

# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name))

# Doesn't work, throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt', 'c_dt').parquet("/path/to/file/")

Thanks,
Bijay
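P.S. One workaround that might be worth trying (untested on this dataset) is to repartition the DataFrame by the output partition columns before the write, so that each task writes to fewer Parquet partitions at once and keeps fewer open column writers in memory. A rough sketch:

# Repartition by the output partition columns first; 400 is an arbitrary
# placeholder for the shuffle partition count, not a recommendation.
df_repart = df.repartition(400, 'e_dt', 'c_dt')
df_repart.write.mode('overwrite').partitionBy('e_dt', 'c_dt').parquet("/path/to/file/")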