Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the
DataFrame API to write, but it works fine when doing an insert overwrite into
the Hive table.

# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name))

# Doesn't work, throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt','c_dt').parquet("/path/to/file/")
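
In case it helps, below is a rough, untested sketch of what I plan to try next
for the failing DataFrame write. It assumes the OOM comes from each task
keeping too many Parquet writers (one per output partition) and their column
buffers open at once; the repartition count of 400 is only a guess I would
tune for my data:

# Untested sketch: repartition by the Hive partition columns first, so each
# task writes to only a few (e_dt, c_dt) partitions instead of buffering
# Parquet writers for many partitions at the same time.
df_repart = df.repartition(400, 'e_dt', 'c_dt')  # 400 is an arbitrary starting point

(df_repart.write
    .mode('overwrite')
    .partitionBy('e_dt', 'c_dt')
    .parquet("/path/to/file/"))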

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar  wrote:

> If you are running on a 64-bit JVM with less than 32G of heap, you might
> want to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
> generating an array with more than 2^31-1 elements, you might have to
> rethink your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak 
> wrote:
>
>> Hi,
>>
>> I am reading a Parquet dataset of around 50+ GB which has 4013 partitions
>> with 240 columns. Below is my configuration:
>>
>> driver: 20G memory with 4 cores
>> executors: 45 executors with 15G memory and 4 cores each.
>>
>> I tried to read the data both with a DataFrame read and through the hive
>> context using Hive SQL, but in both cases it throws the error below with no
>> further description.
>>
>> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partition/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 16953"...
>>
>>
>> What could be wrong here? I think increasing memory alone will not help in
>> this case, since it has already hit the array size limit.
>>
>> Thanks,
>> Bijay
>>
>
>
>
> --
> --
> Cheers,
> Praj
>


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on a 64-bit JVM with less than 32G of heap, you might want
to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
generating an array with more than 2^31-1 elements, you might have to rethink
your options.

[1] https://spark.apache.org/docs/latest/tuning.html
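
For reference, here is a minimal sketch of one way to pass the flag, assuming
you build the SparkContext yourself in a PySpark job (if you submit through
spark-submit, the same executor key can be given with --conf instead):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Enable compressed oops on the executors. The driver JVM is already running
# by the time this code executes, so the driver-side flag has to be passed on
# the command line instead (e.g. --driver-java-options "-XX:+UseCompressedOops").
conf = SparkConf().set("spark.executor.extraJavaOptions",
                       "-XX:+UseCompressedOops")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)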

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading a Parquet dataset of around 50+ GB which has 4013 partitions
> with 240 columns. Below is my configuration:
>
> driver: 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores each.
>
> I tried to read the data both with a DataFrame read and through the hive
> context using Hive SQL, but in both cases it throws the error below with no
> further description.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partition/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong here? I think increasing memory alone will not help in
> this case, since it has already hit the array size limit.
>
> Thanks,
> Bijay
>



-- 
--
Cheers,
Praj


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread?

http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit

On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading a Parquet dataset of around 50+ GB which has 4013 partitions
> with 240 columns. Below is my configuration:
>
> driver: 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores each.
>
> I tried to read the data both with a DataFrame read and through the hive
> context using Hive SQL, but in both cases it throws the error below with no
> further description.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partition/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong here? I think increasing memory alone will not help in
> this case, since it has already hit the array size limit.
>
> Thanks,
> Bijay
>