Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the
DataFrame API to write, but it works fine when doing an insert overwrite into
the Hive table.

# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name))

# Fails with java.lang.OutOfMemoryError: Requested array size exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt', 'c_dt').parquet("/path/to/file/")
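
A sketch of one possible workaround, not something suggested in this thread:
repartition on the partition columns before the write so each task holds only
a few (e_dt, c_dt) combinations and keeps far fewer Parquet writers open at
once. It reuses the df, columns, and path from the snippet above; the
partition count of 400 is only illustrative, and the column-based repartition
assumes Spark 1.6 or later.

# Hedged sketch: shuffle rows by the partition columns before the
# partitioned write; 400 is an arbitrary illustrative value.
repartitioned = df.repartition(400, 'e_dt', 'c_dt')
(repartitioned.write
    .mode('overwrite')
    .partitionBy('e_dt', 'c_dt')
    .parquet("/path/to/file/"))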

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar  wrote:

> If you are running on a 64-bit JVM with less than a 32G heap, you might want
> to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
> generating an array with more than 2^31 - 1 elements, you might have to
> rethink your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak 
> wrote:
>
>> Hi,
>>
>> I am reading a Parquet data set of around 50+ GB which has 4013 partitions
>> and 240 columns. Below is my configuration:
>>
>> driver: 20 GB memory with 4 cores
>> executors: 45 executors, each with 15 GB memory and 4 cores
>>
>> I tried to read the data both with the DataFrame reader and with the hive
>> context using Hive SQL, but in both cases it throws the error below with no
>> further description.
>>
>> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partion/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 16953"...
>>
>>
>> What could be wrong here? I think increasing memory alone will not help in
>> this case, since it has hit the array size limit.
>>
>> Thanks,
>> Bijay
>>
>
>
>
> --
> --
> Cheers,
> Praj
>


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on a 64-bit JVM with less than a 32G heap, you might want
to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
generating an array with more than 2^31 - 1 elements, you might have to
rethink your options.

[1] https://spark.apache.org/docs/latest/tuning.html
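
As a minimal sketch of how that flag could be passed from PySpark: the
configuration keys below are standard Spark settings, but the app name,
script name, and overall setup are only illustrative, not taken from the
thread.

from pyspark import SparkConf, SparkContext

# Illustrative only: ask the executor JVMs to use compressed oops.
# Compressed oops only take effect while the heap stays below roughly 32 GB.
conf = (SparkConf()
        .setAppName("parquet-read")  # hypothetical app name
        .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops"))
sc = SparkContext(conf=conf)

# The driver JVM is already running by this point, so its flags are normally
# given on the command line instead, e.g.:
#   spark-submit --driver-java-options "-XX:+UseCompressedOops" my_job.py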

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading a Parquet data set of around 50+ GB which has 4013 partitions
> and 240 columns. Below is my configuration:
>
> driver: 20 GB memory with 4 cores
> executors: 45 executors, each with 15 GB memory and 4 cores
>
> I tried to read the data both with the DataFrame reader and with the hive
> context using Hive SQL, but in both cases it throws the error below with no
> further description.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong here? I think increasing memory alone will not help in
> this case, since it has hit the array size limit.
>
> Thanks,
> Bijay
>



-- 
--
Cheers,
Praj


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread?

http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit

On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading a Parquet data set of around 50+ GB which has 4013 partitions
> and 240 columns. Below is my configuration:
>
> driver: 20 GB memory with 4 cores
> executors: 45 executors, each with 15 GB memory and 4 cores
>
> I tried to read the data both with the DataFrame reader and with the hive
> context using Hive SQL, but in both cases it throws the error below with no
> further description.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong here? I think increasing memory alone will not help in
> this case, since it has hit the array size limit.
>
> Thanks,
> Bijay
>


SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi,

I am reading a Parquet data set of around 50+ GB which has 4013 partitions and
240 columns. Below is my configuration:

driver: 20 GB memory with 4 cores
executors: 45 executors, each with 15 GB memory and 4 cores

I tried to read the data both with the DataFrame reader and with the hive
context using Hive SQL, but in both cases it throws the error below with no
further description.

hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
sqlcontext.read.parquet("/path/to/partion/")

#
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 16953"...


What could be wrong here? I think increasing memory alone will not help in
this case, since it has hit the array size limit.

Thanks,
Bijay
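
A hedged sketch of one mitigation worth trying, not something proposed in the
thread: prune the read to a single partition and project only the columns that
are actually needed, so each task materializes far less data per row group.
The base path, column names, and final action below are placeholders; only
sqlcontext and part_dt come from the snippet above.

# Sketch only: read from the table's base path, prune to one date partition,
# and select a handful of the 240 columns instead of all of them.
df = (sqlcontext.read.parquet("/path/to/base_table/")   # hypothetical base path
      .where("date = '{0}'".format(part_dt))            # partition pruning
      .select("col_a", "col_b", "col_c"))               # hypothetical columns
df.count()  # stand-in for whatever action the job actually runs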


Requested array size exceeds VM limit Error

2015-02-05 Thread Muttineni, Vinay
Hi,
I have a 170 GB tab-delimited data set which I am converting into the
RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be
used for training a GBT model.
I got the "Size exceeds Integer.MAX_VALUE" error, which I fixed by
repartitioning the data set into 1000 partitions.
Now, the GBT code caches the data set, if it is not already cached, with the
operation input.persist(StorageLevel.MEMORY_AND_DISK)
(https://github.com/apache/spark/blob/branch-1.2/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala).
To pre-empt this caching so I can better control it, I am caching the RDD
(after the repartition) with this command:
trainingData.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

But now I get the following error on one executor, and the application fails
after a retry. I am not sure how to fix this. Could someone help?
One possible reason could be that I submit my job with --driver-memory 11G
--executor-memory 11G but am allotted only 5.7 GB; I am not sure whether this
could actually have an effect.
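
A rough sketch of the usual remedy when this particular OutOfMemoryError comes
from caching: keep each cached block's serialized bytes well under the ~2 GB
array ceiling by splitting into more, smaller partitions, and drop the "_2"
replication so every block is serialized and stored only once. The sketch is
written in PySpark purely for illustration (the job above is Scala), and the
RDD name and partition count are made up.

from pyspark import StorageLevel

# Sketch only: more partitions means less data to serialize per cached block.
training_data = labeled_points.repartition(3000)   # hypothetical RDD and count
# Unreplicated storage; the Scala-side equivalent of dropping "_2" would be
# StorageLevel.MEMORY_AND_DISK_SER instead of MEMORY_AND_DISK_SER_2.
training_data.persist(StorageLevel.MEMORY_AND_DISK)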

My runtime environment: 120 executors with 5.7 GB each; the driver has 5.3 GB.

My Spark config:
  spark.default.parallelism = 300
  spark.akka.frameSize = 256
  spark.akka.timeout = 1000
  spark.core.connection.ack.wait.timeout = 200
  spark.akka.threads = 10
  spark.serializer = org.apache.spark.serializer.KryoSerializer
  spark.kryoserializer.buffer.mb = 256

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
    at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
    at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
    at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200)
    at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
    at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:128)
    at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
    at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1175)
    at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1184)
    at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:103)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:789)
    at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:167)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)


Thank You!
Vinay