Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
Thanks for your response Yana, I can increase the MaxPermSize parameter and it will allow me to run the unit test a few more times before I run out of memory. However, the primary issue is that running the same unit test in the same JVM (multiple times) results in increased memory (each run of th

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Michael Armbrust
I'd suggest setting sbt to fork when running tests. On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis wrote: > Thanks for your response Yana, > > I can increase the MaxPermSize parameter and it will allow me to run the > unit test a few more times before I run out of memory. > > However, the primar
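A minimal build.sbt sketch of the forking approach suggested here, assuming an sbt 0.13-era build; the memory values are illustrative, not recommendations:

    // Run tests in a freshly forked JVM so HiveContext/Derby state and PermGen
    // usage do not accumulate across repeated runs inside one sbt session.
    fork in Test := true

    // Give the forked test JVM its own memory settings (illustrative values).
    javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m")

    // Optionally run suites sequentially so many Spark contexts don't share
    // one forked JVM at the same time.
    parallelExecution in Test := false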

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun, I have a few questions. Does your XML file have just a few huge documents? In the case of a row having a huge size (like 500MB), it would consume a lot of memory because, if I remember correctly, it at least has to hold a whole row in order to iterate. I remember this happened to me before while proc

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
Thanks for the quick response. It's a single XML file and I am using a top-level rowTag. So it creates only one row in a DataFrame with 5 columns. One of these columns will contain most of the data as a StructType. Is there a limit on how much data can be stored in a single cell of a DataFrame? I will check with ne

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
I tried the options below. 1) Increased executor memory, up to the maximum possible of 14GB. Same error. 2) Tried the new version, spark-xml_2.10:0.4.1. Same error. 3) Tried lower-level rowTags. It worked for a lower-level rowTag and returned 16000 rows. Are there any workarounds for this issue
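For reference, a hedged sketch of what the lower-level rowTag workaround in option 3 looks like with spark-xml, assuming Spark 2.x's SparkSession (use sqlContext.read on 1.x); the path and tag name are placeholders, not Arun's actual schema:

    // Use a deeper, repeated element as rowTag so each row stays small, instead
    // of the single top-level tag that yields one enormous row.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")        // hypothetical repeated element
      .load("/data/input.xml")           // hypothetical path

    df.count()                           // many small rows rather than one giant one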

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
It seems a bit weird. Could we open an issue and talk in the repository link I sent? Let me try to reproduce your case with your data if possible. On 17 Nov 2016 2:26 a.m., "Arun Patel" wrote: > I tried below options. > > 1) Increase executor memory. Increased up to maximum possibility 14GB. >

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Vadim Semenov
You have too many partitions, so when the driver tries to gather the status of all map outputs and send it back to the executors, it chokes on the size of the structure that needs to be GZipped; since it's bigger than 2GiB, it produces an OOM. On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman wrote: > >

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Harel Gliksman
I understand the error is because the number of partitions is very high, yet when processing 40 TB (and this number is expected to grow) this number seems reasonable: 40TB / 300,000 results in a partition size of ~130MB (the data should be evenly distributed). On Fri, Sep 7, 2018 at 6:28 PM Vadim

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-07 Thread abeboparebop
I ran into the same issue processing 20TB of data, with 200k tasks on both the map and reduce sides. Reducing to 100k tasks each resolved the issue. But this could/would be a major problem in cases where the data is bigger or the computation is heavier, since reducing the number of partitions may n
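A hedged sketch of the partition-reduction workaround described above; the counts are illustrative, and the right values depend on how large each partition becomes:

    // Fewer reduce-side tasks: lower the shuffle partition count for Spark SQL...
    spark.conf.set("spark.sql.shuffle.partitions", "100000")

    // ...and/or repartition the map side before the wide operation, accepting
    // larger partitions in exchange for smaller map-status metadata on the driver.
    val coarser = inputDF.repartition(100000)   // inputDF is a placeholder DataFrame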

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Spico Florin
Hi! What file system are you using: EMRFS or HDFS? Also, what memory are you using for the reducer? On Thu, Nov 7, 2019 at 8:37 PM abeboparebop wrote: > I ran into the same issue processing 20TB of data, with 200k tasks on both > the map and reduce sides. Reducing to 100k tasks each resolved the

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Jacob Lynn
File system is HDFS. Executors are 2 cores, 14GB RAM. But I don't think either of these relates to the problem -- this is a memory allocation issue on the driver side, and it happens in an intermediate stage that has no HDFS read/write. On Fri, Nov 8, 2019 at 10:01 AM Spico Florin wrote: > Hi! > Wha

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Vadim Semenov
Basically, the driver tracks partitions and sends them over to the executors. What it's trying to do is serialize and compress the map, but because it's so big it goes over 2GiB, which is Java's limit on the maximum size of a byte array, so the whole thing fails. The size of the data doesn't matter here m

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Jacob Lynn
Sorry for the noise, folks! I understand that reducing the number of partitions works around the issue (at the scale I'm working at, anyway) -- as I mentioned in my initial email -- and I understand the root cause. I'm not looking for advice on how to resolve my issue. I'm just pointing out that th

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-11 Thread Vadim Semenov
There's an umbrella ticket for various 2GB limitations https://issues.apache.org/jira/browse/SPARK-6235 On Fri, Nov 8, 2019 at 4:11 PM Jacob Lynn wrote: > > Sorry for the noise, folks! I understand that reducing the number of > partitions works around the issue (at the scale I'm working at, anyw

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-12 Thread Jacob Lynn
Thanks for the pointer, Vadim. However, I just tried it with Spark 2.4 and get the same failure. (I was previously testing with 2.2 and/or 2.3.) And I don't see this particular issue referred to there. The ticket that Harel commented on indeed appears to be the most similar one to this issue: http

Re: Unable to run Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
Can someone please respond to this? On Wed, Mar 25, 2015 at 11:18 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables > > > > I modified the Hive query but ran into the same error. ( > http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hiv

Re: Unable to run Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-26 Thread ๏̯͡๏
Resolved. The bold text is the FIX. ./bin/spark-submit -v --master yarn-cluster --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.j

Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Baran, Mert
Hi Spark community, I have a Spark Structured Streaming application that reads data from a socket source (implemented very similarly to the TextSocketMicroBatchStream). The issue is that the source can generate data faster than Spark can process it, eventually leading to an OutOfMemoryError
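Not an answer for a custom socket source specifically, but for comparison, some built-in sources already expose per-trigger admission limits; a hedged sketch with placeholder broker and topic names:

    // Kafka's maxOffsetsPerTrigger caps how many records are admitted per
    // micro-batch; the file source has an analogous maxFilesPerTrigger. A custom
    // MicroBatchStream has to enforce a similar cap itself when it reports the
    // latest offset for a batch.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "events")                      // placeholder topic
      .option("maxOffsetsPerTrigger", "100000")           // rows admitted per batch
      .load()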

SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi, I am reading a parquet file of around 50+ GB which has 4013 partitions with 240 columns. Below is my configuration. Driver: 20G memory with 4 cores. Executors: 45 executors with 15G memory and 4 cores each. I tried to read the data using both the DataFrame read and the hive context to read the data us

Java heap space OutOfMemoryError in pyspark spark-submit (spark version:2.2)

2018-01-04 Thread Anu B Nair
Hi, I have a data set of size 10GB (for example, Test.txt). I wrote my pyspark script like below (Test.py): from pyspark import SparkConf from pyspark.sql import SparkSession from pyspark.sql import SQLContext spark = SparkSession.builder.appName("FilterProduct").getOrCreate() sc = spark.sparkContext

Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
t the source can generate > data faster than Spark can process it, eventually leading to an > OutOfMemoryError when Spark runs out of memory trying to queue up all > the pending data. > > I'm looking for advice on the most idiomatic/recommended way in Spark to > rate-limit da

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/q3RTtyXr2N13hf9O&subj=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak wrote: > Hi, > > I am reading the parquet file around 50+ G which has 4013 partitio

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on a 64-bit JVM with less than 32G of heap, you might want to enable -XX:+UseCompressedOops [1]. And if your DataFrame is somehow generating an array with more than 2^31-1 elements, you might have to rethink your options. [1] https://spark.apache.org/docs/latest/tuning.html On Wed, May 4,
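A hedged sketch of where that flag would go; executor-side JVM options can be set through Spark configuration, while the driver's options generally have to be supplied before its JVM starts (spark-submit or spark-defaults.conf):

    import org.apache.spark.SparkConf

    // Enable compressed ordinary object pointers on the executors; only worthwhile
    // on a 64-bit JVM with heaps under roughly 32G, as noted above.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
    // spark.driver.extraJavaOptions works the same way but, like driver memory,
    // is normally passed at submit time rather than set programmatically.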

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the DataFrame API to write, but it works fine when doing insert overwrite into the Hive table. # Works fine hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name)) # Doesn't work, thr

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the error below. We tried increasing spark.executor.memory, e.g. from 1g to 2g, but the error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit excee

[GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Yifan LI
Hi, I am testing our application (similar to "personalised page rank" using Pregel; note that each vertex property needs considerably more space to store after each new iteration). It works correctly on a small graph (we have a single machine: 8 cores, 16G memory), but when we ran it on a larger

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
That looks like it's during recovery from a checkpoint, so it'd be driver memory not executor memory. How big is the checkpoint directory that you're trying to restore from? On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > We're getting the below error. T
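For context, a hedged sketch of the checkpoint-recovery pattern under discussion; on restart the driver rebuilds the streaming DAG and pending batches from this directory, which is why the pressure lands on driver memory (path, app name, and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-stream"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-stream")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the Kafka direct stream and transformations here ...
      ssc
    }

    // On a clean start this calls createContext(); after a failure it instead
    // deserializes the checkpointed metadata into the driver.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()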

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Thanks, Cody, will try that. Unfortunately due to a reinstall I don't have the original checkpointing directory :( Thanks for the clarification on spark.driver.memory, I'll keep testing (at 2g things seem OK for now). On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger wrote: > That looks like it'

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
I wonder whether, during recovery from a checkpoint, we can estimate the size of the checkpoint and compare it with Runtime.getRuntime().freeMemory(). If the size of the checkpoint is much bigger than free memory, log a warning, etc. Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg wrote: > Thank
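A hedged sketch of the check being proposed, done by hand outside Spark; it assumes an existing SparkContext sc, and the checkpoint path is a placeholder:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Compare the on-disk size of the checkpoint directory with the driver's free
    // heap before attempting recovery, and warn if recovery looks risky.
    val checkpointPath = new Path("hdfs:///checkpoints/my-stream")   // placeholder
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val checkpointBytes = fs.getContentSummary(checkpointPath).getLength
    val freeHeapBytes = Runtime.getRuntime.freeMemory()

    if (checkpointBytes > freeHeapBytes) {
      println(s"WARN: checkpoint is $checkpointBytes bytes but only $freeHeapBytes bytes of driver heap are free")
    }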

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Would there be a way to chunk up/batch up the contents of the checkpointing directories as they're being processed by Spark Streaming? Is it mandatory to load the whole thing in one go? On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu wrote: > I wonder during recovery from a checkpoint whether we can e

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > Would there be a way to chunk up/batch up the contents of the > c

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
Looks like the workaround is to reduce the *window length*. Cheers On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger wrote: > You need to keep a certain number of rdds around for checkpointing, based > on e.g. the window size. Those would all need to be loaded at once. > > On Mon, Aug 10, 2015 at 11:

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
"You need to keep a certain number of rdds around for checkpointing" -- that seems like a hefty expense to pay in order to achieve fault tolerance. Why does Spark persist whole RDD's of data? Shouldn't it be sufficient to just persist the offsets, to know where to resume from? Thanks. On Mon, A

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
The rdd is indeed defined by mostly just the offsets / topic partitions. On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg wrote: > "You need to keep a certain number of rdds around for checkpointing" -- > that seems like a hefty expense to pay in order to achieve fault > tolerance. Why does S
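To make that concrete, a hedged sketch using the Spark 1.x Kafka 0.8 direct API, where a stream is reconstructed purely from stored offsets; it assumes an existing StreamingContext ssc, and the broker, topic, and offset values are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Offsets the application recorded itself (e.g. in ZooKeeper or a database).
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")

    // The resulting KafkaRDDs are defined by these offset ranges; messages are
    // fetched lazily in compute(), not stored inside the RDD objects themselves.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))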

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Well, RDD"s also contain data, don't they? The question is, what can be so hefty in the checkpointing directory to cause Spark driver to run out of memory? It seems that it makes checkpointing expensive, in terms of I/O and memory consumption. Two network hops -- to driver, then to workers. Hef

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
No, it's not like a given KafkaRDD object contains an array of messages that gets serialized with the object. Its compute method generates an iterator of messages as needed, by connecting to kafka. I don't know what was so hefty in your checkpoint directory, because you deleted it. My checkpoint

Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Hi, I have a Spark workflow that works fine when run on a relatively small portion of data, but fails with strange errors when run on big data. In the log files of the failed executors I found the following errors: Firstly > Managed memory leak detected; size = 263403077 bytes, TID = 6524 And

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote: > I am testing our application(similar to "personalised page rank" using > Pregel, and note that each vertex property will need pretty much more space > to store after new iteration) [...] But when we ran it on larger graph(e.g. LiveJouranl), it

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-03 Thread Yifan LI
Hi Ankur, Thanks so much for your advice. But it failed when I tried to set the storage level while constructing the graph. val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) Erro

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-03 Thread Ankur Dave
At 2014-09-03 17:58:09 +0200, Yifan LI wrote: > val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = > numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) > > Error: java.lang.UnsupportedOperationException: Cannot change storage l
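One way around the "cannot change storage level" error, sketched on the assumption of a GraphX release whose loader exposes storage-level parameters (the partition argument is named minEdgePartitions in older releases); sc, edgesFile and numPartitions are taken from Yifan's snippet:

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
    import org.apache.spark.storage.StorageLevel

    // Ask GraphLoader for MEMORY_AND_DISK up front instead of calling persist()
    // on a graph whose RDDs already have a storage level assigned.
    val graph = GraphLoader
      .edgeListFile(
        sc, edgesFile,
        numEdgePartitions = numPartitions,
        edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
        vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
      .partitionBy(PartitionStrategy.EdgePartition2D)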

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-05 Thread Yifan LI
Thank you, Ankur! :) But how do I assign the storage level to a new vertex RDD that is mapped from an existing vertex RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) => (id, initialHashMap(a))}*? The new one will be combined with th

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-08 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI wrote: > But how to assign the storage level to a new vertices RDD that mapped from > an existing vertices RDD, > e.g. > *val newVertexRDD = > graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, > a:Array[VertexId]) => (id, initialHashMap(a))}*
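A hedged sketch of one way to give that derived RDD a storage level: persist it explicitly before it is first materialized (initialHashMap is Yifan's own helper and is assumed here; this is not necessarily the exact approach Ankur went on to recommend):

    import org.apache.spark.graphx.{EdgeDirection, VertexId}
    import org.apache.spark.storage.StorageLevel

    // collectNeighborIds returns a new RDD with no storage level yet, so persist()
    // is legal here, unlike on the already-cached RDDs inside the loaded graph.
    val newVertexRDD = graph
      .collectNeighborIds(EdgeDirection.Out)
      .map { case (id: VertexId, a: Array[VertexId]) => (id, initialHashMap(a)) }
      .persist(StorageLevel.MEMORY_AND_DISK)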

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Are you using Spark 1.6+? See SPARK-11293 On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan < dusan.rychnov...@firma.seznam.cz> wrote: > Hi, > > > I have a Spark workflow that when run on a relatively small portion of > data works fine, but when run on big data fails with strange errors. In the

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
.6.0". I have 1.6.0 and therefore should have it fixed, right? Or what do I do to fix it? Thanks, Dusan From: Ted Yu Sent: Wednesday, August 3, 2016 3:52 PM To: Rychnovsky, Dusan Cc: user@spark.apache.org Subject: Re: Managed memory leak detected + OutOfMemor

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
3, 2016 3:58 PM To: Ted Yu Cc: user@spark.apache.org Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0 Yes, I believe I'm using Spark 1.6.0. > spark-sub

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
> > > -- > *From:* Rychnovsky, Dusan > *Sent:* Wednesday, August 3, 2016 3:58 PM > *To:* Ted Yu > > *Cc:* user@spark.apache.org > *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to &g

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
OK, thank you. What do you suggest I do to get rid of the error? From: Ted Yu Sent: Wednesday, August 3, 2016 6:10 PM To: Rychnovsky, Dusan Cc: user@spark.apache.org Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
> *Sent:* Wednesday, August 3, 2016 6:10 PM > *To:* Rychnovsky, Dusan > *Cc:* user@spark.apache.org > *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to > acquire X bytes of memory, got 0 > > The latest QA run was no longer accessible (error 404): >
