Thanks for your response Yana,
I can increase the MaxPermSize parameter and it will allow me to run the
unit test a few more times before I run out of memory.
However, the primary issue is that running the same unit test multiple
times in the same JVM results in increased memory usage (each run of th
I'd suggest setting sbt to fork when running tests.
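As a minimal sketch of that setting (syntax varies by sbt version; modern sbt spells it `Test / fork`, and MaxPermSize only exists on pre-Java-8 JVMs):

```scala
// build.sbt: fork a fresh JVM per test run so PermGen is reclaimed between runs
fork in Test := true

// illustrative sizing; only relevant on pre-Java-8 JVMs, which have a PermGen
javaOptions in Test += "-XX:MaxPermSize=256m"
```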
On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis
wrote:
> Thanks for your response Yana,
>
> I can increase the MaxPermSize parameter and it will allow me to run the
> unit test a few more times before I run out of memory.
>
> However, the primar
Hi Arun,
I have a few questions.
Does your XML file contain a few huge documents? If a single row has a huge
size (like 500MB), it would consume a lot of memory, because, if I remember
correctly, it has to hold at least one whole row in order to iterate. I
remember this happened to me before while proc
Thanks for the quick response.
It's a single XML file and I am using a top-level rowTag. So it creates
only one row in a DataFrame with 5 columns. One of these columns will
contain most of the data as a StructType. Is there a limit on how much
data a single cell of a DataFrame can hold?
I will check with ne
I tried the options below.
1) Increased executor memory, up to the maximum possible, 14GB.
Same error.
2) Tried the new version - spark-xml_2.10:0.4.1. Same error.
3) Tried lower-level rowTags. It worked for a lower-level rowTag and
returned 16,000 rows.
Are there any workarounds for this issue
It seems a bit weird. Could we open an issue and discuss it in the
repository link I sent?
Let me try to reproduce your case with your data if possible.
On 17 Nov 2016 2:26 a.m., "Arun Patel" wrote:
> I tried below options.
>
> 1) Increase executor memory. Increased up to maximum possibility 14GB.
>
You have too many partitions, so when the driver tries to gather the
status of all map outputs and send it back to the executors, it chokes on
the size of the structure that needs to be gzipped; since that structure
is bigger than 2GiB, it produces an OOM.
On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman wrote:
>
>
I understand the error is because the number of partitions is very high,
yet when processing 40 TB (and this number is expected to grow) this number
seems reasonable:
40TB / 300,000 will result in partitions size of ~ 130MB (data should be
evenly distributed).
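The arithmetic checks out (a quick sketch, assuming decimal units and a perfectly even distribution):

```python
# 40 TB spread evenly over 300,000 partitions
total_bytes = 40 * 10**12
num_partitions = 300_000

partition_mb = total_bytes / num_partitions / 10**6
print(round(partition_mb))  # 133 -> ~130MB per partition, as stated
```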
On Fri, Sep 7, 2018 at 6:28 PM Vadim
I ran into the same issue processing 20TB of data, with 200k tasks on both
the map and reduce sides. Reducing to 100k tasks each resolved the issue.
But this could/would be a major problem in cases where the data is bigger or
the computation is heavier, since reducing the number of partitions may n
Hi!
What file system are you using: EMRFS or HDFS?
Also, what memory are you using for the reducer?
On Thu, Nov 7, 2019 at 8:37 PM abeboparebop wrote:
> I ran into the same issue processing 20TB of data, with 200k tasks on both
> the map and reduce sides. Reducing to 100k tasks each resolved the
File system is HDFS. Executors are 2 cores, 14GB RAM. But I don't think
either of these relate to the problem -- this is a memory allocation issue
on the driver side, and happens in an intermediate stage that has no HDFS
read/write.
On Fri, Nov 8, 2019 at 10:01 AM Spico Florin wrote:
> Hi!
> Wha
Basically, the driver tracks partitions and sends them over to the
executors, so what it's trying to do is serialize and compress the map;
but because the map is so big, it goes over 2GiB, which is Java's limit
on the maximum size of a byte array, so the whole thing fails.
The size of data doesn't matter here m
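A back-of-envelope sketch of that scaling (the one-byte-per-entry figure is illustrative, not Spark's actual encoding; the real MapStatus structures are more compact but grow the same way):

```python
# The driver's map-output status table has one entry per
# (map task, reduce partition) pair, so it grows as M * R.
map_tasks = 300_000
reduce_partitions = 300_000
bytes_per_entry = 1  # illustrative lower bound

table_bytes = map_tasks * reduce_partitions * bytes_per_entry
java_max_byte_array = 2**31 - 1  # Java arrays are indexed by int

print(table_bytes > java_max_byte_array)  # True: ~90 GB, far past the 2 GiB limit
```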
Sorry for the noise, folks! I understand that reducing the number of
partitions works around the issue (at the scale I'm working at, anyway) --
as I mentioned in my initial email -- and I understand the root cause. I'm
not looking for advice on how to resolve my issue. I'm just pointing out
that th
There's an umbrella ticket for various 2GB limitations
https://issues.apache.org/jira/browse/SPARK-6235
On Fri, Nov 8, 2019 at 4:11 PM Jacob Lynn wrote:
>
> Sorry for the noise, folks! I understand that reducing the number of
> partitions works around the issue (at the scale I'm working at, anyw
Thanks for the pointer, Vadim. However, I just tried it with Spark 2.4 and
get the same failure. (I was previously testing with 2.2 and/or 2.3.) And I
don't see this particular issue referred to there. The ticket that Harel
commented on indeed appears to be the most similar one to this issue:
http
Can someone please respond to this?
On Wed, Mar 25, 2015 at 11:18 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
>
>
>
> I modified the Hive query but run into same error. (
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hiv
Resolved. The bold text below is the fix.
./bin/spark-submit -v --master yarn-cluster --jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.j
Hi Spark community,
I have a Spark Structured Streaming application that reads data from a
socket source (implemented very similarly to the
TextSocketMicroBatchStream). The issue is that the source can generate
data faster than Spark can process it, eventually leading to an
OutOfMemoryError
Hi,
I am reading a Parquet file of around 50+ GB, which has 4013 partitions
with 240 columns. Below is my configuration:
driver : 20G memory with 4 cores
executors: 45 executors with 15G memory and 4 cores.
I tried to read the data both using the DataFrame read and using the Hive
context to read the data us
Hi,
I have a data set size of 10GB(example Test.txt).
I wrote my pyspark script like below(Test.py):
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
sc = spark.sparkContext
t the source can generate
> data faster than Spark can process it, eventually leading to an
> OutOfMemoryError when Spark runs out of memory trying to queue up all
> the pending data.
>
> I'm looking for advice on the most idiomatic/recommended way in Spark to
> rate-limit da
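Spark's socket source has no built-in rate limit, so one generic pattern (a sketch in plain Python, not a Spark API; all names here are made up) is a bounded buffer between the reader and the ingestion path, so a fast producer blocks instead of queueing unboundedly:

```python
import queue
import threading

# Bounded buffer: put() blocks once maxsize is hit, throttling the producer
# to the consumer's pace instead of growing the heap without bound.
buf = queue.Queue(maxsize=100)

def producer():
    for i in range(1000):
        buf.put(i)  # blocks while the buffer is full

t = threading.Thread(target=producer)
t.start()
consumed = [buf.get() for _ in range(1000)]
t.join()
print(len(consumed))  # 1000, with at most 100 items buffered at any moment
```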
Have you seen this thread ?
http://search-hadoop.com/m/q3RTtyXr2N13hf9O&subj=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit
On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak wrote:
> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitio
If you are running on a 64-bit JVM with less than a 32GB heap, you might want
to enable -XX:+UseCompressedOops [1]. And if your DataFrame is somehow
generating arrays with more than 2^31-1 elements, you might have to rethink
your options.
[1] https://spark.apache.org/docs/latest/tuning.html
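For reference, the flag can be passed per JVM like this (a sketch; on 64-bit JVMs with heaps under 32GB, recent JVMs enable compressed oops by default anyway):

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseCompressedOops" \
  ...
```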
On Wed, May 4,
Thanks for the suggestions and links. The problem arises when I use the
DataFrame API to write, but it works fine when doing an insert overwrite
into the Hive table.
# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select *
from temp_table".format(table_name))
# Doesn't work, thr
We're getting the below error. Tried increasing spark.executor.memory e.g.
from 1g to 2g but the below error still happens.
Any recommendations? Something to do with specifying -Xmx in the submit job
scripts?
Thanks.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
excee
Hi,
I am testing our application (similar to "personalised page rank" using Pregel;
note that each vertex property needs considerably more space to store after
each new iteration), and it works correctly on a small graph. (We have a
single machine: 8 cores, 16GB memory.)
But when we ran it on larger
That looks like it's during recovery from a checkpoint, so it'd be driver
memory not executor memory.
How big is the checkpoint directory that you're trying to restore from?
On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg <
dgoldenberg...@gmail.com> wrote:
> We're getting the below error. T
Thanks, Cody, will try that. Unfortunately due to a reinstall I don't have
the original checkpointing directory :( Thanks for the clarification on
spark.driver.memory, I'll keep testing (at 2g things seem OK for now).
On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger wrote:
> That looks like it'
I wonder, during recovery from a checkpoint, whether we can estimate the size
of the checkpoint and compare it with Runtime.getRuntime().freeMemory().
If the size of the checkpoint is much bigger than free memory, log a warning, etc.
Cheers
On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg wrote:
> Thank
Would there be a way to chunk up/batch up the contents of the checkpointing
directories as they're being processed by Spark Streaming? Is it mandatory
to load the whole thing in one go?
On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu wrote:
> I wonder during recovery from a checkpoint whether we can e
You need to keep a certain number of RDDs around for checkpointing, based
on e.g. the window size. Those would all need to be loaded at once.
On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg <
dgoldenberg...@gmail.com> wrote:
> Would there be a way to chunk up/batch up the contents of the
> c
Looks like the workaround is to reduce the window length.
Cheers
On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger wrote:
> You need to keep a certain number of rdds around for checkpointing, based
> on e.g. the window size. Those would all need to be loaded at once.
>
> On Mon, Aug 10, 2015 at 11:
"You need to keep a certain number of rdds around for checkpointing" --
that seems like a hefty expense to pay in order to achieve fault
tolerance. Why does Spark persist whole RDDs of data? Shouldn't it be
sufficient to just persist the offsets, to know where to resume from?
Thanks.
On Mon, A
The rdd is indeed defined by mostly just the offsets / topic partitions.
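Conceptually (a Python sketch, not the actual Scala KafkaRDD code), what gets checkpointed per partition is little more than an offset range; the messages themselves are re-read from Kafka when the partition is computed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OffsetRange:
    """Roughly what a KafkaRDD partition persists: no messages,
    just where to resume reading from Kafka."""
    topic: str
    partition: int
    from_offset: int
    until_offset: int

    def count(self) -> int:
        return self.until_offset - self.from_offset

r = OffsetRange("events", partition=0, from_offset=1000, until_offset=1500)
print(r.count())  # 500 messages described by a handful of fields
```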
On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg wrote:
> "You need to keep a certain number of rdds around for checkpointing" --
> that seems like a hefty expense to pay in order to achieve fault
> tolerance. Why does S
Well, RDDs also contain data, don't they?
The question is, what can be so hefty in the checkpointing directory that it
causes the Spark driver to run out of memory? It seems that this makes
checkpointing expensive, in terms of I/O and memory consumption. Two
network hops -- to the driver, then to the workers. Hef
No, it's not like a given KafkaRDD object contains an array of messages
that gets serialized with the object. Its compute method generates an
iterator of messages as needed, by connecting to kafka.
I don't know what was so hefty in your checkpoint directory, because you
deleted it. My checkpoint
Hi,
I have a Spark workflow that when run on a relatively small portion of data
works fine, but when run on big data fails with strange errors. In the log
files of failed executors I found the following errors:
Firstly
> Managed memory leak detected; size = 263403077 bytes, TID = 6524
And
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote:
> I am testing our application(similar to "personalised page rank" using
> Pregel, and note that each vertex property will need pretty much more space
> to store after new iteration)
[...]
But when we ran it on a larger graph (e.g. LiveJournal), it
Hi Ankur,
Thanks so much for your advice.
But it failed when I tried to set the storage level while constructing a graph.
val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)
Erro
At 2014-09-03 17:58:09 +0200, Yifan LI wrote:
> val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
> numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)
>
> Error: java.lang.UnsupportedOperationException: Cannot change storage l
Thank you, Ankur! :)
But how do I assign the storage level to a new vertex RDD that is mapped from
an existing vertex RDD?
e.g.
val newVertexRDD =
graph.collectNeighborIds(EdgeDirection.Out).map{ case (id: VertexId,
a: Array[VertexId]) => (id, initialHashMap(a)) }
the new one will be combined with th
At 2014-09-05 12:13:18 +0200, Yifan LI wrote:
> But how to assign the storage level to a new vertices RDD that mapped from
> an existing vertices RDD,
> e.g.
> val newVertexRDD =
> graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId,
> a:Array[VertexId]) => (id, initialHashMap(a))}
Are you using Spark 1.6+?
See SPARK-11293
On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:
> Hi,
>
>
> I have a Spark workflow that when run on a relatively small portion of
> data works fine, but when run on big data fails with strange errors. In the
.6.0". I have 1.6.0 and
therefore should have it fixed, right? Or what do I do to fix it?
Thanks,
Dusan
From: Ted Yu
Sent: Wednesday, August 3, 2016 3:52 PM
To: Rychnovsky, Dusan
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemor
3, 2016 3:58 PM
To: Ted Yu
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire
X bytes of memory, got 0
Yes, I believe I'm using Spark 1.6.0.
> spark-sub
OK, thank you. What do you suggest I do to get rid of the error?
From: Ted Yu
Sent: Wednesday, August 3, 2016 6:10 PM
To: Rychnovsky, Dusan
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire
X bytes of
> The latest QA run was no longer accessible (error 404):
>