Thanks for the reply, and sorry for my delayed response; I had to go find the
profile data to look up the class again.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
That class extends SizeEstimator and has a field "map". [...] The issue is:
> the buffer had a million of these so SizeEstimator of the buffer had to keep
> recalculating the same elements over and over again. SizeEstimator was
> on-cpu about 30% of the time, bounding the buffer got it to be < 5% (going
> off memory so may be off).
>
advancedxy <advance...@gmail.com>, I don't remember the code as well
anymore, but what we hit was a very simple schema (string, long). The issue
is the buffer had a million of these, so SizeEstimator of the buffer had to
keep recalculating the same elements over and over again. SizeEstimator was
on-cpu about 30% of the time; bounding the buffer got it to be < 5% (going
off memory so may be off).
Thanks, David. Another solution is to convert the protobuf object to a byte
array; it does speed up SizeEstimator.
On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <dcapw...@gmail.com> wrote:
> This is used to predict the current cost of memory so Spark knows to flush
> or not. This is [...]
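A minimal sketch of that byte-array conversion, assuming protoc-generated
message classes (the concrete message type never appears in this thread):

    import com.google.protobuf.Message
    import org.apache.spark.rdd.RDD

    // Serialize protobuf messages to byte arrays before the shuffle:
    // SizeEstimator reads an Array[Byte]'s size straight from the array
    // header instead of walking each message's object graph.
    def asBytes[M <: Message](records: RDD[M]): RDD[Array[Byte]] =
      records.map(_.toByteArray)

After the shuffle you would parse each array back with the generated parser
(e.g. MyMessage.parseFrom(bytes); MyMessage is a placeholder). The trade-off
is an extra serialize/parse pass on both sides of the shuffle in exchange
for a per-record size check that no longer depends on the message's
structure.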
> Could you provide a concrete use case if possible (code to reproduce the
> protobuf object and comparisons between protobuf and normal object)?
>
> I contributed a bit to SizeEstimator years ago, and to my understanding,
> the time complexity should be O(N) where N is the num of referenced fields
> recursively.
>
> We should definitely investigate this case if it indeed takes a lot of
> time on [...]
Hi Xin Liu,
Could you provide a concrete use case if possible (code to reproduce the
protobuf object and comparisons between protobuf and normal object)?
I contributed a bit to SizeEstimator years ago, and to my understanding, the
time complexity should be O(N) where N is the num of referenced fields,
recursively.
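That linear cost is easy to see from a spark-shell; a rough sketch (the
collection below is illustrative, not the reporter's actual data):

    import org.apache.spark.util.SizeEstimator
    import scala.collection.mutable.ArrayBuffer

    // SizeEstimator.estimate walks every object reachable from `buf`,
    // so a buffer of a million (String, Long) pairs is re-walked in
    // full on each call.
    val buf = ArrayBuffer.tabulate(1000000)(i => (i.toString, i.toLong))
    val t0 = System.nanoTime()
    val bytes = SizeEstimator.estimate(buf)
    println(s"$bytes bytes, estimated in ${(System.nanoTime() - t0) / 1e6} ms")

Note that SizeTracker samples with exponential backoff rather than
estimating on every insert, but each sample still walks the whole buffer,
which is why bounding the buffer helped in the case described above.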
> Hi folks,
>
> We have a situation where shuffled data is protobuf-based, and
> SizeEstimator is taking a lot of time.
>
> We have tried to override SizeEstimator to return a constant value, which
> speeds things up a lot.
>
> My questions: what is the side effect of disabling SizeEstimator? [...]
Hi folks,
We have a situation where shuffled data is protobuf-based, and
SizeEstimator is taking a lot of time.
We have tried to override SizeEstimator to return a constant value, which
speeds things up a lot.
My questions: what is the side effect of disabling SizeEstimator? Is it
just Spark [...]
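For illustration only, the "constant value" override reduces to replacing
the recursive walk with a fixed per-record cost; Spark exposes no public
hook for this, so the object and constant below are made up:

    // Hypothetical stand-in for a patched SizeEstimator: charge a fixed
    // cost per record instead of walking the object graph. Not a Spark
    // API; the 64-byte figure is an arbitrary assumption.
    object ConstantSizeEstimator {
      val bytesPerRecord = 64L
      def estimate(obj: AnyRef): Long = bytesPerRecord
    }

As David notes above, the estimate feeds the decision to spill: a constant
that undershoots the real record size delays spilling and risks an executor
OOM, while one that overshoots spills early and wastes disk I/O.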
Hi,
Is there a way to estimate the size of a DataFrame in Python?
Something similar to
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html
?
Thanks
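There is no direct PySpark equivalent; the linked API is JVM-only. A
minimal Scala sketch of what that call does:

    import org.apache.spark.util.SizeEstimator

    // Estimate the heap footprint of an object graph on the driver;
    // this is the method behind the linked javadoc.
    val bytes: Long = SizeEstimator.estimate(Seq.fill(1000)("some row"))
    println(s"approx. $bytes bytes")

From Python you could reach the same class through the Py4J gateway
(something like spark._jvm.org.apache.spark.util.SizeEstimator), but
applied to df._jdf it measures the JVM wrapper and plan objects, not the
rows themselves; caching the DataFrame and reading its size from the
storage UI is usually a better proxy.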
What are your JVM heap size settings? The OOM in SizeEstimator is caused by
a lot of entries in IdentityHashMap.
A quick guess is that the object in your dataset is a custom class and you
didn't implement the hashCode and equals methods correctly.
On Wednesday, April 15, 2015 at 3:10 PM [...]
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:177)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
at org.apache.spark.util.collection.SizeTracker$class.takeSample [...]
Thanks,
Aniket
On Wed, Apr 15, 2015 at 1:00 PM Xianjin YE <advance...@gmail.com> wrote:
> What are your JVM heap size settings? The OOM in SizeEstimator is caused
> by a lot of entries in IdentityHashMap.
> A quick guess is that the object in your dataset is a custom class and you
> didn't implement [...]
[...] are being loaded from our custom HDFS 2.3 RDD, and before we are using
even a fraction of the available Java heap and the native off-heap memory,
the loading slows to an absolute crawl. It appears clear from our profiling
of the Spark executor that in the Spark SizeEstimator an extremely high CPU
load [...]