Re: SizeEstimator

2018-02-27 Thread David Capwell
Thanks for the reply, and sorry for my delayed response; I had to go find the profile data to look up the class again. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala That class extends SizeEstimator and has a field "map"

Re: SizeEstimator

2018-02-26 Thread 叶先进
> The issue is the buffer had a million of these, so SizeEstimator had to keep recalculating the same elements in the buffer over and over again. SizeEstimator was on-CPU about 30% of the time; bounding the buffer got it to be < 5% (going off memory, so may be off).

Re: SizeEstimator

2018-02-26 Thread David Capwell
advancedxy <advance...@gmail.com>, I don't remember the code as well anymore, but what we hit was a very simple schema (string, long). The issue is the buffer had a million of these, so SizeEstimator had to keep recalculating the same elements in the buffer over and over again. SizeEstimator was on-CPU about 30% of the time; bounding the buffer got it to be < 5% (going off memory, so may be off).
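A minimal sketch of the cost pattern described above, assuming only spark-core on the classpath (SizeEstimator.estimate is a driver-side utility and needs no SparkContext); the element counts and the (String, Long) schema are illustrative:

import org.apache.spark.util.SizeEstimator
import scala.collection.mutable.ArrayBuffer

object EstimateCost {
  def main(args: Array[String]): Unit = {
    val buffer = ArrayBuffer[(String, Long)]()
    var i = 0L
    while (i < 1000000L) {
      buffer += ((s"key-$i", i))
      i += 1
      // Every estimate walks all elements inserted so far, so repeatedly
      // re-estimating a large, unbounded buffer is what dominates CPU.
      if (i % 200000 == 0) {
        val start = System.nanoTime()
        val bytes = SizeEstimator.estimate(buffer)
        val ms = (System.nanoTime() - start) / 1e6
        println(f"$i%,d elements -> ~$bytes%,d bytes, estimated in $ms%.1f ms")
      }
    }
  }
}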

Re: SizeEstimator

2018-02-26 Thread Xin Liu
Thanks David. Another solution is to convert the protobuf object to a byte array, which does speed up SizeEstimator. On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <dcapw...@gmail.com> wrote: > This is used to predict the current cost of memory so Spark knows to flush or not. This is
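A sketch of that byte-array workaround, not taken from the thread: toBytes and fromBytes below are placeholders for what a generated protobuf class provides (the message's toByteArray and the parser's parseFrom), so SizeEstimator only has to size a flat Array[Byte] per record instead of a deep object graph.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Shuffle flat byte arrays instead of deep protobuf objects, then rebuild
// the messages on the reduce side. toBytes/fromBytes stand in for the
// generated protobuf toByteArray / parseFrom methods.
def shuffleAsBytes[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)],
    toBytes: V => Array[Byte],
    fromBytes: Array[Byte] => V): RDD[(K, Iterable[V])] = {
  rdd
    .mapValues(toBytes)          // cheap for SizeEstimator: one flat array per value
    .groupByKey()                // the shuffle sees only (K, Array[Byte]) pairs
    .mapValues(_.map(fromBytes)) // deserialize after the shuffle
}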

Re: SizeEstimator

2018-02-26 Thread Xin Liu
> Could you provide a concrete use case if possible (code to reproduce protobuf object and comparisons between protobuf and normal object)?
>
> I contributed a bit to SizeEstimator years ago, and to my understanding, the time complexity should be O(N) where N is the number of referenced fields, recursively.
>
> We should definitely investigate this case if it indeed takes a lot of time on

Re: SizeEstimator

2018-02-26 Thread 叶先进
Hi Xin Liu, Could you provide a concrete use case if possible (code to reproduce protobuf object and comparisons between protobuf and normal object)? I contributed a bit to SizeEstimator years ago, and to my understanding, the time complexity should be O(N) where N is the number of referenced fields, recursively.
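A rough, self-contained illustration of that O(N)-in-referenced-fields behaviour (the sizes and shapes below are made up for the comparison): a single flat byte array is one reference to size, while a collection of small wrapper objects forces the estimator to visit every element and every field it references.

import org.apache.spark.util.SizeEstimator

object EstimateShape {
  case class Wrapper(id: Long, payload: Array[Byte])

  def time[T](body: => T): (T, Double) = {
    val t0 = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - t0) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    // Roughly the same number of bytes, laid out as one object vs. many.
    val flat   = new Array[Byte](8 * 1024 * 1024)
    val nested = (0 until 100000).map(i => Wrapper(i, new Array[Byte](80))).toArray

    val (flatSize, flatMs)     = time(SizeEstimator.estimate(flat))
    val (nestedSize, nestedMs) = time(SizeEstimator.estimate(nested))
    println(f"flat:   ~$flatSize%,d bytes estimated in $flatMs%.2f ms")
    println(f"nested: ~$nestedSize%,d bytes estimated in $nestedMs%.2f ms")
  }
}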

Re: SizeEstimator

2018-02-26 Thread David Capwell
> Hi folks,
>
> We have a situation where shuffled data is protobuf-based, and SizeEstimator is taking a lot of time.
>
> We have tried to override SizeEstimator to return a constant value, which speeds things up a lot.
>
> My questions: what is the side effect of disabling SizeEstimator?

SizeEstimator

2018-02-26 Thread Xin Liu
Hi folks, We have a situation where shuffled data is protobuf-based, and SizeEstimator is taking a lot of time. We have tried to override SizeEstimator to return a constant value, which speeds things up a lot. My questions: what is the side effect of disabling SizeEstimator? Is it just spark
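To make the trade-off concrete, here is a simplified sketch (not Spark's actual code; the class and the fixed limit are invented for illustration) of how a size estimate feeds a spill decision. If the estimate is replaced by a constant that under-reports, the in-memory buffer can outgrow the real heap before a spill is triggered; if it over-reports, data is spilled to disk far earlier than necessary.

import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for a spillable shuffle buffer: the accuracy of
// estimateSize is the only thing deciding when to spill.
final class SpillingBuffer[T](memoryLimitBytes: Long, estimateSize: Seq[T] => Long) {
  private val buffer = ArrayBuffer.empty[T]

  def insert(elem: T): Unit = {
    buffer += elem
    if (estimateSize(buffer) >= memoryLimitBytes) spill()
  }

  private def spill(): Unit = {
    // Write the buffered records to disk and release the memory (omitted).
    buffer.clear()
  }
}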

SizeEstimator for python

2016-08-15 Thread Maurin Lenglart
Hi, Is there a way to estimate the size of a DataFrame in Python? Something similar to https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html ? Thanks
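For reference, this is the JVM-side API the linked javadoc describes, shown in Scala since, as far as I know, PySpark does not expose a direct wrapper for it; note that calling it on a DataFrame handle would size the driver-side plan object rather than the distributed rows.

import org.apache.spark.util.SizeEstimator

// Estimates the in-memory footprint of an object graph on the JVM heap.
// The collection below is just a stand-in for some driver-side data.
val sample = Seq.fill(1000)(("key", 1L))
val estimatedBytes: Long = SizeEstimator.estimate(sample)
println(s"~$estimatedBytes bytes on the JVM heap")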

Re: OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Xianjin YE
What are your JVM heap size settings? The OOM in SizeEstimator is caused by a lot of entries in an IdentityHashMap. A quick guess is that the object in your dataset is a custom class and you didn't implement the hashCode and equals methods correctly. On Wednesday, April 15, 2015 at 3:10 PM
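A small sketch of the advice above, using an invented EventKey class for illustration: either define equals and hashCode together by hand, or let a case class generate them.

// Hand-written version: equals and hashCode must agree on the same fields.
class EventKey(val userId: Long, val eventType: String) {
  override def equals(other: Any): Boolean = other match {
    case that: EventKey => userId == that.userId && eventType == that.eventType
    case _              => false
  }
  override def hashCode(): Int = 31 * userId.hashCode + eventType.hashCode
}

// Equivalent, with equals/hashCode (and toString) generated automatically.
case class EventKeyCC(userId: Long, eventType: String)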

OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:177)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
at org.apache.spark.util.collection.SizeTracker$class.takeSample

Re: OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
of. Thanks, Aniket. On Wed, Apr 15, 2015 at 1:00 PM Xianjin YE <advance...@gmail.com> wrote: What are your JVM heap size settings? The OOM in SizeEstimator is caused by a lot of entries in an IdentityHashMap. A quick guess is that the object in your dataset is a custom class and you didn't implement

SizeEstimator in Spark 1.1 and high load/object allocation when reading in data

2014-10-30 Thread Erik Freed
are being loaded from our custom HDFS 2.3 RDD, and before we are using even a fraction of the available Java heap and the native off-heap memory, the loading slows to an absolute crawl. It appears clear from our profiling of the Spark executor that in the Spark SizeEstimator an extremely high CPU load