Re: SizeEstimator

2018-02-27 Thread David Capwell
Thanks for the reply and sorry for my delayed response; I had to go find the profile data to look up the class again. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala That class uses SizeEstimator (via its size-tracked collections) and has a field "map" which
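For context, a minimal sketch of what that estimation costs, assuming Spark's public org.apache.spark.util.SizeEstimator.estimate API; the HashMap here merely stands in for ExternalSorter's internal "map" collection:

    import org.apache.spark.util.SizeEstimator

    // Stand-in for ExternalSorter's internal map of records.
    val map = new java.util.HashMap[String, Long]()
    (1 to 1000000).foreach(i => map.put(s"key-$i", i.toLong))

    // estimate() walks the entire reachable object graph on every call,
    // so re-estimating a large collection is expensive.
    val bytes = SizeEstimator.estimate(map)
    println(s"estimated size: $bytes bytes")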

Re: SizeEstimator

2018-02-26 Thread 叶先进
What type is the buffer you mentioned? > On 27 Feb 2018, at 11:46 AM, David Capwell wrote: > > advancedxy, I don't remember the code as well > anymore, but what we hit was a very simple schema (string, long). The issue is > the buffer had

Re: SizeEstimator

2018-02-26 Thread David Capwell
advancedxy, I don't remember the code as well anymore, but what we hit was a very simple schema (string, long). The issue is the buffer had a million of these, so SizeEstimator on the buffer had to keep recalculating the same elements over and over again. SizeEstimator was
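A rough illustration of that pattern, assuming Spark's SizeEstimator.estimate and a simple (String, Long) buffer; the periodic re-estimation below mimics what the size-tracking collections do, not Spark's exact sampling logic:

    import org.apache.spark.util.SizeEstimator
    import scala.collection.mutable.ArrayBuffer

    val buffer = ArrayBuffer[(String, Long)]()
    for (i <- 0L until 1000000L) {
      buffer += ((s"record-$i", i))
      // Each re-estimate walks every element already in the buffer, so
      // the same (String, Long) pairs get recalculated over and over.
      if (i % 100000 == 0) SizeEstimator.estimate(buffer)
    }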

Re: SizeEstimator

2018-02-26 Thread Xin Liu
Thanks David. Another solution is to convert the protobuf object to a byte array; it does speed up SizeEstimator. On Mon, Feb 26, 2018 at 5:34 PM, David Capwell wrote: > This is used to predict the current cost of memory so Spark knows whether to flush > or not. This is very costly for
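A minimal sketch of that workaround; here Java serialization stands in for protobuf's toByteArray, and the nested case classes stand in for a complex generated message:

    import org.apache.spark.util.SizeEstimator
    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Stand-in for a complex protobuf message.
    case class Inner(id: Long, tags: Seq[String])
    case class Outer(name: String, inners: Seq[Inner])

    val msg = Outer("m", (1 to 1000).map(i => Inner(i, Seq(s"t$i", s"u$i"))))

    // Estimating the object visits every Inner, Seq cell and String...
    val objEstimate = SizeEstimator.estimate(msg)

    // ...while a flat byte array is a single primitive array, so the
    // estimate is effectively constant time.
    val out = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(out)
    oos.writeObject(msg); oos.close()
    val arrEstimate = SizeEstimator.estimate(out.toByteArray)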

Re: SizeEstimator

2018-02-26 Thread Xin Liu
Thanks! Our protobuf object is fairly complex. Even O(N) takes a lot of time. On Mon, Feb 26, 2018 at 6:33 PM, 叶先进 wrote: > Hi Xin Liu, > > Could you provide a concrete use case if possible (code to reproduce the > protobuf object and comparisons between protobuf and a normal

Re: SizeEstimator

2018-02-26 Thread 叶先进
Hi Xin Liu, Could you provide a concrete use case if possible (code to reproduce the protobuf object and comparisons between protobuf and a normal object)? I contributed a bit to SizeEstimator years ago, and to my understanding, the time complexity should be O(N) where N is the number of referenced
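One quick way to see that O(N) behavior, assuming Spark's SizeEstimator.estimate; the timings below are illustrative only:

    import org.apache.spark.util.SizeEstimator

    def time[A](body: => A): Long = {
      val t0 = System.nanoTime(); body; System.nanoTime() - t0
    }

    // Estimate time grows roughly linearly with the number of
    // referenced objects in the graph.
    for (n <- Seq(1000, 10000, 100000)) {
      val xs = List.fill(n)(new Object)
      println(s"n=$n: ${time(SizeEstimator.estimate(xs))} ns")
    }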

Re: SizeEstimator

2018-02-26 Thread David Capwell
This is used to predict the current cost of memory so Spark knows whether to flush or not. This is very costly for us, so we use a flag marked in the code as private to lower the cost: spark.shuffle.spill.numElementsForceSpillThreshold (on phone, hope no typo) - how many records before a flush. This lowers
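For anyone else trying this, a minimal sketch of setting that internal, undocumented threshold; the value is an arbitrary example, not a recommendation:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Force a spill after this many records, regardless of the
      // estimated size, so SizeEstimator runs less often on huge buffers.
      .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")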