Thanks David. Another solution is to convert the protobuf object to byte
array, It does speed up SizeEstimator

On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <dcapw...@gmail.com> wrote:

> This is used to predict the current cost of memory so spark knows to flush
> or not. This is very costly for us so we use a flag marked in the code as
> private to lower the cost
>
> spark.shuffle.spill.numElementsForceSpillThreshold (on phone hope no
> typo) - how many records before flush
>
> This lowers the cost because it let's us leave data in young, if we don't
> bound we get everyone promoted to old and GC becomes a issue.  This doesn't
> solve the fact that the walk is slow, but lowers the cost of GC. For us we
> make sure to have spare memory on the system for page cache so spilling to
> disk for us is a memory write 99% of the time.  If your host has less free
> memory spilling may become more expensive.
>
>
> If the walk is your bottleneck and not GC then I would recommend JOL and
> guessing to better predict memory.
>
> On Mon, Feb 26, 2018, 4:47 PM Xin Liu <xin.e....@gmail.com> wrote:
>
>> Hi folks,
>>
>> We have a situation where, shuffled data is protobuf based, and
>> SizeEstimator is taking a lot of time.
>>
>> We have tried to override SizeEstimator to return a constant value, which
>> speeds up things a lot.
>>
>> My questions, what is the side effect of disabling SizeEstimator? Is it
>> just spark do memory reallocation, or there is more severe consequences?
>>
>> Thanks!
>>
>

Reply via email to