SizeEstimator is used to predict the current in-memory cost of the buffered data so Spark knows whether or not to spill. That estimate is very costly for us, so we lower the cost with a flag that is marked private in the code:
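A minimal sketch of setting that spill-threshold flag when building the context (the 5,000,000-record value is only an illustrative assumption; tune it against your own heap and record size):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SpillThresholdConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("spill-threshold-example")
                // Force a spill after this many buffered records, regardless
                // of what SizeEstimator reports for them.
                .set("spark.shuffle.spill.numElementsForceSpillThreshold",
                     "5000000");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}
```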
spark.shuffle.spill.numElementsForceSpillThreshold (typing on my phone, hope there's no typo) - how many records to buffer before forcing a spill.

This lowers the cost because it lets us leave data in the young generation; if we don't bound the record count, everything gets promoted to the old generation and GC becomes an issue. It doesn't fix the fact that the object-graph walk itself is slow, but it lowers the cost of GC.

For us, we make sure the host has spare memory for the page cache, so spilling to disk is a memory write 99% of the time. If your host has less free memory, spilling may become more expensive.

If the walk, and not GC, is your bottleneck, then I would recommend JOL plus educated guessing to better predict memory use.

On Mon, Feb 26, 2018, 4:47 PM Xin Liu <xin.e....@gmail.com> wrote:

> Hi folks,
>
> We have a situation where shuffled data is protobuf based, and
> SizeEstimator is taking a lot of time.
>
> We have tried to override SizeEstimator to return a constant value, which
> speeds things up a lot.
>
> My question: what is the side effect of disabling SizeEstimator? Is it
> just that Spark does memory reallocation, or are there more severe
> consequences?
>
> Thanks!
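A footnote on the JOL point above: if the full SizeEstimator walk is your bottleneck, one cheap "guess" is to measure only a small sample of records and extrapolate. This is a minimal sketch of that idea, not Spark's implementation; it uses raw byte-array length as a stand-in for a real per-object measurement such as JOL's GraphLayout.parseInstance(obj).totalSize():

```java
import java.util.ArrayList;
import java.util.List;

public class SampleSizeEstimator {
    // Measure at most sampleLimit records, then extrapolate the average
    // across the whole list. Byte-array length is an assumed proxy for a
    // real deep-size measurement (e.g. JOL's GraphLayout.totalSize()).
    static long estimateBytes(List<byte[]> records, int sampleLimit) {
        if (records.isEmpty()) return 0;
        int n = Math.min(sampleLimit, records.size());
        long sampled = 0;
        for (int i = 0; i < n; i++) {
            sampled += records.get(i).length;
        }
        // Extrapolate the sampled average to the full record count.
        return (sampled / n) * records.size();
    }

    public static void main(String[] args) {
        List<byte[]> recs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) recs.add(new byte[100]);
        // 10 samples of 100 bytes each, extrapolated over 1000 records.
        System.out.println(estimateBytes(recs, 10)); // prints 100000
    }
}
```

The trade-off is the same one discussed above: a cheap estimate can be wrong for skewed data, so Spark may spill too early or too late.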