Thanks David. Another solution is to convert the protobuf object to a byte array before the shuffle; that does speed up SizeEstimator a lot.
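Converting to a byte array helps because the estimator then only has to size a flat primitive array instead of recursively walking the protobuf object graph. Here is a toy illustration of that difference in Python; this is NOT Spark's actual SizeEstimator (which samples JVM object layouts), just a sketch of why the flat representation is cheap to size:

```python
def walk_size(obj):
    """Estimate size by walking every nested field (the slow, graph-walk path).

    Toy model only: scalars are assumed to occupy a flat 8-byte slot.
    """
    if isinstance(obj, (bytes, bytearray)):
        return len(obj)
    if isinstance(obj, dict):
        return sum(walk_size(k) + walk_size(v) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return sum(walk_size(x) for x in obj)
    return 8  # scalar: assume one flat 8-byte slot


def flat_size(buf):
    """Sizing a serialized record is O(1): just the array length."""
    return len(buf)


# A nested record forces a full recursive walk...
record = {"id": 1, "tags": ["a", "b"]}
walk_size(record)      # touches every key, value, and list element

# ...while its serialized form is sized in constant time.
flat_size(b"\x08\x01\x12\x01a\x12\x01b")
```

The trade-off is that you pay serialization cost up front and downstream stages see `byte[]` instead of typed messages, so you have to deserialize where you actually need the fields.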
On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <dcapw...@gmail.com> wrote:
> This is used to predict the current cost of memory so Spark knows whether
> to flush or not. This is very costly for us, so we use a flag marked in
> the code as private to lower the cost:
>
> spark.shuffle.spill.numElementsForceSpillThreshold (on phone, hope no
> typo) - how many records before a flush
>
> This lowers the cost because it lets us leave data in young gen; if we
> don't bound it, everything gets promoted to old gen and GC becomes an
> issue. This doesn't solve the fact that the walk is slow, but it lowers
> the cost of GC. For us, we make sure to have spare memory on the system
> for page cache, so spilling to disk is a memory write 99% of the time.
> If your host has less free memory, spilling may become more expensive.
>
> If the walk is your bottleneck and not GC, then I would recommend JOL
> and guessing to better predict memory.
>
> On Mon, Feb 26, 2018, 4:47 PM Xin Liu <xin.e....@gmail.com> wrote:
>
>> Hi folks,
>>
>> We have a situation where the shuffled data is protobuf based, and
>> SizeEstimator is taking a lot of time.
>>
>> We have tried to override SizeEstimator to return a constant value,
>> which speeds things up a lot.
>>
>> My question: what is the side effect of disabling SizeEstimator? Does
>> Spark just do memory reallocation, or are there more severe
>> consequences?
>>
>> Thanks!
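For reference, the flag David mentions can be set like any other Spark property, e.g. in spark-defaults.conf. Note it is marked private/internal in the Spark source, so the name and behavior may change between releases, and the value below is only a placeholder you would tune for your workload:

spark.shuffle.spill.numElementsForceSpillThreshold   500000

Once that many records accumulate in a spillable collection, Spark forces a spill regardless of the estimated size, which is what keeps the data from being promoted out of young gen before it is flushed.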