Hi All,

We recently moved from Spark 0.9 to 1.1 for an application that handles a fair number of very large datasets partitioned across multiple nodes. About half of each of these datasets is stored in off-heap byte arrays and about half on the standard Java heap.
While these datasets are being loaded from our custom HDFS 2.3 RDD, and before we are using even a fraction of the available Java heap and off-heap native memory, loading slows to an absolute crawl. From profiling the Spark executor it appears clear that the Spark SizeEstimator is consuming an extremely high CPU load, along with a fast and furious allocation of Object[] instances. A GC run clears out all of those Object[] instances. We do not believe we saw this sort of behavior in 0.9, and we have noticed rather significant changes in this part of the BlockManager code going from 0.9 to 1.1 and beyond.

Before we spend large amounts of time either switching back to 0.9 or tracing further toward the root cause, I was wondering if anyone out there has enough experience with that part of the code (or has run into the same problem) to help us understand what sort of root causes might lie behind this strange behavior, and, even better, what we could do to resolve them.

Any help would be very much appreciated.

cheers, Erik
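For anyone skimming: from what we can tell, the estimator walks object graphs reflectively and, for large arrays, samples a fixed number of elements and extrapolates, which would explain both the CPU load and the short-lived Object[] garbage. A minimal sketch of that sampling strategy (illustrative class names and thresholds of our own, not Spark's actual source) looks roughly like:

```java
import java.util.Random;
import java.util.function.ToLongFunction;

// Hedged sketch, NOT Spark's implementation: a sampling-based size estimate
// in the spirit of what SizeEstimator appears to do for large arrays.
// The thresholds below are assumptions for illustration only.
public class SamplingSizeEstimate {
    static final int ARRAY_SIZE_FOR_SAMPLING = 200; // assumed cutoff
    static final int ARRAY_SAMPLE_SIZE = 100;       // assumed sample count

    static long estimate(Object[] arr, ToLongFunction<Object> sizeOf) {
        if (arr.length <= ARRAY_SIZE_FOR_SAMPLING) {
            // Small arrays: walk every element.
            long sum = 0;
            for (Object o : arr) sum += sizeOf.applyAsLong(o);
            return sum;
        }
        // Large arrays: sample a fixed number of elements and extrapolate.
        // If this runs on every block update, the sampling (plus the
        // reflective field walks the real estimator performs per element)
        // shows up as CPU load and transient garbage, which matches what
        // we see in our profiles.
        Random rand = new Random(42);
        long sampleSum = 0;
        for (int i = 0; i < ARRAY_SAMPLE_SIZE; i++) {
            sampleSum += sizeOf.applyAsLong(arr[rand.nextInt(arr.length)]);
        }
        return sampleSum * arr.length / ARRAY_SAMPLE_SIZE;
    }

    public static void main(String[] args) {
        Object[] big = new Object[1000];
        java.util.Arrays.fill(big, "x");
        // With a constant 16-byte per-element cost, extrapolation gives
        // 16 * 1000 regardless of which elements are sampled.
        System.out.println(estimate(big, o -> 16L)); // prints 16000
    }
}
```

If that mental model is right, the cost would scale with how often block sizes are re-estimated rather than with how much memory is actually used, which fits our symptom of slowdown well before heap exhaustion.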