Adam Roberts created SPARK-18231:
------------------------------------

             Summary: Optimise SizeEstimator implementation
                 Key: SPARK-18231
                 URL: https://issues.apache.org/jira/browse/SPARK-18231
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 2.0.1, 1.6.2
            Reporter: Adam Roberts
The SizeEstimator is used in Spark to determine whether or not we need to spill; we know spilling typically has an adverse impact on performance and it's something we want to minimise.

We can improve the implementation of SizeEstimator in a variety of ways to gain a performance increase and, ultimately, a reduction in footprint by spilling less.

There are two phases involved here:

1) Refactor to use more efficient data structures, to avoid some (expensive) reflection calls, to remove the use of ScalaRunTime.array_apply, to use ThreadLocalRandom, to store an array of field offsets instead of a list of pointer fields, and to improve the performance of the sample method (a sketch of these ideas follows below).

2) Add JDK specialisms to use exact object sizes, reducing overestimation for both Open/Oracle JDK users and IBM Java users (see the second sketch below).

With a more accurate estimator we can therefore spill less (--footprint, ++performance): we have observed a 15% reduction in RDD sizes, leading to potentially double-digit performance gains on HiBench and micro benchmarks.
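To make phase 1 concrete, here is a minimal sketch (not the actual patch; class and method names are hypothetical) of the general technique: compute an Array[Long] of reference-field offsets once per class, walk instances through sun.misc.Unsafe instead of repeated reflective Field.get calls, and draw array sample indices from ThreadLocalRandom.

{code:scala}
import java.lang.reflect.Modifier
import java.util.concurrent.ThreadLocalRandom

object SizeWalkSketch {
  // Obtain sun.misc.Unsafe reflectively (its constructor is private).
  private val unsafe: sun.misc.Unsafe = {
    val f = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Per-class information computed once: offsets of all non-static reference fields.
  final class ClassInfo(val refFieldOffsets: Array[Long])

  private val cache = new java.util.concurrent.ConcurrentHashMap[Class[_], ClassInfo]()

  def classInfo(cls: Class[_]): ClassInfo = {
    val cached = cache.get(cls)
    if (cached != null) return cached
    val offsets = scala.collection.mutable.ArrayBuffer.empty[Long]
    var c: Class[_] = cls
    while (c != null) {
      for (f <- c.getDeclaredFields if !Modifier.isStatic(f.getModifiers)) {
        if (!f.getType.isPrimitive) {
          offsets += unsafe.objectFieldOffset(f) // resolved once per class, reused for every instance
        }
      }
      c = c.getSuperclass
    }
    val info = new ClassInfo(offsets.toArray)
    cache.putIfAbsent(cls, info)
    info
  }

  // Visit referenced objects via the cached offsets -- no Field.get on the hot path.
  def visitReferences(obj: AnyRef)(visit: AnyRef => Unit): Unit = {
    val info = classInfo(obj.getClass)
    var i = 0
    while (i < info.refFieldOffsets.length) {
      val child = unsafe.getObject(obj, info.refFieldOffsets(i))
      if (child != null) visit(child.asInstanceOf[AnyRef])
      i += 1
    }
  }

  // Sample array elements with ThreadLocalRandom rather than a shared java.util.Random.
  def sampleIndices(length: Int, sampleSize: Int): Array[Int] =
    Array.fill(sampleSize)(ThreadLocalRandom.current().nextInt(length))
}
{code}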
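For phase 2, a minimal sketch of preferring an exact-size API when the running JDK exposes one and falling back to the layout-based estimate otherwise. The Instrumentation hook shown here is an assumption for illustration (it only works when a -javaagent has captured an Instrumentation instance) and is not necessarily how the patch wires in the Open/Oracle or IBM specialisms.

{code:scala}
import java.lang.instrument.Instrumentation

object ExactSizeSketch {
  // Hypothetical hook: set from a -javaagent premain, if one is present.
  @volatile var instrumentation: Option[Instrumentation] = None

  def sizeOf(obj: AnyRef, layoutEstimate: => Long): Long =
    instrumentation match {
      case Some(inst) => inst.getObjectSize(obj) // exact shallow size reported by the JVM
      case None       => layoutEstimate          // fall back to the estimator's own calculation
    }
}
{code}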