Adam Roberts created SPARK-18231:
------------------------------------

             Summary: Optimise SizeEstimator implementation
                 Key: SPARK-18231
                 URL: https://issues.apache.org/jira/browse/SPARK-18231
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 2.0.1, 1.6.2
            Reporter: Adam Roberts


Spark uses the SizeEstimator to decide whether or not it needs to spill.
Spilling typically has an adverse impact on performance, so it is something
we want to minimise.

We can improve the implementation of SizeEstimator in a variety of ways to gain
a performance increase and, ultimately, a reduction in footprint by spilling
less.

There are two phases involved here:

1) Refactor: use more efficient data structures, avoid some (expensive)
reflection calls, remove the use of ScalaRunTime.array_apply, use
ThreadLocalRandom, store an array of field offsets instead of a list of
pointer fields, and improve the performance of the sample method.

2) Add JDK-specific handling that uses exact object sizes, to reduce
overestimation for both OpenJDK/Oracle JDK users and IBM Java users. With a
more accurate estimator we can spill less, which lowers footprint and improves
performance: we have observed a 15% reduction in RDD sizes, potentially leading
to double-digit performance gains on HiBench and micro benchmarks.
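
To make the two phases more concrete, here is a minimal Java sketch of the
kind of changes described above. It is not the SizeEstimator code or the
proposed patch: the class SizeEstimatorSketch, its methods and the Point
example are hypothetical names made up for illustration. It caches the result
of the reflective field walk in an array per class, draws sample indices with
ThreadLocalRandom, and takes the object header and reference sizes as
parameters, since those are the JVM-dependent values that phase 2 would
replace with exact figures. Field padding is ignored, so the shallow size is
only approximate.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch; names are illustrative, not Spark's.
final class SizeEstimatorSketch {

    // Per-class summary, computed once via reflection and then reused.
    static final class ClassInfo {
        final long primitiveBytes;      // total bytes of primitive instance fields
        final Field[] referenceFields;  // array rather than a linked list
        ClassInfo(long primitiveBytes, Field[] referenceFields) {
            this.primitiveBytes = primitiveBytes;
            this.referenceFields = referenceFields;
        }
    }

    private static final ConcurrentHashMap<Class<?>, ClassInfo> CACHE =
        new ConcurrentHashMap<>();

    // Reflection is expensive, so each class is inspected at most once.
    static ClassInfo classInfo(Class<?> cls) {
        return CACHE.computeIfAbsent(cls, SizeEstimatorSketch::inspect);
    }

    private static ClassInfo inspect(Class<?> cls) {
        long bytes = 0;
        List<Field> refs = new ArrayList<>();
        // Walk the class and its superclasses, skipping static and synthetic fields.
        for (Class<?> c = cls; c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                if (f.getType().isPrimitive()) bytes += primitiveSize(f.getType());
                else refs.add(f);
            }
        }
        return new ClassInfo(bytes, refs.toArray(new Field[0]));
    }

    static int primitiveSize(Class<?> t) {
        if (t == byte.class || t == boolean.class) return 1;
        if (t == char.class || t == short.class) return 2;
        if (t == int.class || t == float.class) return 4;
        return 8; // long and double
    }

    // Approximate shallow size: header + fields, rounded up to 8 bytes.
    // headerBytes and refBytes are assumptions supplied by the caller.
    static long shallowSize(ClassInfo info, int headerBytes, int refBytes) {
        long raw = headerBytes + info.primitiveBytes
            + (long) info.referenceFields.length * refBytes;
        return (raw + 7) & ~7L;
    }

    // Draw k distinct indices from [0, n) without a shared Random instance.
    static int[] sampleDistinct(int n, int k) {
        if (k < 0 || k > n) throw new IllegalArgumentException("need 0 <= k <= n");
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        BitSet drawn = new BitSet(n);
        int[] out = new int[k];
        int filled = 0;
        while (filled < k) {
            int idx = rnd.nextInt(n);
            if (!drawn.get(idx)) {
                drawn.set(idx);
                out[filled++] = idx;
            }
        }
        return out;
    }

    // Example class: one int, one long, one reference, one ignored static.
    static class Point { int x; long y; String label; static int count; }

    public static void main(String[] args) {
        ClassInfo info = classInfo(Point.class);
        System.out.println("primitive bytes: " + info.primitiveBytes);
        System.out.println("reference fields: " + info.referenceFields.length);
        System.out.println("shallow size: " + shallowSize(info, 12, 4));
        System.out.println("sample size: " + sampleDistinct(1000, 100).length);
    }
}
```

The 12-byte header and 4-byte reference passed in main correspond to a 64-bit
HotSpot JVM with compressed references enabled; other JVMs and settings use
different values, which is why they are passed in rather than hard-coded.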



