[
https://issues.apache.org/jira/browse/SPARK-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-6269.
------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0
Issue resolved by pull request 4972
[https://github.com/apache/spark/pull/4972]
> Using a different implementation of java array reflection for size estimation
> -----------------------------------------------------------------------------
>
> Key: SPARK-6269
> URL: https://issues.apache.org/jira/browse/SPARK-6269
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.3.0
> Reporter: Matt Cheah
> Fix For: 1.4.0
>
>
> When I use Spark, I notice that estimating the size of my multi-dimensional
> arrays tends to be significantly slow. I have RDDs that contain items of type
> List<List<String>>, for example, and YourKit indicates upwards of 15% of an
> executor's runtime is spent during size estimation of arrays.
>
> There is an open issue in the JDK issue tracker for improving the performance
> of Java's array reflection. It has not been merged in, but there is a block
> of sample code there that is purported to be a more performant
> implementation: https://bugs.openjdk.java.net/browse/JDK-8051447. (It seems
> like the code is explicitly in the public domain as well - "I've written a
> simple re-implementation of the methods in pure Java code, pasted below
> (released into public domain or whatever license you would need if you wanted
> to use it)".)
>
> Spark currently uses java.lang.reflect.Array heavily to estimate the size of
> arrays. I tried using the code from the JDK ticket in Spark's size estimation
> logic in place of java.lang.reflect.Array, and measured a performance benefit
> after the swap.
>
> The benchmark numbers I am posting below will also be posted in the pull
> request that will be submitted here. But just to get an idea, I did two
> tests. The first, less convincing, take-with-a-block-of-salt test was to do
> a simple groupByKey operation to collect objects in a 4.0 GB text file RDD
> into 30,000 buckets. I ran 1 Master and 4 Spark Worker JVMs on my Mac,
> fetching the RDD from a text file stored on local disk, and saving it out to
> another file also on local disk. The wall-clock times I got back before and
> after the change were:
> Before: 352.195s, 343.871s, 359.080s
> After: 342.929583s, 329.456623s, 326.151481s
> So, there is a bit of an improvement after the change. I also did some
> YourKit profiling of the executors to get an idea of how much time was spent
> in size estimation before and after the change. Size estimation roughly took
> up less of the time after my change, but YourKit's profiling can be
> inconsistent, and I cannot be sure the executors I profiled held the same
> data between runs.
>
> The more convincing test I did was to run the size-estimation logic itself in
> an isolated unit test. I ran the following code:
> {code}
> val bigArray =
>   Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString()))
>
> test("String arrays only perf testing") {
>   val startTime = System.currentTimeMillis()
>   for (i <- 1 to 50000) {
>     SizeEstimator.estimate(bigArray)
>   }
>   println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0)
> }
> {code}
> I wanted to use a 2D array specifically because I wanted to measure the
> performance of repeatedly calling Array.getLength. I used UUID-Strings to
> ensure that the strings were randomized (so object re-use doesn't happen),
> but that they would all be the same size. The results were as follows:
> Before change: 209.275s, 190.107s, 185.424s
> After change: 160.431s, 149.487s, 151.66s
>
> I will be submitting a pull request with the proposed performance-improving
> patch.
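> The idea can be sketched as follows. This is a hypothetical, minimal
> pure-Java implementation in the spirit of the JDK-8051447 sample code - it is
> not Spark's actual patch nor the ticket's exact code - replacing the native
> call behind java.lang.reflect.Array.getLength with plain instanceof checks:

```java
// Hypothetical sketch (not Spark's patch, not the JDK ticket's exact code):
// answer getLength for any array type via instanceof checks in pure Java,
// avoiding the native call inside java.lang.reflect.Array.getLength.
public final class FastArrayDemo {
    private FastArrayDemo() {}

    public static int getLength(Object array) {
        if (array instanceof Object[])  return ((Object[]) array).length;
        if (array instanceof int[])     return ((int[]) array).length;
        if (array instanceof long[])    return ((long[]) array).length;
        if (array instanceof double[])  return ((double[]) array).length;
        if (array instanceof float[])   return ((float[]) array).length;
        if (array instanceof boolean[]) return ((boolean[]) array).length;
        if (array instanceof byte[])    return ((byte[]) array).length;
        if (array instanceof char[])    return ((char[]) array).length;
        if (array instanceof short[])   return ((short[]) array).length;
        throw new IllegalArgumentException("Not an array: " + array);
    }

    public static void main(String[] args) {
        // A 2D array like the benchmark's: the outer array is an Object[],
        // so the very first instanceof check handles it.
        String[][] grid = new String[3][5];
        System.out.println(getLength(grid));     // 3
        System.out.println(getLength(grid[0]));  // 5
        System.out.println(getLength(new int[7]));
    }
}
```

> Note that the Object[] check covers every reference-typed array (including
> the String[][] in the benchmark above), so only primitive arrays need the
> remaining branches.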
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]