GitHub user mccheah opened a pull request: https://github.com/apache/spark/pull/4972
[SPARK-6269] Use a different implementation of java.lang.reflect.Array This patch uses a different implementation of java.lang.reflect.Array. The code is copied and pasted from https://bugs.openjdk.java.net/browse/JDK-8051447 with appropriate style changes for this project. The appropriate code is in the public domain, so we can use it since the proper Apache licensing is on it. The idea is to use pure Java code in implementing the methods there, as opposed to relying on native C code which ends up being ill-performing. This improves the performance of estimating the size of arrays when we are checking for spilling in Spark. Here's the benchmark discussion from the ticket: I did two tests. The first, less convincing, take-with-a-block-of-salt test I did was do a simple groupByKey operation to collect objects in a 4.0 GB text file RDD into 30,000 buckets. I ran 1 Master and 4 Spark Worker JVMs on my mac, fetching the RDD from a text file simply stored on disk, and saving it out to another file located on local disk. The wall clock times I got back before and after the change were: Before: 352.195s, 343.871s, 359.080s After: 342.929583s, 329.456623s, 326.151481s So, there is a bit of an improvement after the change. I also did some YourKit profiling of the executors to get an idea of how much time was spent in size estimation before and after the change. I roughly saw that size estimation took up less of the time after my change, but YourKit's profiling can be inconsistent and who knows if I was profiling the executors that had the same data between runs? The more convincing test I did was to run the size-estimation logic itself in an isolated unit test. I ran the following code: ``` {code} val bigArray = Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString())) test("String arrays only perf testing") { val startTime = System.currentTimeMillis() for (i <- 1 to 50000) { SizeEstimator.estimate(bigArray) } println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0000) } {code} ``` I wanted to use a 2D array specifically because I wanted to measure the performance of repeatedly calling Array.getLength. I used UUID-Strings to ensure that the strings were randomized (so object re-use doesn't happen), but that they would all be the same size. The results were as follows: Before change: 209.275s, 190.107s, 185.424s After change: 160.431s, 149.487s, 151.66s . You can merge this pull request into a Git repository by running: $ git pull https://github.com/palantir/spark feature/spark-6269-reflect-array Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4972.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4972 ---- ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org