[ https://issues.apache.org/jira/browse/SPARK-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356073#comment-14356073 ]
Apache Spark commented on SPARK-6269:
-------------------------------------

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/4972

Using a different implementation of java array reflection for size estimation
-----------------------------------------------------------------------------

                Key: SPARK-6269
                URL: https://issues.apache.org/jira/browse/SPARK-6269
            Project: Spark
         Issue Type: Bug
         Components: Spark Core
   Affects Versions: 1.3.0
           Reporter: Matt Cheah

When I use Spark, I notice that estimating the size of my multi-dimensional arrays tends to be significantly slow. I have RDDs that contain items of type List<List<String>>, for example, and YourKit indicates that upwards of 15% of an executor's runtime is spent on size estimation of arrays.

There is an open issue in the JDK issue tracker for improving the performance of Java's array reflection: https://bugs.openjdk.java.net/browse/JDK-8051447. It has not been merged in, but it contains a block of sample code that is purported to be a more performant implementation. (The code also appears to be explicitly in the public domain: "I've written a simple re-implementation of the methods in pure Java code, pasted below (released into public domain or whatever license you would need if you wanted to use it)".)

Spark currently uses java.lang.reflect.Array heavily to estimate the size of arrays. I tried using the code from the JDK ticket in Spark's size estimation code in place of java.lang.reflect.Array, and found a performance benefit in the size estimation code when I did the swap.

The benchmark numbers I am posting below will also be posted in the pull request submitted for this issue. Just to get an idea, I did two tests. The first, less convincing, take-with-a-grain-of-salt test was a simple groupByKey operation collecting the objects in a 4.0 GB text file RDD into 30,000 buckets.
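For illustration, the pure-Java approach from the JDK ticket amounts to replacing the native reflective call with explicit instanceof dispatch over the nine possible array types. The sketch below is a hypothetical reconstruction in that spirit; it is not the actual sample code from JDK-8051447 or the code in the Spark patch, and the class name is made up:

```java
// Hypothetical sketch of a pure-Java stand-in for
// java.lang.reflect.Array.getLength, in the spirit of the sample code on
// JDK-8051447. Each branch is a plain instanceof check plus a cast, which
// the JIT can optimize far better than the native reflective call.
public final class FastArrayLength {
    private FastArrayLength() {}

    public static int getLength(Object array) {
        if (array instanceof Object[])  return ((Object[]) array).length;
        if (array instanceof int[])     return ((int[]) array).length;
        if (array instanceof long[])    return ((long[]) array).length;
        if (array instanceof double[])  return ((double[]) array).length;
        if (array instanceof float[])   return ((float[]) array).length;
        if (array instanceof short[])   return ((short[]) array).length;
        if (array instanceof char[])    return ((char[]) array).length;
        if (array instanceof byte[])    return ((byte[]) array).length;
        if (array instanceof boolean[]) return ((boolean[]) array).length;
        throw new IllegalArgumentException("Not an array: " + array);
    }
}
```

The Object[] case is checked first because any reference-typed array (including String[] and nested arrays like String[][]) matches it, so a size estimator walking List<List<String>> data hits that branch immediately.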
I ran 1 Master and 4 Spark Worker JVMs on my Mac, fetching the RDD from a text file simply stored on local disk, and saving it out to another file on local disk. The wall clock times I got back before and after the change were:

Before: 352.195s, 343.871s, 359.080s
After: 342.929583s, 329.456623s, 326.151481s

So there is a bit of an improvement after the change. I also did some YourKit profiling of the executors to get an idea of how much time was spent in size estimation before and after the change. Size estimation roughly took up less of the time after my change, but YourKit's profiling can be inconsistent, and I cannot be sure the executors I profiled held the same data between runs.

The more convincing test was to run the size-estimation logic itself in an isolated unit test. I ran the following code:

{code}
val bigArray =
  Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString()))

test("String arrays only perf testing") {
  val startTime = System.currentTimeMillis()
  for (i <- 1 to 50000) {
    SizeEstimator.estimate(bigArray)
  }
  println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0)
}
{code}

I wanted to use a 2D array specifically because I wanted to measure the performance of repeatedly calling Array.getLength. I used UUID strings to ensure that the strings were randomized (so no object re-use happens) but that they would all be the same size. The results were as follows:

Before change: 209.275s, 190.107s, 185.424s
After change: 160.431s, 149.487s, 151.66s

I will be submitting a pull request with the proposed performance-improving patch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
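To make concrete why a 2D array stresses Array.getLength: a size estimator must recurse through every element of a nested array, paying one reflective getLength call per array and one reflective element access per slot. The walk below is a deliberately naive, hypothetical illustration of that shape; the constants and names are invented, and this is not Spark's actual SizeEstimator:

```java
// Hypothetical, simplified size walk showing why array reflection sits on
// the hot path. Estimating a 1000x1000 array this way makes 1001 calls to
// Array.getLength and over a million calls to Array.get per estimate.
public final class NaiveSizeWalk {
    private NaiveSizeWalk() {}

    static final long REF_SIZE = 8L;       // assumed flat cost per non-array object
    static final long ARRAY_HEADER = 16L;  // assumed per-array header cost

    public static long estimate(Object obj) {
        if (obj == null) return 0L;
        if (!obj.getClass().isArray()) return REF_SIZE; // placeholder leaf cost
        // One reflective length lookup per array encountered...
        int len = java.lang.reflect.Array.getLength(obj);
        long size = ARRAY_HEADER;
        for (int i = 0; i < len; i++) {
            // ...and one reflective element access per slot.
            size += estimate(java.lang.reflect.Array.get(obj, i));
        }
        return size;
    }
}
```

Every outer-array element is itself an array, so the reflective calls scale with the total element count, which is what the benchmark above repeats 50,000 times.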