[ https://issues.apache.org/jira/browse/SPARK-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356073#comment-14356073 ]

Apache Spark commented on SPARK-6269:
-------------------------------------

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/4972

> Using a different implementation of java array reflection for size estimation
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-6269
>                 URL: https://issues.apache.org/jira/browse/SPARK-6269
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: Matt Cheah
>
> When I use Spark, I notice that size estimation of my multi-dimensional 
> arrays is significantly slow. I have RDDs that contain items of type 
> List<List<String>>, for example, and YourKit indicates that upwards of 15% 
> of an executor's runtime is spent on size estimation of arrays.
>
> There is an open issue in the JDK issue tracker for improving the 
> performance of Java's array reflection: 
> https://bugs.openjdk.java.net/browse/JDK-8051447. It has not been merged, 
> but it includes a block of sample code that is purported to be a more 
> performant implementation. (The code also appears to be explicitly in the 
> public domain: "I've written a simple re-implementation of the methods in 
> pure Java code, pasted below (released into public domain or whatever 
> license you would need if you wanted to use it)".)
>
> Spark currently uses java.lang.reflect.Array heavily to estimate the size 
> of arrays. I tried using the code from the JDK ticket in Spark's size 
> estimation code in place of java.lang.reflect.Array, and I found a 
> performance benefit in the size estimation code when I did the swap.
>
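> To give a sense of the approach (this is a sketch of the idea only, not 
> the exact code from the JDK ticket or from my patch): instead of calling 
> java.lang.reflect.Array.getLength, which goes through native code, the 
> length lookup can dispatch on the concrete array type in pure JVM code:
> {code}
> // Sketch of the JDK-8051447 idea; fastArrayLength is an illustrative
> // name, not Spark's actual SizeEstimator code.
> def fastArrayLength(array: AnyRef): Int = array match {
>   case a: Array[AnyRef]  => a.length // any object array, e.g. Array[String]
>   case a: Array[Int]     => a.length
>   case a: Array[Long]    => a.length
>   case a: Array[Double]  => a.length
>   case a: Array[Float]   => a.length
>   case a: Array[Short]   => a.length
>   case a: Array[Byte]    => a.length
>   case a: Array[Char]    => a.length
>   case a: Array[Boolean] => a.length
>   case other =>
>     throw new IllegalArgumentException("Not an array: " + other)
> }
> {code}
>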
> The benchmark numbers I am posting below will also be posted in the pull 
> request that will be submitted here. But just to get an idea, I did two 
> tests. The first, less convincing, take-with-a-grain-of-salt test was a 
> simple groupByKey operation collecting the objects of a 4.0 GB text-file 
> RDD into 30,000 buckets. I ran 1 Master and 4 Spark Worker JVMs on my Mac, 
> reading the RDD from a text file stored on local disk and saving the 
> result to another file on local disk. The wall-clock times I got before 
> and after the change were:
> Before: 352.195s, 343.871s, 359.080s
> After: 342.929583s, 329.456623s, 326.151481s
> So there is a modest improvement after the change. I also did some YourKit 
> profiling of the executors to get an idea of how much time was spent in 
> size estimation before and after the change. Size estimation roughly took 
> up less of the runtime after my change, but YourKit's profiling can be 
> inconsistent, and I cannot be sure I was profiling executors that held 
> the same data between runs.
>
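> For concreteness, the first test was shaped roughly like this (the paths 
> and the key function are illustrative, and sc is an existing 
> SparkContext):
> {code}
> // Group the lines of a ~4 GB text file into 30,000 buckets and write
> // the grouped result back out to local disk.
> val lines = sc.textFile("/path/to/4gb-input.txt")
> lines.map(line => (math.abs(line.hashCode % 30000), line))
>   .groupByKey()
>   .saveAsTextFile("/path/to/output")
> {code}
>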
> The more convincing test I did was to run the size-estimation logic itself in 
> an isolated unit test. I ran the following code:
> {code}
> // 1000 x 1000 array of distinct strings, all of the same length.
> val bigArray =
>   Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString))
>
> test("String arrays only perf testing") {
>   val startTime = System.currentTimeMillis()
>   for (i <- 1 to 50000) {
>     SizeEstimator.estimate(bigArray)
>   }
>   println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0)
> }
> {code}
> I used a 2D array specifically because I wanted to measure the performance 
> of repeatedly calling Array.getLength. I used UUID strings to ensure that 
> the strings were randomized (so no object re-use happens) but all the same 
> size. The results were as follows:
> Before change: 209.275s, 190.107s, 185.424s
> After change: 160.431s, 149.487s, 151.66s
> I will be submitting a pull request with the proposed performance-improving 
> patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
