GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/4972

    [SPARK-6269] Use a different implementation of java.lang.reflect.Array

    This patch swaps in a different implementation of java.lang.reflect.Array. The code is copied from
    https://bugs.openjdk.java.net/browse/JDK-8051447, with style changes to match this project. That code is in the public domain, so we can use it once the proper Apache license header is applied.
    
    The idea is to implement these methods in pure Java rather than relying on the native C implementations, which perform poorly. This improves the performance of estimating array sizes when Spark checks whether it needs to spill to disk.
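    The technique can be sketched as follows. This is a hypothetical illustration (the class and method names are mine, not the patch's): replacing a call to the native java.lang.reflect.Array.getLength with pure-Java instanceof dispatch on the concrete array type avoids crossing the JNI boundary on every call.

    ```java
    // Hypothetical sketch of the pure-Java approach, not the actual patched class:
    // dispatch on the array's concrete type instead of calling the native
    // java.lang.reflect.Array.getLength.
    public final class FastArray {
        private FastArray() {}

        public static int getLength(Object array) {
            // Any reference-typed array (including nested arrays like String[][])
            // is an Object[], so this one check covers all non-primitive cases.
            if (array instanceof Object[]) return ((Object[]) array).length;
            if (array instanceof int[]) return ((int[]) array).length;
            if (array instanceof long[]) return ((long[]) array).length;
            if (array instanceof double[]) return ((double[]) array).length;
            if (array instanceof float[]) return ((float[]) array).length;
            if (array instanceof byte[]) return ((byte[]) array).length;
            if (array instanceof short[]) return ((short[]) array).length;
            if (array instanceof char[]) return ((char[]) array).length;
            if (array instanceof boolean[]) return ((boolean[]) array).length;
            throw new IllegalArgumentException("Argument is not an array");
        }

        public static void main(String[] args) {
            String[][] grid = new String[3][7];
            System.out.println(getLength(grid));       // 3
            System.out.println(getLength(grid[0]));    // 7
            System.out.println(getLength(new int[5])); // 5
        }
    }
    ```

    A size estimator walking nested arrays calls getLength once per array object, so shaving the native-call overhead adds up quickly on inputs like the 2D benchmark array below.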
    
    Here's the benchmark discussion from the ticket:
    
    I ran two tests. The first, less convincing, take-with-a-grain-of-salt test was a simple groupByKey operation collecting the objects of a 4.0 GB text-file RDD into 30,000 buckets. I ran one Master and four Spark Worker JVMs on my Mac, reading the RDD from a text file on local disk and writing the result back out to another file on local disk. The wall-clock times before and after the change were:
    
    Before: 352.195s, 343.871s, 359.080s
    After: 342.929583s, 329.456623s, 326.151481s
    
    So there is a modest improvement after the change. I also did some YourKit profiling of the executors to see how much time was spent in size estimation before and after the change. Size estimation appeared to take up less of the time after my change, but YourKit's profiling can be inconsistent, and I can't be sure the executors I profiled held the same data across runs.
    
    The more convincing test I did was to run the size-estimation logic itself 
in an isolated unit test. I ran the following code:
    ```scala
    val bigArray = Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString()))
    test("String arrays only perf testing") {
      val startTime = System.currentTimeMillis()
      for (i <- 1 to 50000) {
        SizeEstimator.estimate(bigArray)
      }
      println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0)
    }
    ```
    I used a 2D array specifically because I wanted to measure the performance of repeatedly calling Array.getLength. I used UUID strings to ensure the strings were randomized (so no object re-use happens) while all being the same size. The results were as follows:
    
    Before change: 209.275s, 190.107s, 185.424s
    After change: 160.431s, 149.487s, 151.66s
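    As a quick sanity check on the numbers above (my arithmetic, not part of the patch), averaging the three runs on each side gives roughly a 1.27x speedup:

    ```java
    // Average the reported benchmark runs and compute the before/after speedup.
    public class SpeedupCheck {
        static double mean(double[] xs) {
            double sum = 0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }

        public static void main(String[] args) {
            double[] before = {209.275, 190.107, 185.424};
            double[] after = {160.431, 149.487, 151.66};
            double beforeMean = mean(before); // ~194.94s
            double afterMean = mean(after);   // ~153.86s
            System.out.printf("before=%.2fs after=%.2fs speedup=%.2fx%n",
                beforeMean, afterMean, beforeMean / afterMean);
        }
    }
    ```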

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark feature/spark-6269-reflect-array

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4972.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4972
    