GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/17329
[SPARK-19991] FileSegmentManagedBuffer performance improvement

## What changes were proposed in this pull request?

When the configuration items `spark.storage.memoryMapThreshold` and `spark.shuffle.io.lazyFD` are not set, each call to `FileSegmentManagedBuffer.nioByteBuffer` or `FileSegmentManagedBuffer.createInputStream` creates a `NoSuchElementException` instance, which is a relatively expensive operation. In the use case below, this PR improves performance by about 3.5%.

The test code:

``` scala
(1 to 10).foreach { i =>
  val numPartition = 10000
  val rdd = sc.parallelize(0 until numPartition).repartition(numPartition).flatMap { t =>
    (0 until numPartition).map(r => r * numPartition + t)
  }.repartition(numPartition)
  val serializeStart = System.currentTimeMillis()
  rdd.sum()
  val serializeFinish = System.currentTimeMillis()
  println(f"Test $i: ${(serializeFinish - serializeStart) / 1000D}%1.2f")
}
```

and the `spark-defaults.conf` file:

```
spark.master                                      yarn-client
spark.executor.instances                          20
spark.driver.memory                               64g
spark.executor.memory                             30g
spark.executor.cores                              5
spark.default.parallelism                         100
spark.sql.shuffle.partitions                      100
spark.serializer                                  org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize                        0
spark.ui.enabled                                  false
spark.driver.extraJavaOptions                     -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M
spark.executor.extraJavaOptions                   -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking          true
spark.cleaner.referenceTracking.blocking.shuffle  true
```

The test results are as follows:

| [SPARK-19991](https://github.com/witgo/spark/tree/SPARK-19991) | https://github.com/apache/spark/commit/68ea290b3aa89b2a539d13ea2c18bdb5a651b2bf |
|---|---|
| 226.09 s | 235.21 s |

## How was this patch tested?

Existing tests.
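To illustrate the cost described under "What changes were proposed", here is a minimal Java sketch of the pattern, not the actual patch. It assumes a config provider whose default-value lookup is implemented with try/catch around a throwing `get(name)`. The names `ConfigLookupSketch`, `ThrowingProvider`, and `CachedSettings` are hypothetical, and the default values `"2m"` and `"true"` are placeholders. The sketch counts how many exceptions are created when both settings are looked up on every simulated buffer read, compared with resolving them once and reusing the result.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

final class ConfigLookupSketch {

    /** Hypothetical provider: get(name) throws when the key is unset. */
    static final class ThrowingProvider {
        private final Map<String, String> settings = new HashMap<>();
        int exceptionsCreated = 0; // how many times a miss allocated an exception

        void set(String name, String value) {
            settings.put(name, value);
        }

        String get(String name) {
            String value = settings.get(name);
            if (value == null) {
                exceptionsCreated++;
                throw new NoSuchElementException(name);
            }
            return value;
        }

        /** Default-value lookup via try/catch: every miss builds an exception. */
        String get(String name, String defaultValue) {
            try {
                return get(name);
            } catch (NoSuchElementException e) {
                return defaultValue;
            }
        }
    }

    /** Resolves both settings once so later reads never touch the provider. */
    static final class CachedSettings {
        final String memoryMapThreshold;
        final boolean lazyFD;

        CachedSettings(ThrowingProvider provider) {
            // Placeholder defaults, chosen only for illustration.
            this.memoryMapThreshold = provider.get("spark.storage.memoryMapThreshold", "2m");
            this.lazyFD = Boolean.parseBoolean(provider.get("spark.shuffle.io.lazyFD", "true"));
        }
    }

    /** Each iteration stands in for one nioByteBuffer and one createInputStream call. */
    static int exceptionsWithoutCache(int calls) {
        ThrowingProvider provider = new ThrowingProvider();
        for (int i = 0; i < calls; i++) {
            provider.get("spark.storage.memoryMapThreshold", "2m");
            provider.get("spark.shuffle.io.lazyFD", "true");
        }
        return provider.exceptionsCreated;
    }

    /** Same number of reads, but against values resolved once up front. */
    static int exceptionsWithCache(int calls) {
        ThrowingProvider provider = new ThrowingProvider();
        CachedSettings cached = new CachedSettings(provider);
        int lazyReads = 0;
        for (int i = 0; i < calls; i++) {
            if (cached.lazyFD && cached.memoryMapThreshold != null) {
                lazyReads++; // plain field reads, no provider lookup
            }
        }
        return provider.exceptionsCreated;
    }

    public static void main(String[] args) {
        System.out.println("uncached exceptions: " + exceptionsWithoutCache(1000));
        System.out.println("cached exceptions:   " + exceptionsWithCache(1000));
    }
}
```

In this sketch the uncached path creates two exceptions per iteration, while the cached path creates two in total, during construction. Whether the real change caches the values or avoids the throwing lookup some other way is not stated in this description, so treat the caching approach as one possible design.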
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark SPARK-19991

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17329.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17329

----
commit abcfc79991ecd1d5cef2cd1e275b872695ba19d9
Author: Guoqiang Li <liguoqia...@huawei.com>
Date:   2017-03-17T03:19:37Z

    FileSegmentManagedBuffer performance improvement

----