[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia closed the pull request at: https://github.com/apache/spark/pull/12074 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-204146086 Changed the SPARK-14277 JIRA's description, closing this PR.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-204125633 Great! @sitalkedia, do you mind closing this PR in favor of #12096 and updating the SPARK-14277 JIRA's description to match your new PR so that it accurately describes the change that we're going to commit? Thanks!
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-204117055 Thanks @xerial. I tested the new snappy-java release and saw 7.5% CPU savings. Opened a PR https://github.com/apache/spark/pull/12096 to upgrade snappy.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-204089164 Thanks @xerial! @sitalkedia, feel free to open a new PR for the dep. bump after you finish testing this new version.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user xerial commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203748806 Released snappy-java-1.1.2.4 with this fix. Thanks for letting me know.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user xerial commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203734165 @sitalkedia Sure. I'll do that.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203727586 @xerial - I am seeing a similar issue for Snappy writes as well. Can we fix the write code path too?

Stack trace -
org.xerial.snappy.SnappyNative.arrayCopy(Native Method)
org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:273)
org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:115)
org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:202)
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:220)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:126)
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:192)
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:298)
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:338)
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:179)
org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203715133 @JoshRosen - thanks, working on it. Will update soon.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203711830 @sitalkedia, if you confirm that the updated `snappy-java` fixes the performance issue for you, then I'd open a different pull request to upgrade Spark to the newer version.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203707032 @JoshRosen - I guess after @xerial's change, we won't be needing this change, right?
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203704164 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54564/ Test FAILed.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203704159 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203703541 **[Test build #54564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54564/consoleFull)** for PR 12074 at commit [`5ad27f4`](https://github.com/apache/spark/commit/5ad27f47f4b452e17067424b2eda480b1a9ac454). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203703076 That's great. Thanks a lot for the quick fix.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user xerial commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203702958 I have just deployed snappy-java-1.1.2.3 with this fix, which will be synchronized to Maven Central soon.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203701547 Thanks @xerial, this is going to fix all the Snappy read/write inefficiency caused by small reads and writes.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user xerial commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203687378 One reason snappy-java's SnappyInputStream uses Snappy.arrayCopy (a JNI method) is to load the uncompressed data into primitive-type arrays (e.g., float[], int[]), since there is no standard Java method for doing this. When the data is written to a byte[], it is possible to replace the implementation with a non-JNI one (using System.arraycopy).
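The byte[] case described in the comment above can be sketched roughly as follows. This is only an illustration, not snappy-java's actual code: the class `ByteCopySketch` and its methods `copyOut` and `rejectsMixedPrimitiveTypes` are hypothetical names. It shows a plain `System.arraycopy` serving a read into a `byte[]`, and that the same call refuses a `byte[]` to `int[]` copy, which is the situation the comment says the JNI method exists for.

```java
final class ByteCopySketch {

    /**
     * Copies up to len bytes from uncompressed[position..] into dest[off..]
     * without any JNI call, and returns the number of bytes copied.
     * (Hypothetical helper; assumes position <= uncompressed.length.)
     */
    static int copyOut(byte[] uncompressed, int position, byte[] dest, int off, int len) {
        int n = Math.min(len, uncompressed.length - position);
        System.arraycopy(uncompressed, position, dest, off, n);
        return n;
    }

    /**
     * System.arraycopy only copies between arrays of the same primitive type,
     * so a byte[] -> int[] copy is rejected with ArrayStoreException.
     */
    static boolean rejectsMixedPrimitiveTypes() {
        byte[] src = new byte[8];
        int[] dst = new int[2];
        try {
            System.arraycopy(src, 0, dst, 0, 2);
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] uncompressed = {1, 2, 3, 4, 5};
        byte[] dest = new byte[4];
        int copied = copyOut(uncompressed, 1, dest, 0, dest.length);
        System.out.println("copied " + copied + " bytes");
        System.out.println("byte[] -> int[] rejected: " + rejectsMixedPrimitiveTypes());
    }
}
```

For byte-to-byte copies the JVM handles `System.arraycopy` itself, so each small read avoids the per-call JNI transition seen in the stack traces in this thread.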
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203683427 **[Test build #54564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54564/consoleFull)** for PR 12074 at commit [`5ad27f4`](https://github.com/apache/spark/commit/5ad27f47f4b452e17067424b2eda480b1a9ac454).
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203682157 Jenkins, this is ok to test.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user sitalkedia commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203680866 @JoshRosen - There might be other places where buffering would help, but I did not notice any other hotspot during my job run. Also, as you mentioned, pushing this into `wrapForCompression` has the undesirable effect of double buffering.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203656991 Also, /cc @xerial, who may be able to comment on whether `snappy-java` performs any of its own buffering.
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203656188 Is this the only place where buffering helps or would it make sense to do buffered reads from Snappy streams in other circumstances as well? In other words, should this buffering perhaps either be done at more call-sites of `wrapForCompression` or in `wrapForCompression` itself? (Note that pushing this into `wrapForCompression` risks accidental double-buffering, which might be undesirable).
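The double-buffering risk raised above could be guarded against along these lines. This is a sketch under assumptions, not Spark's `wrapForCompression`: the class `BufferOnceSketch` and the helper `bufferIfNeeded` are hypothetical names invented for illustration. The helper adds a `BufferedInputStream` only when the stream it receives is not one already.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

final class BufferOnceSketch {

    /**
     * Hypothetical helper: wraps the stream in a BufferedInputStream unless
     * it already is one, so a central wrapper and its call sites do not
     * stack two buffers on top of each other.
     */
    static InputStream bufferIfNeeded(InputStream in, int bufferSize) {
        if (in instanceof BufferedInputStream) {
            return in; // already buffered, do not add a second copy step
        }
        return new BufferedInputStream(in, bufferSize);
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(new byte[16]);
        InputStream once = bufferIfNeeded(raw, 8192);
        InputStream twice = bufferIfNeeded(once, 8192);
        System.out.println("second call reused the buffer: " + (once == twice));
    }
}
```

The `instanceof` check only recognizes `java.io.BufferedInputStream`. It cannot tell whether some other stream type, such as a compression stream, keeps an internal buffer of its own, so a check like this would not answer the question of where buffering belongs.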
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12074#issuecomment-203651711 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-14277] UnsafeSorterSpillReader should d...
GitHub user sitalkedia opened a pull request: https://github.com/apache/spark/pull/12074

[SPARK-14277] UnsafeSorterSpillReader should do buffered read from un…

## What changes were proposed in this pull request?

While running a Spark job that spills a lot of data in the reduce phase, we see that a significant amount of CPU is consumed in the native Snappy arrayCopy method (please see the stack trace below).

Stack trace -
org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readLong(DataInputStream.java:416)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)

The reason is that the SpillReader does a lot of small reads from the underlying Snappy-compressed stream, and we pay a heavy cost in JNI calls for these small reads. The SpillReader should instead do a buffered read from the underlying Snappy-compressed stream.

## How was this patch tested?

Tested by running the job; we saw more than 10% CPU savings.

…derlying compression stream

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sitalkedia/spark bufferedReader

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12074.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12074
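The effect of the buffered read proposed in this pull request can be sketched with standard `java.io` classes only. This is a minimal illustration, not the patch itself: `BufferedSpillReadSketch`, `CountingInputStream`, `encodeLongs` and `countUnderlyingReads` are hypothetical names, the counting stream merely stands in for the Snappy stream, and the 8192-byte buffer size is an arbitrary assumption. It counts how many read calls reach the underlying stream when many `readLong` calls are issued with and without a `BufferedInputStream` in between.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class BufferedSpillReadSketch {

    /** Hypothetical stand-in for the compressed stream: counts reads that reach it. */
    static final class CountingInputStream extends FilterInputStream {
        int readCalls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            readCalls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            readCalls++;
            return super.read(b, off, len);
        }
    }

    /** Serializes count longs, i.e. many small fixed-size records. */
    static byte[] encodeLongs(int count) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (int i = 0; i < count; i++) {
                out.writeLong(i);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads count longs and returns how many read calls hit the underlying stream. */
    static int countUnderlyingReads(byte[] data, int count, boolean buffered) {
        CountingInputStream underlying =
            new CountingInputStream(new ByteArrayInputStream(data));
        // The only difference between the two modes is this optional wrapper.
        InputStream source =
            buffered ? new BufferedInputStream(underlying, 8192) : underlying;
        try (DataInputStream in = new DataInputStream(source)) {
            for (int i = 0; i < count; i++) {
                in.readLong();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return underlying.readCalls;
    }

    public static void main(String[] args) {
        byte[] data = encodeLongs(1000);
        System.out.println("unbuffered reads: " + countUnderlyingReads(data, 1000, false));
        System.out.println("buffered reads:   " + countUnderlyingReads(data, 1000, true));
    }
}
```

Without the wrapper, every `readLong` turns into its own read on the underlying stream; with it, the underlying stream is asked for data in large chunks and the small reads are served from the in-memory buffer. When each underlying read carries a fixed overhead such as a JNI call, fewer calls is where the savings would come from.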