[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22338 Since the change looks safer to me and it does fix the regression, I'm merging it to unblock 2.4 release. Please continue to investigate the root cause, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95721/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22338 **[Test build #95721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95721/testReport)** for PR 22338 at commit [`6cb94ed`](https://github.com/apache/spark/commit/6cb94ed5ed76e23e4fb775a2fd6f0e66e9c15abd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22338 Yeah, it's interesting... Seems both `checkedCast` and `Platform.getByte` are changed and performance gets gain. Any single change doesn't work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22338 > While I say these performance differences, I do not understand why these difference occurs completely. That is why I said "I have not found the root cause". Yes, I am in the same situation. I see the difference but I cannot explain it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22338 While I say these performance differences, I do not understand why these difference occurs completely. That is why I said "I have not found the root cause". Let us narrow down the problem and find the root cause. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22338 **[Test build #95721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95721/testReport)** for PR 22338 at commit [`6cb94ed`](https://github.com/apache/spark/commit/6cb94ed5ed76e23e4fb775a2fd6f0e66e9c15abd). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2877/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22338 you're right @kiszk. Let me update the PR accordingly in order to narrow down the change/problem, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22338 >This does fix the regression, will we also lose the perf speed up too? IIUC we did observe significant perf boost when introducing the memory block. @cloud-fan I checked the original PR and I couldn't find any benchmark related to Murmur3 hash. The only benchmark I found there was using the `HiveHasher`. So I cannot really answer on this. Do you have a benchmark to test it? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22338 In addition to [your commit](https://github.com/apache/spark/pull/22338/commits/91adce590461dda885d88319a700a775e63f9ce6), I applied the following change, basically use `MemoryBlock` in `hashBytesByIntBlock()`. I got 33us. ``` private static int hashBytesByIntBlock(MemoryBlock base, int lengthInBytes, int seed) { assert (lengthInBytes % 4 == 0); int h1 = seed; for (int i = 0; i < lengthInBytes; i += 4) { int halfWord = base.getInt(i); int k1 = mixK1(halfWord); h1 = mixH1(h1, k1); } return h1; } ``` But, furthermore, when I also used `MemoryBlock` in `hashUnsafeBytesBlock()`, I got 111us. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22338 @kiszk yes, I know. Moving `checkedCast` makes a big difference, but if you just move it, without the other changes, there is no perf gain (at least this is what I found in my experiments). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95704/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22338 @mgaido91 thanks, interestingly I did experiments with similar code in my box. While I am using the linux box, I can confirm the performance improvement (or performance recover). Regarding bytecode size, what I have seen in my experiments is the following. I have not found the root cause of this behavior. Your original code: 38us ``` public static int hashUnsafeBytesBlock(MemoryBlock base, int seed) { return hashUnsafeBytesBlock(base, Ints.checkedCast(base.size()), seed); } private static int hashUnsafeBytesBlock(MemoryBlock base, int lengthInBytes, int seed) { // This is not compatible with original and another implementations. // But remain it for backward compatibility for the components existing before 2.3. assert (lengthInBytes >= 0): "lengthInBytes cannot be negative"; int lengthAligned = lengthInBytes - lengthInBytes % 4; ``` Your original code with moving `checkedCast`: 112us ``` public static int hashUnsafeBytesBlock(MemoryBlock base, int seed) { // return hashUnsafeBytesBlock(base, Ints.checkedCast(base.size()), seed); return hashUnsafeBytesBlock(base, -1, seed); } private static int hashUnsafeBytesBlock(MemoryBlock base, int lengthInBytes, int seed) { // This is not compatible with original and another implementations. // But remain it for backward compatibility for the components existing before 2.3. lengthInBytes = Ints.checkedCast(base.size()); assert (lengthInBytes >= 0): "lengthInBytes cannot be negative"; int lengthAligned = lengthInBytes - lengthInBytes % 4; ... ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22338 **[Test build #95704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95704/testReport)** for PR 22338 at commit [`91adce5`](https://github.com/apache/spark/commit/91adce590461dda885d88319a700a775e63f9ce6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22338 This basically reverts the memory block in the hash computing, now the memory block is just a holder of the base object and base offset. This does fix the regression, will we also lose the perf speed up too? IIUC we did observe significant perf boost when introducing the memory block. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22338 @cloud-fan I think I tried doing something like what you suggested but it didn't help. Moreover, the current code in `MemoryBlock` already leverages `Platform.get*` for most of its methods, so that wouldn't really change much I think. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22338 Thanks for working on it! Will it be helpful if we move these hash methods to `MemoryBlock`? e.g. the code can be `int halfWord =bytes[offset + i];` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22338 cc @cloud-fan @kiszk I checked the bytecode generated and the size of the generated code seems not to be the issue either. FYI I am attaching here the disassembled code before and after: [afterPatch.txt](https://github.com/apache/spark/files/2352699/afterPatch.txt) [beforePatch.txt](https://github.com/apache/spark/files/2352700/beforePatch.txt) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22338 **[Test build #95704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95704/testReport)** for PR 22338 at commit [`91adce5`](https://github.com/apache/spark/commit/91adce590461dda885d88319a700a775e63f9ce6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2865/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22338: [SPARK-25317][CORE] Avoid perf regression in Murmur3 Has...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22338 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org