[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.
mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-696404313

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.
mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-696786519

> While I see the caller of hashUnsafeWords(), I am not sure what hash value is expected.

It's a good question. Right now the only users of `hashUnsafeWords` appear to be `UnsafeRow` and `BytesToBytesMap`. `UnsafeRow` uses 8-byte slots to store all primitive values, and `BytesToBytesMap` appears to enforce that keys are 8-byte aligned. In both cases I am fairly certain they call `hashUnsafeWords` rather than `hashUnsafeBytes` simply as an optimization, since they know the input is 8-byte aligned.

For `UnsafeRow` in particular, sub-8-byte types do not appear to be widened to 8-byte types, just left-aligned within the same slot. For example, a `float` is left-aligned in the slot rather than converted to a `double` to fill it. A word-by-word hash of such slots would therefore still produce different results on big- and little-endian systems. My current understanding is therefore that `hashUnsafeWords` and `hashUnsafeBytes` should produce identical results.

Aside: the use of 'word' here is, in my opinion, ambiguous. Is it 32 bits on a 32-bit system? If we wanted a method that hashes an array of 8-byte values as if they were encoded in little-endian byte order, I think we should name it after the Java type identifier, so something like `hashUnsafeLongs`.
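The slot-layout point above can be illustrated with a small standalone sketch (my own illustration, not code from the PR; `ByteBuffer` is used here to emulate the two native byte orders, with `slotAsLong` a hypothetical helper name):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SlotLayoutDemo {
    // Pack a float into the first 4 bytes of an 8-byte slot (left-aligned, as
    // UnsafeRow does), then read the whole slot back as a long in the given
    // "native" byte order. This is the value a word-by-word hash would consume.
    static long slotAsLong(float f, ByteOrder order) {
        ByteBuffer slot = ByteBuffer.allocate(8).order(order);
        slot.putFloat(0, f);     // bytes 0-3 hold the float, bytes 4-7 stay zero
        return slot.getLong(0);
    }

    public static void main(String[] args) {
        long le = slotAsLong(1.0f, ByteOrder.LITTLE_ENDIAN);
        long be = slotAsLong(1.0f, ByteOrder.BIG_ENDIAN);
        // The same left-aligned float decodes to different long values on the
        // two byte orders, so a word-by-word hash diverges between platforms.
        System.out.println(le == be); // prints "false"
    }
}
```

On a little-endian host the float's bits land in the low-order half of the long; on a big-endian host they land in the high-order half, which is exactly why hashing slot-by-slot is not portable for sub-8-byte types.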
[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.
mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-696404313

> I expect to provide additional tests. Regarding the condition, it is nice to have with me.

Added some basic test coverage to `hashUnsafeWords`.
[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.
mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-693368436

Thanks, it sounds like the inline code that this PR uses is preferred to Platform class changes at this point, so I'll leave this PR as-is. It might be worth reassessing this later if/when `jdk.internal.misc.Unsafe` is adopted, since that introduces `getIntUnaligned` and `getLongUnaligned`, which take into account the endianness and alignment requirements of the target platform. Using those methods would make this code fully portable and, as part of the standard library, they should be well optimized by the compiler. Certainly the unaligned load and the byte reversal are both treated as intrinsics by Hotspot. I think we'd need to wait until JDK 8 support is dropped to adopt it, though.
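The load-then-conditionally-swap pattern described above can be sketched as follows (my own standalone illustration: a `ByteBuffer` in native order stands in for the `Unsafe.getInt` load, and the helper names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PortableRead {
    // Emulate Unsafe.getInt: a 4-byte load in the host's native byte order.
    static int nativeLoad(byte[] base, int offset) {
        return ByteBuffer.wrap(base).order(ByteOrder.nativeOrder()).getInt(offset);
    }

    // The pattern the PR inlines: do the native load, then byte-swap on
    // big-endian hosts, so the result is always the little-endian
    // interpretation of the underlying bytes. Both the (possibly unaligned)
    // load and Integer.reverseBytes are Hotspot intrinsics.
    static int getIntLittleEndian(byte[] base, int offset) {
        int x = nativeLoad(base, offset);
        return ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)
            ? x : Integer.reverseBytes(x);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x04, 0x03, 0x02, 0x01};
        // Little-endian interpretation of 04 03 02 01 is 0x01020304 on any host.
        System.out.println(Integer.toHexString(getIntLittleEndian(bytes, 0)));
    }
}
```

`jdk.internal.misc.Unsafe.getIntUnaligned(base, offset, bigEndian)` would collapse this into a single call once it becomes usable.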
[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.
mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-693262630

In an ideal world we'd probably consolidate platform endianness handling into the Platform class by adding methods that perform the necessary conversions. I'm not sure whether that would be a reasonable addition to the Platform class API. We could, for example, add overloads of the getters and setters for at least the primitive types that take a byte order argument, so we could make calls like `Platform.getInt(base, offset, ByteOrder.LITTLE_ENDIAN)`. The implementation would be trivial:

```java
public static int getInt(Object object, long offset, ByteOrder order) {
  int x = _UNSAFE.getInt(object, offset);
  return ByteOrder.nativeOrder().equals(order) ? x : Integer.reverseBytes(x);
}
```

That would make changes like this one far cleaner, though it might be slightly more work for the JIT compiler to optimize well (and therefore be slightly riskier performance-wise) than the proposed inline conversions in this PR.
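For completeness, a setter counterpart would mirror the getter by swapping before the native-order store rather than after the load. A standalone sketch (my own illustration, not Spark code: a byte array plus `ByteBuffer` stands in for `Unsafe` and the real Platform class):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PlatformSketch {
    // Setter counterpart of the proposed getInt overload: swap the value into
    // the requested byte order first, then do a native-order store.
    public static void putInt(byte[] base, int offset, int value, ByteOrder order) {
        int x = ByteOrder.nativeOrder().equals(order)
            ? value : Integer.reverseBytes(value);
        ByteBuffer.wrap(base).order(ByteOrder.nativeOrder()).putInt(offset, x);
    }

    // Matching getter: native-order load, then swap if the requested order
    // differs from the host's.
    public static int getInt(byte[] base, int offset, ByteOrder order) {
        int x = ByteBuffer.wrap(base).order(ByteOrder.nativeOrder()).getInt(offset);
        return ByteOrder.nativeOrder().equals(order) ? x : Integer.reverseBytes(x);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        putInt(buf, 0, 0x12345678, ByteOrder.LITTLE_ENDIAN);
        // The bytes land as 78 56 34 12 regardless of host order, and the
        // getter round-trips the value.
        System.out.println(getInt(buf, 0, ByteOrder.LITTLE_ENDIAN) == 0x12345678);
    }
}
```

With such overloads, the endianness fix in this PR would reduce to passing `ByteOrder.LITTLE_ENDIAN` at the call sites instead of inlining conditional swaps.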