[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.

2020-09-22 Thread GitBox


mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-696786519


   > While I see the caller of hashUnsafeWords(), I am not sure what hash value is expected.
   
   It's a good question. Right now the only users of `hashUnsafeWords` appear to be `UnsafeRow` and `BytesToBytesMap`. `UnsafeRow` uses 8-byte slots to store all primitive values, and `BytesToBytesMap` appears to enforce that keys are 8-byte aligned. In both cases I am fairly certain they call `hashUnsafeWords` rather than `hashUnsafeBytes` simply as an optimization, since they know the input is 8-byte aligned.
   
   For `UnsafeRow` in particular, sub-8-byte types do not appear to be extended to 8-byte types, just left-aligned in the same slot. For example, a `float` would be left-aligned in the slot rather than converted to a `double` to fill it. A word-by-word hash of such slots would therefore still produce different results on big- and little-endian systems.
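   
   As a standalone illustration (my own sketch, using `ByteBuffer` to emulate both byte orders; this is not Spark code), a `float` left-aligned in an 8-byte slot yields a different 8-byte word depending on the platform's endianness:
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   
   public class SlotEndianness {
     public static void main(String[] args) {
       for (ByteOrder order : new ByteOrder[] {ByteOrder.LITTLE_ENDIAN, ByteOrder.BIG_ENDIAN}) {
         // An 8-byte slot with a 4-byte float left-aligned in it; the
         // remaining 4 bytes stay zero, mimicking a fixed-width UnsafeRow slot.
         ByteBuffer slot = ByteBuffer.allocate(8).order(order);
         slot.putFloat(0, 1.0f);
         // The 8-byte word a word-by-word hash would consume:
         // LITTLE_ENDIAN prints 0x3f800000, BIG_ENDIAN prints 0x3f80000000000000.
         System.out.println(order + ": 0x" + Long.toHexString(slot.getLong(0)));
       }
     }
   }
   ```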
   
   My current understanding is therefore that `hashUnsafeWords` and `hashUnsafeBytes` should produce identical results.
   
   Aside: the use of 'word' here is, in my opinion, ambiguous. Is it 32 bits on a 32-bit system? If we want to be able to hash an array of 8-byte values as if they were encoded in little-endian byte order, I think we should name the method after the Java type identifier, so something like `hashUnsafeLongs`.



[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.

2020-09-21 Thread GitBox


mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-696404313


   > I expect to provide additional tests. Regarding the condition, it is nice to have with me.
   
   Added some basic test coverage for `hashUnsafeWords`.
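   
   For reference, a minimal sketch (my own, not the actual test in this PR) of the kind of invariant such coverage can check, assuming the `Murmur3_x86_32` and `Platform` APIs in `org.apache.spark.unsafe`: for input whose length is a multiple of 8 bytes, the word and byte variants should agree, on any platform byte order.
   
   ```java
   import org.apache.spark.unsafe.Platform;
   import org.apache.spark.unsafe.hash.Murmur3_x86_32;
   
   public class HashUnsafeWordsCheck {
     public static void main(String[] args) {
       byte[] data = new byte[16];  // two 8-byte words
       for (int i = 0; i < data.length; i++) {
         data[i] = (byte) i;
       }
       int seed = 42;
       // hashUnsafeWords requires a word-multiple length; for such input it
       // should produce the same result as hashUnsafeBytes.
       int byWords = Murmur3_x86_32.hashUnsafeWords(
           data, Platform.BYTE_ARRAY_OFFSET, data.length, seed);
       int byBytes = Murmur3_x86_32.hashUnsafeBytes(
           data, Platform.BYTE_ARRAY_OFFSET, data.length, seed);
       if (byWords != byBytes) {
         throw new AssertionError("hashUnsafeWords and hashUnsafeBytes disagree");
       }
     }
   }
   ```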



[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.

2020-09-16 Thread GitBox


mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-693368436


   Thanks, it sounds like the inline code that this PR uses is preferred to Platform class changes at this point, so I'll leave this PR as-is.
   
   It might be worth reassessing this later if/when `jdk.internal.misc.Unsafe` is adopted, since it introduces `getIntUnaligned` and `getLongUnaligned`, which take into account endianness and alignment requirements on the target platform. Using those functions would make this code fully portable and, as part of the standard library, they should be well optimized by the compiler. Certainly the unaligned load and the byte reversal are both treated as intrinsics by HotSpot. I think we'd need to wait until JDK 8 support is dropped to adopt it, though.
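   
   For illustration, a hedged sketch of what that could look like (hypothetical, not a proposal for this PR; `jdk.internal.misc.Unsafe` requires JDK 9+ plus `--add-exports java.base/jdk.internal.misc=ALL-UNNAMED` at compile and run time):
   
   ```java
   import jdk.internal.misc.Unsafe;
   
   public class PortableLoads {
     private static final Unsafe UNSAFE = Unsafe.getUnsafe();
   
     // Reads 4 bytes at `offset` as a little-endian int, regardless of the
     // platform's native byte order or alignment requirements. The boolean
     // argument gives the byte order of the value as stored in memory
     // (true = big-endian, false = little-endian).
     static int getIntLittleEndian(Object base, long offset) {
       return UNSAFE.getIntUnaligned(base, offset, false);
     }
   }
   ```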



[GitHub] [spark] mundaym commented on pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.

2020-09-16 Thread GitBox


mundaym commented on pull request #29762:
URL: https://github.com/apache/spark/pull/29762#issuecomment-693262630


   In an ideal world we'd probably consolidate platform endianness handling into the Platform class by adding methods that perform the necessary conversions. I'm not sure whether that would be a reasonable addition to the Platform class API? We could, for example, have overloads of the getters and setters for at least the primitive types that take a byte order argument, so we could make calls like `Platform.getInt(base, offset, ByteOrder.LITTLE_ENDIAN)`. The implementation would be trivial:
   
   ```java
   public static int getInt(Object object, long offset, ByteOrder order) {
     // Native-order load, then a byte swap only if the requested order
     // differs from the platform's native byte order.
     int x = _UNSAFE.getInt(object, offset);
     return ByteOrder.nativeOrder().equals(order) ? x : Integer.reverseBytes(x);
   }
   ```
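   
   For symmetry, the setter overload the same proposal implies could look like this (again a hypothetical sketch, not part of the current Platform API):
   
   ```java
   public static void putInt(Object object, long offset, int value, ByteOrder order) {
     // Byte-swap before the native-order store if the requested order
     // differs from the platform's native byte order.
     if (!ByteOrder.nativeOrder().equals(order)) {
       value = Integer.reverseBytes(value);
     }
     _UNSAFE.putInt(object, offset, value);
   }
   ```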
   
   That would make changes like this one far cleaner, though it might be slightly more work (and therefore slightly riskier performance-wise) for the JIT compiler to optimize well compared to the inline conversions proposed in this PR.


