fanyue-xia commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052704919
> > the seed might behave differently across runs/on different machines
>
> Ah I see, this indeed makes sense.
>
> In this case, I think we should fix the generator of rows. It's okay to sacrifice the randomness of rows here. We can have a dedicated row generation function that, depending on the input type, just returns a fixed value (e.g. if the input is int, just return 233; if the input is byte, just return 0xdeadbeef).
>
> Giving up the randomness of the rows should still get the job done. The way the hash is computed is something like `hash(field 1, hash(field 2, seed)...)`, and this part has likely not been touched since the beginning.
>
> https://github.com/apache/spark/blob/6ee662c28ffb0deb70f08a971f9c1869288d39ba/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L289-L298
>
> This should not be changed, and even if it were, it would make little sense to change it to a hash function that hashes (1, 2) and (2, 1) into the same bucket.
>
> Then, as long as we increase the maximum number of fields and the number of schemas (the latter can now be a fairly high number; also pay attention to the test run time), it should behave similarly to having a large number of randomly generated rows.

Thanks for your suggestion. I preserved the randomness of the rows, since it feels more intuitive and we don't need to reason about the underlying hashing scheme for `StructType`. I changed the code to limit generation to a single nested schema and to cap the `Array` size and `String` length when generating rows. The golden file is now smaller than 10 MB.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
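For readers following along: the chained, order-sensitive hashing the comment above describes can be sketched in a few lines. This is a hypothetical Python stand-in (the names `hash_row` and the use of Python's built-in `hash` are illustrative assumptions, not Spark's actual Murmur3/xxHash64 code in `hash.scala`), showing why `hash(field 1, hash(field 2, seed)...)` puts (1, 2) and (2, 1) in different buckets:

```python
def hash_row(fields, seed):
    """Fold a per-field hash left to right, feeding each result back in
    as the seed for the next field. This mirrors the chaining
    hash(field 1, hash(field 2, seed)...) described in the comment;
    the built-in hash() stands in for Spark's per-type hash step."""
    h = seed
    for field in reversed(fields):  # innermost call consumes the last field
        h = hash((field, h))
    return h
```

Because each field's hash depends on the accumulated result so far, reordering the fields changes the final value, which is why swapping the hash for an order-insensitive one would weaken the test.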