fanyue-xia commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052704919
> > the seed might behave differently across runs/on different machines
>
> Ah I see, this indeed makes sense.
>
> In this case, I think we should fix the generator of rows. It's okay to sacrifice the randomness of rows here. We can have a dedicated row generation function that, depending on the input type, just returns a fixed value (e.g. if the input is int, just return 233; if the input is byte, just return 0xdeadbeef).
>
> Giving up the randomness of the rows should still get the job done. The way the hash is computed is something like `hash(field 1, hash(field 2, seed)...)`, and this part has likely not been touched since the beginning.
>
> https://github.com/apache/spark/blob/6ee662c28ffb0deb70f08a971f9c1869288d39ba/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L289-L298
>
> This should not be changed, and even if it were, it would make little sense to change it to a hash function that hashes (1, 2) and (2, 1) into the same bucket.
>
> Then, as long as we increase the maximum number of fields and the number of schemas (the latter can now be a fairly high number; also pay attention to the test run time), it should behave similarly to having a large number of randomly generated rows.

Thanks for your suggestion. I preserved the randomness of the rows, since it feels more intuitive and we don't need to reason about the underlying hashing scheme for `StructType`. I changed the code to limit generation to a single nested schema and to cap the `Array` size and `String` length when generating rows. The golden file is now smaller than 10 MB.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
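For readers following along: the chained, order-sensitive hashing the comment above describes can be sketched in a few lines. This is a hypothetical Python stand-in (the names `hash_row` and the use of Python's built-in `hash` are illustrative assumptions, not Spark's actual Murmur3/xxHash64 code in `hash.scala`), showing why `hash(field 1, hash(field 2, seed)...)` puts (1, 2) and (2, 1) in different buckets:

```python
def hash_row(fields, seed):
    """Fold a per-field hash left to right, feeding each result back in
    as the seed for the next field. This mirrors the chaining
    hash(field 1, hash(field 2, seed)...) described in the comment;
    the built-in hash() stands in for Spark's per-type hash step."""
    h = seed
    for field in reversed(fields):  # innermost call consumes the last field
        h = hash((field, h))
    return h
```

Because each field's hash depends on the accumulated result so far, reordering the fields changes the final value, which is why swapping the hash for an order-insensitive one would weaken the test.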