GideonPotok commented on PR #45453:
URL: https://github.com/apache/spark/pull/45453#issuecomment-1989191094

   > @GideonPotok - I think that a better approach for the collation benchmarking 
track is to start with the basics, e.g. unit benchmarks against 
`CollationFactory` + `UTF8String`. For example, what is the perf diff for a simple 
filter, without the rest of the Spark stack, between UTF8_BINARY, 
UTF8_BINARY_LCASE, UNICODE and UNICODE_CI? After the filter, we can do the same for 
`hashFunction`. You should be able to just generate a bunch of `UTF8String`s and guide 
them through the `comparator`/`hashFunction` of each `Collation` in `CollationFactory`.
   > 
   > That way the benchmarking will be actionable. Starting immediately with joins 
is too high up the stack, and I think that we will not be able to do much with the results.
   
   @dbatomic that is extremely helpful, thank you. Will do that.
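
   For illustration only, the kind of unit benchmark suggested above might look 
roughly like the sketch below. It is not code from this PR and it does not call 
Spark: plain `String` and `java.text.Collator` are used as JDK stand-ins for 
`UTF8String` and the `comparator` of each collation, so the numbers it prints say 
nothing about Spark itself. The class name `CollationBenchSketch` and all of its 
helper methods are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Random;

public class CollationBenchSketch {

    // Assumption: JDK stand-ins for the four collations, NOT Spark's comparators.
    // String.compareTo orders by UTF-16 code units, not by UTF-8 bytes.
    static Comparator<String> binary() {
        return Comparator.naturalOrder();
    }

    // Lowercase both sides, then compare (stand-in for a binary-lcase collation).
    static Comparator<String> lcase() {
        return (a, b) -> a.toLowerCase(Locale.ROOT).compareTo(b.toLowerCase(Locale.ROOT));
    }

    // TERTIARY strength distinguishes case (stand-in for a case-sensitive Unicode collation).
    static Comparator<String> unicode() {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(Collator.TERTIARY);
        return c::compare;
    }

    // SECONDARY strength ignores case (stand-in for a case-insensitive Unicode collation).
    static Comparator<String> unicodeCi() {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(Collator.SECONDARY);
        return c::compare;
    }

    // Generate n random mixed-case ASCII strings of the given length.
    static List<String> generate(int n, int len, long seed) {
        Random rnd = new Random(seed);
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                char base = rnd.nextBoolean() ? 'a' : 'A';
                sb.append((char) (base + rnd.nextInt(26)));
            }
            out.add(sb.toString());
        }
        return out;
    }

    // The "simple filter": count the values that compare equal to the needle.
    static long countMatches(List<String> data, String needle, Comparator<String> cmp) {
        long hits = 0;
        for (String s : data) {
            if (cmp.compare(s, needle) == 0) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> data = generate(200_000, 3, 42L);
        String needle = data.get(0);
        String[] names = {"binary", "lcase", "unicode", "unicode_ci"};
        List<Comparator<String>> cmps = List.of(binary(), lcase(), unicode(), unicodeCi());
        for (int i = 0; i < names.length; i++) {
            Comparator<String> cmp = cmps.get(i);
            // Crude warm-up so the JIT has compiled the loop before timing.
            for (int w = 0; w < 5; w++) {
                countMatches(data, needle, cmp);
            }
            long start = System.nanoTime();
            long hits = countMatches(data, needle, cmp);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(names[i] + ": hits=" + hits + ", ms=" + elapsedMs);
        }
    }
}
```

   A single timed pass after a short warm-up is only a rough signal; a proper 
harness such as JMH would give more trustworthy numbers. The same loop shape could 
be reused for `hashFunction` by swapping the comparator for a hash call.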


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

