parisni commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1538139978
> Can you elaborate a little more on the specific role of the hashing algorithm for Hive BUCKET? Could a different algorithm cause incorrect query outputs? Or maybe Hive requires the hashing algorithm to come from a very limited set of choices.

According to https://issues.apache.org/jira/browse/SPARK-19256, Hive itself (and also Presto/Trino) cannot use the Spark hashing algorithm (the file naming spec, the number of files, and the sorting differ as well). Moreover, Spark itself is not able to exploit Hive bucketing.

So I assume Hudi's way of bucketing (which is compliant with neither Hive nor Spark) cannot be used to improve query-engine operations such as joins and filters.

If so, all of the following are wrong:
- the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
- this current PR
- the RFC statement about support of Hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org