[GitHub] [hudi] parisni commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

via GitHub Mon, 08 May 2023 03:27:43 -0700


parisni commented on PR #8657:
URL: https://github.com/apache/hudi/pull/8657#issuecomment-1538139978


   > Can you elaborate a little more on the specific role of the hashing 
algorithm for Hive BUCKET? Can a different algorithm cause incorrect query 
outputs? Or does Hive restrict the hashing algorithm to a very limited set of 
choices?
   
   According to https://issues.apache.org/jira/browse/SPARK-19256, Hive itself 
(and also Presto/Trino) cannot use tables bucketed with the Spark hashing 
algorithm (the file naming spec, the number of files and the sorting differ 
as well). Moreover, Spark itself cannot exploit Hive bucketing. 
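   
   To illustrate why a hash mismatch matters, here is a minimal sketch. Both 
hash functions are hypothetical stand-ins chosen only for illustration; they 
are not the actual Hive, Spark or Hudi implementations. It shows that when the 
writer and the reader disagree on the hash, the reader computes a different 
bucket id for the same key, so bucket pruning or a bucketed join would look in 
the wrong file.

```python
import zlib

NUM_BUCKETS = 8


def writer_bucket(key: int) -> int:
    # Hypothetical writer-side hash: the integer value itself,
    # masked to be non-negative, then taken modulo the bucket count.
    return (key & 0x7FFFFFFF) % NUM_BUCKETS


def reader_bucket(key: int) -> int:
    # Hypothetical reader-side hash: CRC32 over the 8-byte encoding
    # of the key. It stands in for "any other hashing algorithm".
    return zlib.crc32(key.to_bytes(8, "little", signed=True)) % NUM_BUCKETS


keys = range(1000)

# Keys for which the reader would pick a different bucket file
# than the one the writer actually used.
mismatched = [k for k in keys if writer_bucket(k) != reader_bucket(k)]

print(f"{len(mismatched)} of {len(keys)} keys map to a different bucket")
```

   Any key in `mismatched` would be silently skipped by an engine that 
prunes buckets with its own hash, which is why the engine has to either share 
the writer's algorithm or ignore the bucketing metadata entirely.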
   
   So I assume that Hudi's approach (which is compliant with neither Hive nor 
Spark) cannot be used to speed up query-engine operations such as joins and 
filters. This implies that all of the following are wrong:
   - the current config 
https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
   - this current PR
   - the RFC statement about support for Hive bucketing 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
