[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-14 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1547119881
> I am not sure it is a good design to introduce spark concepts within hudi-client-common
Obviously it is a bad design that we should avoid; can we just implement the whole spark

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-10 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1543269796
> Hardcoding Murmur is likely a good idea
Not hardcoding; I mean making it configurable, so that users choose the algorithm they want to use.
> it would allow to support both spa
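A rough sketch of what a configurable hashing choice could look like, for illustration only. The class name, the `HashAlgorithm` enum and the `bucketId` method are hypothetical and are not Hudi APIs. The Murmur variant here is the standard MurmurHash3 x86 32-bit function over UTF-8 bytes with seed 0; engines differ in seeds and in how they feed values into the hash, so this is not claimed to match what Spark, Hive or Trino compute.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a user-selectable bucket hash; not a Hudi class. */
public final class ConfigurableBucketHashSketch {

    /** Algorithms a user could pick through configuration (names are assumptions). */
    public enum HashAlgorithm { JAVA_HASHCODE, MURMUR3_32 }

    private ConfigurableBucketHashSketch() {}

    /** Maps a key to a bucket in [0, numBuckets) using the chosen algorithm. */
    public static int bucketId(String key, int numBuckets, HashAlgorithm algorithm) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash;
        switch (algorithm) {
            case JAVA_HASHCODE:
                hash = key.hashCode();
                break;
            case MURMUR3_32:
                hash = murmur3_32(key.getBytes(StandardCharsets.UTF_8), 0);
                break;
            default:
                throw new IllegalArgumentException("Unknown algorithm: " + algorithm);
        }
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    /** Standard MurmurHash3 x86 32-bit over a byte array. */
    public static int murmur3_32(byte[] data, int seed) {
        int h = seed;
        int len = data.length;
        int i = 0;
        // Body: mix full 4-byte little-endian blocks.
        while (i + 4 <= len) {
            int k = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            k *= 0xcc9e2d51;
            k = Integer.rotateLeft(k, 15);
            k *= 0x1b873593;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
            i += 4;
        }
        // Tail: mix the remaining 1 to 3 bytes (intentional fall-through).
        int k = 0;
        switch (len & 3) {
            case 3:
                k ^= (data[i + 2] & 0xff) << 16;
            case 2:
                k ^= (data[i + 1] & 0xff) << 8;
            case 1:
                k ^= (data[i] & 0xff);
                k *= 0xcc9e2d51;
                k = Integer.rotateLeft(k, 15);
                k *= 0x1b873593;
                h ^= k;
            default:
                break;
        }
        // Finalization: force all bits of the hash to avalanche.
        h ^= len;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

Whichever algorithm is selected, writers and readers would need to agree on it, since the bucket a key lands in depends on the hash.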

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-10 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1541617605
> > , I'm afraid the algorithm should be consistent too in order to apply the bucket pruning optimization
> > not sure to understand. Do you mean the hashing algorithm must be t
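To illustrate why bucket pruning depends on a consistent hash, here is a toy sketch. It is an assumption-laden model, not Hudi or engine code: the class and method names are hypothetical, and buckets are modeled as in-memory sets rather than files. A reader that prunes with the same hash as the writer finds the key; a reader that prunes with a different hash looks in the wrong bucket.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntFunction;

/** Hypothetical toy model of bucket pruning; buckets are in-memory sets, not files. */
public final class BucketPruningSketch {

    private BucketPruningSketch() {}

    /** Writer side: places each key in the bucket chosen by the writer's hash. */
    public static Map<Integer, Set<String>> write(
            List<String> keys, int numBuckets, ToIntFunction<String> hash) {
        Map<Integer, Set<String>> buckets = new HashMap<>();
        for (String key : keys) {
            int bucket = (hash.applyAsInt(key) & Integer.MAX_VALUE) % numBuckets;
            buckets.computeIfAbsent(bucket, b -> new HashSet<>()).add(key);
        }
        return buckets;
    }

    /** Reader side: for an equality predicate, scans only the bucket its own hash selects. */
    public static boolean prunedLookup(
            Map<Integer, Set<String>> buckets, String key, int numBuckets,
            ToIntFunction<String> hash) {
        int bucket = (hash.applyAsInt(key) & Integer.MAX_VALUE) % numBuckets;
        Set<String> candidates = buckets.getOrDefault(bucket, Collections.emptySet());
        return candidates.contains(key);
    }
}
```

With keys `"a"`, `"b"`, `"c"` written into 4 buckets using `String::hashCode`, a lookup of `"a"` with the same function succeeds, while a lookup with a shifted hash such as `s -> s.hashCode() + 1` probes a different bucket and misses the row.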

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-09 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1541253334
> ${bucketId}_$
So it seems the naming convention used by Hudi is compatible with Hive in general (not Spark or Trino); the only concern is the hashing algorithm. I'm afraid the algor

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-08 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1539309061
> - hashing
> - file naming
> - file numbering
> - file sorting
Can you elaborate a little more about these items?

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-08 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1538176946
> * the rfc statement about support of hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
Thanks for the detailed analysis, so what the actions th

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-07 Thread via GitHub
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1537691855
> but so far I am not sure what the current status of hudi hashing
It uses only the simple Java hashCode: https://github.com/apache/hudi/blob/20938c30b168d63cf4e520c6b4e1d7b930bed1ab/
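A minimal sketch of what bucketing on a plain Java hash code can look like. The class and method names are hypothetical and are not taken from the Hudi source; the linked file remains the authority on what Hudi actually does. The sketch assumes the hash-key field values are collected into a list and that the list's `hashCode()` drives the bucket choice.

```java
import java.util.List;

/** Hypothetical sketch of Java-hashCode bucketing; not Hudi's actual class. */
public final class JavaHashBucketSketch {

    private JavaHashBucketSketch() {}

    /** Maps the hash-key field values to a bucket in [0, numBuckets). */
    public static int bucketId(List<String> hashKeyFields, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // List.hashCode() combines element hash codes as 31 * h + e.hashCode().
        int hash = hashKeyFields.hashCode();
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }
}
```

Because `String.hashCode()` and `List.hashCode()` are specified by the Java platform, the result is stable across JVMs, but other engines that use a different hash function would assign the same key to a different bucket.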