[GitHub] [hudi] lw309637554 commented on issue #2652: [SUPPORT] I have some questions for hudi clustering

GitBox Thu, 11 Mar 2021 07:42:26 -0800


lw309637554 commented on issue #2652:
URL: https://github.com/apache/hudi/issues/2652#issuecomment-796829101



   > 1. Does the mapping of [<key, partitionpath> -> fileGroupId] change 
after clustering? Can a record be written to another file group?
   > 2. Clustering sorts the columns. Does it change the physical path of the 
record to a different location that is not a partition path, by using InLineFS?
   > 3. Does clustering work on the full Hudi table, or can we choose some 
partitions?
   > 4. Why does clustering ignore files whose size is over the targetFileSize? If 
we ignore such a file, we still pay the cost of a full scan of it.
   > 5. When some files are being compacted, will the clustering scheduler ignore 
these files, and then clustering running will still 

   @shenbinglife @vinothchandar Hello, I can answer these.
   1. Yes, the mapping changes. The records will be written to another file group.
   2. The clustering sort only makes the records within a file group sorted. It uses 
the Spark RDDCustomColumnsSortPartitioner.
   3. Currently it works on the full table. Each run chooses 
"hoodie.clustering.plan.strategy.daybased.lookback.partitions" partitions to 
cluster. You can set this parameter.
   4. Clustering turns small files into large files, because Spark or Presto can 
split a large file, which gives better performance.
   5. Yes.
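
As a rough illustration of the parameter mentioned in answer 3, here is a minimal sketch that collects clustering options in a plain Python dict. Only "hoodie.clustering.plan.strategy.daybased.lookback.partitions" comes from this thread. The other two keys, the helper name `clustering_options`, and all values are assumptions for illustration and should be checked against the configuration reference of the Hudi version in use.

```python
def clustering_options(lookback_partitions, sort_columns):
    """Build a dict of Hudi clustering options (all values as strings).

    Hypothetical helper: only the lookback key is taken from the thread;
    the other keys are assumed names and should be verified.
    """
    return {
        # Assumed key: run clustering inline with writes.
        "hoodie.clustering.inline": "true",
        # From the thread: how many partitions each clustering run chooses.
        "hoodie.clustering.plan.strategy.daybased.lookback.partitions": str(
            lookback_partitions
        ),
        # Assumed key: columns used to sort records within a file group.
        "hoodie.clustering.plan.strategy.sort.columns": ",".join(sort_columns),
    }


# Placeholder values; "col_a" and "col_b" are hypothetical column names.
options = clustering_options(lookback_partitions=2, sort_columns=["col_a", "col_b"])
```

With PySpark, a dict like this could be passed to a DataFrame writer through `.options(**options)` alongside the usual Hudi table options.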


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
