hudi-bot opened a new issue, #15100: URL: https://github.com/apache/hudi/issues/15100
Avoid storing the file name of each record in the col stats partition.

As of now, we store fileName as part of the value in col stats entries. This costs more storage, but makes it easy to get everything in one lookup. However, the file name is repeated in every entry's value, and since file names are UUID-based, each one adds about 70 bytes to each entry.

For example, say we have a table with 1000 columns and 1000 partitions, with each partition having 10k files. Total entries in the col stats partition = 1000 * 1000 * 10000 = 10^10, i.e. 10B records. At 70 bytes each, that is ~700 GB spent on file names alone. If we instead come up with a mapping from every file name to a unique ID, and store the mapping elsewhere (like the FILES partition), we need only 8 bytes per entry, or ~80 GB.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-3777
- Type: Task
- Epic: https://issues.apache.org/jira/browse/HUDI-1822
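
The storage estimate in the description can be reproduced with a short Python sketch. With 10^10 entries at 70 bytes each, the file names come to 7 * 10^11 bytes, about 700 GB, against about 80 GB for 8-byte IDs. The `FileNameIndex` class and all constant names below are hypothetical and only illustrate the proposed mapping; they are not Hudi APIs, and the 70-byte and 8-byte sizes are the issue's own assumptions.

```python
# Figures taken from the issue description (assumptions, not measurements).
NUM_COLUMNS = 1_000
NUM_PARTITIONS = 1_000
FILES_PER_PARTITION = 10_000
FILE_NAME_BYTES = 70   # approximate size of a UUID-based file name
FILE_ID_BYTES = 8      # size of the proposed numeric file ID


class FileNameIndex:
    """Hypothetical file-name <-> ID mapping, stored once per file
    (for example alongside the FILES partition) rather than per entry."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = []

    def id_for(self, file_name):
        # Assign the next sequential ID the first time a name is seen.
        if file_name not in self._id_by_name:
            self._id_by_name[file_name] = len(self._name_by_id)
            self._name_by_id.append(file_name)
        return self._id_by_name[file_name]

    def name_for(self, file_id):
        return self._name_by_id[file_id]


# One col stats entry per (column, file) pair.
total_files = NUM_PARTITIONS * FILES_PER_PARTITION
total_entries = NUM_COLUMNS * total_files

inline_name_bytes = total_entries * FILE_NAME_BYTES   # file name in every entry
inline_id_bytes = total_entries * FILE_ID_BYTES       # 8-byte ID in every entry
# The mapping itself is paid once per file: name plus ID.
mapping_bytes = total_files * (FILE_NAME_BYTES + FILE_ID_BYTES)

GB = 10 ** 9
print(f"entries:          {total_entries:,}")
print(f"names inline:     {inline_name_bytes / GB:.2f} GB")
print(f"IDs inline:       {inline_id_bytes / GB:.2f} GB")
print(f"one-time mapping: {mapping_bytes / GB:.2f} GB")
```

The trade-off is the one the description names: resolving a file name from a col stats entry would take a second lookup through the mapping instead of being available in a single read.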
