[ 
https://issues.apache.org/jira/browse/HUDI-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3777:
-----------------------------
    Description: 
Avoid storing filename of each record in the colstats partition.


As of now, we store fileName as part of the value in col stats entries. This 
uses more storage, but makes it easy to get everything in one lookup. 
However, the file name is repeated in every entry's value, and since 
it is UUID-based, each file name adds about 70 bytes to each entry. 

For example, 
let's say we have a table with 1000 columns and 1000 partitions, with each 
partition having 10k files. 

Total entries in col stats partition = 1000*1000*10000 = 10^10, i.e. 10B records. 
So, that's ~700GB for file names alone (10^10 entries * 70 bytes). 
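
The estimate above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the example (the constant and function names are made up for illustration). At 70 bytes per entry, 10^10 entries come to roughly 700 GB of file names, against roughly 80 GB with 8-byte IDs.

```python
# Back-of-the-envelope sizing using the figures from the example above.
# All names here are illustrative; none of them are Hudi identifiers.
COLUMNS = 1_000
PARTITIONS = 1_000
FILES_PER_PARTITION = 10_000
FILE_NAME_BYTES = 70  # approximate size of a UUID-based file name
FILE_ID_BYTES = 8     # fixed-width ID proposed as a replacement


def col_stats_entries(columns: int, partitions: int, files_per_partition: int) -> int:
    """One col stats entry per (column, file) pair across all partitions."""
    return columns * partitions * files_per_partition


def overhead_gb(entries: int, bytes_per_entry: int) -> float:
    """Space taken by the per-entry file reference, in decimal gigabytes."""
    return entries * bytes_per_entry / 1e9


entries = col_stats_entries(COLUMNS, PARTITIONS, FILES_PER_PARTITION)
name_gb = overhead_gb(entries, FILE_NAME_BYTES)
id_gb = overhead_gb(entries, FILE_ID_BYTES)

print(f"entries: {entries:,}")
print(f"file names: {name_gb:.0f} GB, file IDs: {id_gb:.0f} GB")
```

This counts only the bytes spent on the file reference in each entry, before any compression or encoding applied by the storage format.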

Whereas, if we can come up with a mapping of a unique ID for every file name, 
and store the mapping elsewhere (like the FILES partition), we need only 8 bytes 
per entry. 
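
A minimal sketch of the mapping idea, assuming an in-memory registry that hands out sequential 64-bit IDs. `FileIdRegistry`, `encode_file_id`, and `decode_file_id` are hypothetical names for illustration only; they are not Hudi APIs, and the issue does not specify how the IDs would be generated or persisted.

```python
import struct


class FileIdRegistry:
    """Hypothetical name <-> ID mapping kept outside the col stats entries."""

    def __init__(self) -> None:
        self._id_by_name: dict[str, int] = {}
        self._name_by_id: list[str] = []

    def id_for(self, file_name: str) -> int:
        """Return the ID for a file name, assigning the next free one if new."""
        file_id = self._id_by_name.get(file_name)
        if file_id is None:
            file_id = len(self._name_by_id)
            self._id_by_name[file_name] = file_id
            self._name_by_id.append(file_name)
        return file_id

    def name_for(self, file_id: int) -> str:
        """Resolve an ID back to its file name (the extra lookup)."""
        return self._name_by_id[file_id]


def encode_file_id(file_id: int) -> bytes:
    """Serialize an ID as a fixed 8-byte big-endian signed integer."""
    return struct.pack(">q", file_id)


def decode_file_id(raw: bytes) -> int:
    """Inverse of encode_file_id."""
    return struct.unpack(">q", raw)[0]


registry = FileIdRegistry()
file_name = "example-file-0001.parquet"  # placeholder, not a real Hudi file name

# A col stats entry would carry these 8 bytes instead of the full name.
stored = encode_file_id(registry.id_for(file_name))
print(len(stored), registry.name_for(decode_file_id(stored)))
```

The trade-off is the one noted in the description: each entry shrinks to a fixed 8-byte reference, but resolving the file name now takes a second lookup against the mapping.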

  was:Avoid storing filename of each record in the colstats partition.


> Optimize column stats storage
> -----------------------------
>
>                 Key: HUDI-3777
>                 URL: https://issues.apache.org/jira/browse/HUDI-3777
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Blocker
>             Fix For: 0.13.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
