[ https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu reassigned HUDI-64:
------------------------------

    Assignee: Forward Xu  (was: Ethan Guo)

> Estimation of compression ratio & other dynamic storage knobs based on historical stats
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-64
>                 URL: https://issues.apache.org/jira/browse/HUDI-64
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: storage-management, writer-core
>            Reporter: Vinoth Chandar
>            Assignee: Forward Xu
>            Priority: Blocker
>              Labels: help-requested, sev:high
>             Fix For: 0.11.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload statistics to optimize aspects of storage such as file sizes and partitioning. All such places are listed below.
>
> # Compression ratio for parquet: [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]. This is used by HoodieWrapperFileSystem to estimate the number of bytes it has written for a given parquet file, closing the file once the configured size is reached. At the DFSOutputStream level we only know the bytes written before compression. Once enough data has been written, it should be possible to replace this ratio with a simple estimate of the average record size; the commit metadata gives the size and number of records in each file. (See sketch 1 at the end of this description.)
> # A very similar problem exists for log files: [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]. We write data into logs in avro and can log updates to the same record in parquet multiple times. We again need to estimate how large the log file(s) can grow while compaction can still produce a parquet file of the configured size. (See sketch 2 below.)
> # WorkloadProfile: [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java] caches the input records using Spark caching and computes the shape of the workload, i.e. how many records per partition, how many inserts vs updates, etc. This is used by the Partitioner at [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141] to assign records to file groups. This is the critical one to replace for Flink support, and probably the hardest, since we would need to guess the input, which may not always be possible. (See sketch 3 below.)
> # Within the partitioner, we already derive a simple average size per record from the last commit metadata alone: [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756] (default: [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]). This can be generalized. (See sketch 4 below.)
>
> Our goal in this Jira is to see if we can derive this information in the background purely from the commit metadata. Some parts of this are open-ended; a good starting point would be to see what is feasible and estimate the ROI before actually implementing anything.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].
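>
> Sketch 1, a minimal illustration of item 1 in plain Java (not actual Hudi code; CommitFileStats and all other names here are hypothetical): derive the average post-compression record size from commit history instead of relying on a static compression ratio.
> {code:java}
> import java.util.List;
>
> public class AvgRecordSizeEstimator {
>
>   /** Hypothetical stand-in for the per-file byte/record counts the commit metadata already records. */
>   static final class CommitFileStats {
>     final long bytesWritten;
>     final long recordsWritten;
>     CommitFileStats(long bytesWritten, long recordsWritten) {
>       this.bytesWritten = bytesWritten;
>       this.recordsWritten = recordsWritten;
>     }
>   }
>
>   /** Average on-disk bytes per record over past commits; falls back to a configured default when there is no history. */
>   static long estimateAvgRecordSize(List<CommitFileStats> history, long defaultSize) {
>     long totalBytes = 0, totalRecords = 0;
>     for (CommitFileStats stats : history) {
>       totalBytes += stats.bytesWritten;
>       totalRecords += stats.recordsWritten;
>     }
>     return totalRecords == 0 ? defaultSize : totalBytes / totalRecords;
>   }
>
>   public static void main(String[] args) {
>     List<CommitFileStats> history = List.of(
>         new CommitFileStats(120_000_000L, 1_000_000L),
>         new CommitFileStats(125_000_000L, 1_040_000L));
>     // The writer could then close a parquet file once recordsWritten * avg crosses the target size.
>     System.out.println("estimated bytes/record = " + estimateAvgRecordSize(history, 1024L));
>   }
> }
> {code}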
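>
> Sketch 2, one hedged take on the log-file cap from item 2: assume we can observe from past compactions a ratio of parquet bytes produced per log byte consumed (this ratio and every name below are assumptions, not existing Hudi config or APIs).
> {code:java}
> public class LogFileSizeCap {
>
>   /**
>    * Max total log bytes a file group should accumulate so that compaction can
>    * still produce a parquet file near targetParquetBytes. logToParquetRatio is
>    * parquet-bytes-out / log-bytes-in, measured from past compactions.
>    */
>   static long maxLogBytes(long targetParquetBytes, long currentBaseFileBytes,
>                           double logToParquetRatio) {
>     long remainingParquetBudget = Math.max(0L, targetParquetBytes - currentBaseFileBytes);
>     // Invert the ratio: how many log bytes compact down into the remaining budget.
>     return (long) (remainingParquetBudget / logToParquetRatio);
>   }
>
>   public static void main(String[] args) {
>     // 120 MB target, 90 MB base file, past compactions shrank log bytes ~4x.
>     System.out.println("max log bytes = " + maxLogBytes(120_000_000L, 90_000_000L, 0.25));
>   }
> }
> {code}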
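>
> Sketch 3, the workload-shape computation from item 3 reduced to plain Java over a list (the real WorkloadProfile runs over a cached Spark RDD; TaggedRecord is a hypothetical stand-in for a record tagged with its current location).
> {code:java}
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import java.util.Optional;
>
> public class WorkloadShape {
>
>   /** Hypothetical input record: partition path plus the file group it currently lives in, if any. */
>   record TaggedRecord(String partitionPath, Optional<String> currentFileId) {}
>
>   static final class PartitionStat {
>     long inserts;
>     long updates;
>     @Override public String toString() { return "inserts=" + inserts + ", updates=" + updates; }
>   }
>
>   static Map<String, PartitionStat> profile(List<TaggedRecord> input) {
>     Map<String, PartitionStat> shape = new HashMap<>();
>     for (TaggedRecord r : input) {
>       PartitionStat stat = shape.computeIfAbsent(r.partitionPath(), p -> new PartitionStat());
>       // A record already mapped to a file group is an update; otherwise it is an insert.
>       if (r.currentFileId().isPresent()) {
>         stat.updates++;
>       } else {
>         stat.inserts++;
>       }
>     }
>     return shape;
>   }
>
>   public static void main(String[] args) {
>     List<TaggedRecord> input = List.of(
>         new TaggedRecord("2019/06/01", Optional.empty()),
>         new TaggedRecord("2019/06/01", Optional.of("file-1")),
>         new TaggedRecord("2019/06/02", Optional.empty()));
>     profile(input).forEach((p, s) -> System.out.println(p + " -> " + s));
>   }
> }
> {code}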
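>
> Sketch 4, one possible generalization of the single-commit average from item 4: blend the per-commit averages with an exponential moving average so one skewed commit does not dominate (the smoothing approach and weight are assumptions, not a decided design).
> {code:java}
> import java.util.List;
>
> public class SmoothedRecordSize {
>
>   /** perCommitAvg is ordered oldest to newest; the estimate is seeded with the configured default. */
>   static double smoothedAvgRecordSize(List<Double> perCommitAvg, double defaultSize, double alpha) {
>     double estimate = defaultSize;
>     for (double commitAvg : perCommitAvg) {
>       estimate = alpha * commitAvg + (1 - alpha) * estimate;
>     }
>     return estimate;
>   }
>
>   public static void main(String[] args) {
>     // Config default of 1024 bytes/record; recent commits observed ~120 bytes/record.
>     System.out.println(smoothedAvgRecordSize(List.of(118.0, 122.0, 119.5), 1024.0, 0.5));
>   }
> }
> {code}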

--
This message was sent by Atlassian Jira
(v8.20.1#820001)