[jira] [Assigned] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-05-03 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-64:
--

Assignee: (was: Forward Xu)

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
> Issue Type: New Feature
> Components: storage-management, writer-core
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.12.0
>
>
> Core to Hudi writing is the use of heuristics or runtime workload 
> statistics to optimize aspects of storage such as file sizes, partitioning, 
> and so on. 
> The places where this happens are listed below. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46].
>  This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file, so that it can close the parquet file 
> once the configured size has been reached. At the DFSOutputStream level, we 
> only know the bytes written before compression. Once enough data has been 
> written, it should be possible to replace this with a simple estimate of the 
> average record size (commit metadata gives the size and number of records in 
> each file).
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52].
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow while still allowing compaction to produce a parquet file of the 
> configured size. (hope I conveyed this clearly)
>  # WorkloadProfile 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs. updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  to assign records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we would need to guess the 
> input, which is not always possible. 
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized. (default: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
>  
> Our goal in this Jira is to see if we could derive this information in the 
> background purely from the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what is feasible and 
> estimate the ROI before actually implementing anything. 
>  
> Roughly along the lines of [https://github.com/uber/hudi/issues/270] 
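
The idea in the description of estimating an average record size from commit metadata, and using it to decide when a file has reached its configured size, can be sketched roughly as follows. This is only an illustration in Python, not Hudi code: FileWriteStat, average_record_size, records_per_file, and all parameter names are hypothetical, and the sketch assumes that commit metadata exposes the on-disk bytes and record count for each written file, as the description states.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FileWriteStat:
    """Hypothetical per-file entry taken from commit metadata."""
    bytes_written: int  # on-disk (already compressed) size of the file
    num_records: int    # number of records stored in the file


def average_record_size(stats: Iterable[FileWriteStat],
                        min_file_bytes: int = 0) -> Optional[float]:
    """Return bytes per record, weighted by record count.

    Returns None when there is no usable history.
    """
    total_bytes = 0
    total_records = 0
    for stat in stats:
        # Files below min_file_bytes are skipped (assumption: fixed per-file
        # overhead would skew the average for very small files).
        if stat.num_records > 0 and stat.bytes_written >= min_file_bytes:
            total_bytes += stat.bytes_written
            total_records += stat.num_records
    if total_records == 0:
        return None
    return total_bytes / total_records


def records_per_file(target_file_bytes: int,
                     stats: Iterable[FileWriteStat],
                     fallback_record_bytes: float) -> int:
    """Estimate how many records fit in a file of the target size."""
    avg = average_record_size(stats)
    if avg is None:
        # No history yet: fall back to a configured default estimate.
        avg = fallback_record_bytes
    return max(1, int(target_file_bytes // avg))
```

Because the average is computed from bytes already on disk, it folds the effect of compression into a single number, which is why it could stand in for a separately configured compression ratio once enough commits exist. Until then, the fallback value plays the role of the configured default.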



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-02-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-64:
--

Assignee: Forward Xu  (was: Ethan Guo)

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
> Issue Type: New Feature
> Components: storage-management, writer-core
> Reporter: Vinoth Chandar
> Assignee: Forward Xu
> Priority: Blocker
> Labels: help-requested, sev:high
> Fix For: 0.11.0
>





[jira] [Assigned] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-64:
--

Assignee: Ethan Guo  (was: Vinoth Chandar)

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Storage Management, Writer Core
> Reporter: Vinoth Chandar
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: help-requested
> Fix For: 0.12.0
>





[jira] [Assigned] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2019-10-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-64:
--

Assignee: Vinoth Chandar

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: Storage Management, Write Client
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270] 


