Vinoth Chandar created HUDI-315:
-----------------------------------

             Summary: Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
                 Key: HUDI-315
                 URL: https://issues.apache.org/jira/browse/HUDI-315
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Performance, Write Client
            Reporter: Vinoth Chandar
https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1

In Hudi, there are two places where we need to obtain statistics on the input data:

- HoodieBloomIndex: to know which partitions need to be loaded and checked against (whether this is still needed with the timeline server enabled is a separate question)
- Workload profile: to get a sense of the number of updates and inserts to each partition/file group

Both of these issue their own groupBy or shuffle computation today. This could be avoided by collecting the statistics with a custom accumulator as a side effect of an existing pass over the data.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
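The idea above could be sketched roughly as follows. This is a minimal, hedged illustration that mirrors the contract of Spark 2.x's AccumulatorV2 (add on executors, merge on the driver, then read the value) without depending on Spark itself; the class and method names are hypothetical and are not existing Hudi or Spark APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tracks per-partition insert/update counts the way a
// Spark AccumulatorV2 would, so the workload profile can be gathered during
// an existing pass over the records instead of a separate groupBy shuffle.
class WorkloadProfileAccumulator {
    // partitionPath -> {insertCount, updateCount}
    private final Map<String, long[]> counts = new HashMap<>();

    // Would be called on executors for each incoming record (AccumulatorV2.add).
    void add(String partitionPath, boolean isUpdate) {
        long[] c = counts.computeIfAbsent(partitionPath, k -> new long[2]);
        if (isUpdate) {
            c[1]++;
        } else {
            c[0]++;
        }
    }

    // Would be called on the driver to combine per-task partial results
    // (AccumulatorV2.merge).
    void merge(WorkloadProfileAccumulator other) {
        other.counts.forEach((partition, oc) -> {
            long[] c = counts.computeIfAbsent(partition, k -> new long[2]);
            c[0] += oc[0];
            c[1] += oc[1];
        });
    }

    long inserts(String partitionPath) {
        return counts.getOrDefault(partitionPath, new long[2])[0];
    }

    long updates(String partitionPath) {
        return counts.getOrDefault(partitionPath, new long[2])[1];
    }
}
```

In real Spark code this would extend `org.apache.spark.util.AccumulatorV2`, be registered with the SparkContext, and have `add` invoked inside the existing tagging/mapping step, so HoodieBloomIndex and the workload profile could both read the result without issuing their own shuffle.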