Vinoth Chandar created HUDI-315:
-----------------------------------

             Summary: Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
                 Key: HUDI-315
                 URL: https://issues.apache.org/jira/browse/HUDI-315
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Performance, Write Client
            Reporter: Vinoth Chandar
https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1

In Hudi, there are two places where we need to obtain statistics on the input data:

- HoodieBloomIndex: to know which partitions need to be loaded and checked against (whether this is still needed with the timeline server enabled is a separate question)
- Workload profile: to get a sense of the number of updates and inserts to each partition/file group

Both of these issue their own groupBy or shuffle computation today. This could be avoided by collecting the statistics with a custom accumulator as a side effect of an existing pass over the data.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
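The idea above could be sketched roughly as follows. This is a minimal, hedged illustration that mirrors the contract of Spark 2.x's AccumulatorV2 (add on executors, merge on the driver, then read the value) without depending on Spark itself; the class and method names are hypothetical and are not existing Hudi or Spark APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tracks per-partition insert/update counts the way a
// Spark AccumulatorV2 would, so the workload profile can be gathered during
// an existing pass over the records instead of a separate groupBy shuffle.
class WorkloadProfileAccumulator {
    // partitionPath -> {insertCount, updateCount}
    private final Map<String, long[]> counts = new HashMap<>();

    // Would be called on executors for each incoming record (AccumulatorV2.add).
    void add(String partitionPath, boolean isUpdate) {
        long[] c = counts.computeIfAbsent(partitionPath, k -> new long[2]);
        if (isUpdate) {
            c[1]++;
        } else {
            c[0]++;
        }
    }

    // Would be called on the driver to combine per-task partial results
    // (AccumulatorV2.merge).
    void merge(WorkloadProfileAccumulator other) {
        other.counts.forEach((partition, oc) -> {
            long[] c = counts.computeIfAbsent(partition, k -> new long[2]);
            c[0] += oc[0];
            c[1] += oc[1];
        });
    }

    long inserts(String partitionPath) {
        return counts.getOrDefault(partitionPath, new long[2])[0];
    }

    long updates(String partitionPath) {
        return counts.getOrDefault(partitionPath, new long[2])[1];
    }
}
```

In real Spark code this would extend `org.apache.spark.util.AccumulatorV2`, be registered with the SparkContext, and have `add` invoked inside the existing tagging/mapping step, so HoodieBloomIndex and the workload profile could both read the result without issuing their own shuffle.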