[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101055#comment-17101055 ]
Yanjia Gary Li commented on HUDI-494:
-------------------------------------

Ok, I see what happened here. The root cause is [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]. Basically, commit 1 wrote a very small file (say, 200 records) to a new partition, day=05. When commit 2 then tries to write to day=05, it looks up the affected partition and uses the bloom index range from the existing files, so it uses 200 here. Commit 2 has far more than 200 records, so it creates tons of files because the bloom index range is too small. I am not really familiar with the indexing part of the code, so please let me know if I understand this correctly and we can figure out a fix. [~lamber-ken] [~vinoth]

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
>          Key: HUDI-494
>          URL: https://issues.apache.org/jira/browse/HUDI-494
>      Project: Apache Hudi (incubating)
>   Issue Type: Test
>     Reporter: Yanjia Gary Li
>     Assignee: Yanjia Gary Li
>     Priority: Major
>  Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, image-2020-01-05-07-30-53-567.png
>
> I am using a manually built master after commit
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65].
> EDIT: tried with the latest master but got the same result.
>
> I am seeing 3 million tasks when the Hudi Spark job writes the files into HDFS. It seems related to the input size: with 7.7 GB of input there were 3.2 million tasks, and with 9 GB of input there were 3.7 million. Both runs used a parallelism of 10. I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS.
> In the Spark UI, each task writes fewer than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My first guess would be something related to the bloom filter index. Maybe something triggers a repartitioning with the bloom filter index? But I am not really familiar with that part of the code.
> Thanks

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
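If the diagnosis in the comment above is right, the file fan-out can be sketched with some back-of-the-envelope Java. This is a simplified illustration, not the actual HoodieBloomIndex code: the method name estimateFileCount and all the numbers are made up, with only the 200-record small file taken from the comment.

```java
public class BloomRangeFanOutSketch {

    // Hypothetical sketch: if new files for a partition are sized to match
    // the records-per-file seen in existing files, a tiny file left by
    // commit 1 makes the estimated file count explode for commit 2.
    static long estimateFileCount(long incomingRecords, long recordsPerExistingFile) {
        // Ceiling division: how many files of the "existing" size are
        // needed to hold all incoming records.
        return (incomingRecords + recordsPerExistingFile - 1) / recordsPerExistingFile;
    }

    public static void main(String[] args) {
        long recordsPerExistingFile = 200;   // the small file written by commit 1
        long commit2Records = 50_000_000L;   // made-up size for the larger commit 2
        System.out.println(estimateFileCount(commit2Records, recordsPerExistingFile));
        // prints 250000 -- hundreds of thousands of files/tasks instead of a handful
    }
}
```

Under this sketch, the fix would be to not let a single undersized file dominate the per-file record estimate for a partition.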