Yanjia Gary Li resolved HUDI-494.
---------------------------------
    Resolution: Fixed

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
>                 Key: HUDI-494
>                 URL: https://issues.apache.org/jira/browse/HUDI-494
>             Project: Apache Hudi
>          Issue Type: Test
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Major
>              Labels: bug-bash-0.6.0, pull-request-available
>             Fix For: 0.6.0
>
>         Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, image-2020-01-05-07-30-53-567.png
>
>
> I am using a manual build of master after commit
> https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65
> EDIT: tried with the latest master and got the same result.
>
> I am seeing 3 million tasks when the Hudi Spark job writes the files into HDFS. The task count seems to scale with the input size: with 7.7 GB of input there were 3.2 million tasks, and with 9 GB of input there were 3.7 million. Both runs used a parallelism of 10.
>
> I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter
> {code}
> All the stages before this one look normal. Any idea what happened here? My first guess would be something related to the bloom filter index. Maybe something triggers a repartition using the bloom filter index? But I am not really familiar with that part of the code.
>
> Thanks
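>
> For reference, since the stage shows up as "count at HoodieSparkSqlWriter", the job is going through the standard Spark datasource write path. A rough sketch of that kind of job is below; the table name, paths, and field names are placeholders rather than the actual job:
> {code:scala}
> import org.apache.spark.sql.{SaveMode, SparkSession}
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
>
> val spark = SparkSession.builder().appName("hudi-494-sketch").getOrCreate()
>
> // Input of roughly the size described above (7.7-9 GB).
> val df = spark.read.parquet("hdfs:///tmp/input")
>
> df.write
>   .format("org.apache.hudi")
>   .option(TABLE_NAME, "example_table")                  // placeholder table name
>   .option(RECORDKEY_FIELD_OPT_KEY, "id")                // placeholder record key field
>   .option(PRECOMBINE_FIELD_OPT_KEY, "ts")               // placeholder precombine field
>   .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")            // placeholder partition field
>   .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
>   // The "parallelism of 10" mentioned above:
>   .option("hoodie.upsert.shuffle.parallelism", "10")
>   .option("hoodie.insert.shuffle.parallelism", "10")
>   // Note: hoodie.bloom.index.parallelism is left at its default here; when
>   // unset, Hudi auto-computes the bloom index parallelism from the input size.
>   .mode(SaveMode.Append)
>   .save("hdfs:///tmp/hudi/example_table")               // placeholder base path
> {code}
> The shuffle parallelism of 10 is set explicitly in the job; the bloom index stage is the one whose parallelism is not controlled explicitly, which is why I suspect it.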