Arnaud Nauwynck created MAPREDUCE-7465:
------------------------------------------

             Summary: performance problem in FileOutputCommitter for big list 
processed by a single thread
                 Key: MAPREDUCE-7465
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7465
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: performance
    Affects Versions: 3.3.6, 3.3.4, 3.3.3, 3.3.5, 3.2.4, 3.3.2, 3.2.3
            Reporter: Arnaud Nauwynck


When committing a big Hadoop job (for example via Spark) with many partitions,
the class FileOutputCommitter renames thousands of directories and files on a
single thread. This is a performance issue, caused by the many waits on
FileSystem storage operations.
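
To illustrate the idea, here is a minimal sketch of issuing the renames from a thread pool so that the waits overlap instead of adding up. It is not the FileOutputCommitter code: the class ParallelRenameSketch and the method renameAll are hypothetical names, and java.nio.file.Files.move stands in for the Hadoop FileSystem rename call so that the example runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of Hadoop.
public class ParallelRenameSketch {

    /**
     * Moves every direct child of srcDir into dstDir using a fixed-size
     * thread pool, and returns the number of entries moved.
     * Files.move is an assumption standing in for FileSystem.rename.
     */
    public static int renameAll(Path srcDir, Path dstDir, int threads)
            throws IOException, InterruptedException {
        // Snapshot the listing first so that the moves do not race with it.
        List<Path> entries = new ArrayList<>();
        try (Stream<Path> children = Files.list(srcDir)) {
            children.forEach(entries::add);
        }
        Files.createDirectories(dstDir);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per entry; each task blocks on its own I/O.
            List<Future<?>> pending = new ArrayList<>();
            for (Path src : entries) {
                pending.add(pool.submit(() -> {
                    Files.move(src, dstDir.resolve(src.getFileName()));
                    return null; // Callable form, so IOException may propagate
                }));
            }
            // Wait for every move and surface the first failure, if any.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (ExecutionException e) {
                    throw new IOException("rename failed", e.getCause());
                }
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

With a single thread, the total time is roughly the sum of all the rename latencies. With a pool, several renames are in flight at once, which matters most on storage where each operation is a remote call.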


Note that subclass instances of FileOutputCommitter are supposed to be 
created at runtime, depending on a configurable property 
([https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java|PathOutputCommitterFactory.java]).

However, in Parquet + Spark for example, this mechanism is buggy and the
committer cannot be changed at runtime.
There is an ongoing Jira issue and PR to fix this in Parquet + Spark: 
[https://issues.apache.org/jira/browse/PARQUET-2416|https://issues.apache.org/jira/browse/PARQUET-2416]





--
This message was sent by Atlassian Jira
(v8.20.10#820010)
