[ 
https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658909#action_12658909
 ] 

Koji Noguchi commented on HADOOP-4927:
--------------------------------------

On one of our clusters, I counted the number of empty "part-" files.

Out of 30 million files/dirs, 4.5 million were empty part- files. 40 users 
had more than 10,000 empty files.

bq. If you specify N output partitions then you should generate N output files, 
I believe some users did mention that having exactly N output files is a 
useful feature.

If we could somehow make the no-empty-part-files behavior configurable, it 
would ease our support work a lot.
(Instead of asking our users to implement a custom OutputFormat, I could just 
ask them to set a property in the JobConf.)
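
To make the request concrete, here is a minimal sketch of the idea in plain 
Java. It does not use the Hadoop API: LazyPartWriter, its boolean "lazy" 
argument (a stand-in for a job-level setting) and runDemo are hypothetical 
names made up for illustration. In eager mode the part file is created as 
soon as the writer is constructed. In lazy mode it is created only when the 
first record is written.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical stand-in for a record writer; NOT a Hadoop class. */
public class LazyPartWriter implements AutoCloseable {
    private final Path partFile;
    private BufferedWriter out; // stays null until the part file is created

    /** @param lazy assumed config switch: true defers file creation to the first write */
    public LazyPartWriter(Path partFile, boolean lazy) {
        this.partFile = partFile;
        if (!lazy) {
            open(); // eager mode: the file exists even if nothing is ever written
        }
    }

    private void open() {
        try {
            out = Files.newBufferedWriter(partFile); // creates the file on disk
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void write(String key, String value) {
        if (out == null) {
            open(); // lazy mode: first record triggers file creation
        }
        try {
            out.write(key + "\t" + value);
            out.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        if (out == null) {
            return; // nothing was written in lazy mode, so no file was created
        }
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulates three tasks, only one of which emits a record; returns the number of part files left behind. */
    public static long runDemo(boolean lazy) {
        try {
            Path dir = Files.createTempDirectory("parts");
            for (int task = 0; task < 3; task++) {
                Path part = dir.resolve(String.format("part-%05d", task));
                try (LazyPartWriter writer = new LazyPartWriter(part, lazy)) {
                    if (task == 1) {
                        writer.write("key", "value");
                    }
                }
            }
            try (Stream<Path> files = Files.list(dir)) {
                return files.count();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling LazyPartWriter.runDemo(false) leaves three part files behind (two of 
them empty), while LazyPartWriter.runDemo(true) leaves only the one that 
received a record. Users who need exactly N output files would keep the eager 
setting.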



> Part files on the output filesystem are created irrespective of whether the 
> corresponding task has anything to write there
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4927
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4927
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Devaraj Das
>             Fix For: 0.20.0
>
>
> When OutputFormat.getRecordWriter is invoked, a part file is created on the 
> output filesystem. But the created RecordWriter is not used until the 
> OutputCollector.collect call is made by the task (user's code). This results 
> in empty part files even if the OutputCollector.collect is never invoked by 
> the corresponding tasks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
