[ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658909#action_12658909 ]
Koji Noguchi commented on HADOOP-4927:
--------------------------------------
On one of our clusters, I counted the number of empty "part-" files.
Out of 30 million files/dirs, 4.5 million part- files were empty. 40 users
have more than 10,000 empty files.
bq. If you specify N output partitions then you should generate N output files,
I believe some users did mention that the feature of having exactly N output
files is useful.
If we could somehow make the no-empty-part-files feature configurable, it
would ease our support work a lot.
(Instead of asking our users to implement a custom OutputFormat, I could just
ask them to set the jobconf.)
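
To make the request concrete, here is a minimal sketch of a writer that creates
its part file only on the first write, with a flag to keep the current
behaviour for users who rely on exactly N output files. It uses plain java.nio
and is not Hadoop's API: the class name LazyPartWriter and the createEmptyFiles
flag are hypothetical, with the flag standing in for whatever jobconf setting
would be chosen.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical sketch (not a Hadoop class): a part-file writer that can
 * defer creating its file until the first record is written.
 */
public class LazyPartWriter implements AutoCloseable {
    private final Path partFile;
    private Writer out; // stays null until the part file has been created

    /**
     * @param createEmptyFiles assumed stand-in for a jobconf setting; when
     *        true the file is created immediately, so N partitions always
     *        yield N files, even if some are empty.
     */
    public LazyPartWriter(Path partFile, boolean createEmptyFiles) throws IOException {
        this.partFile = partFile;
        if (createEmptyFiles) {
            open(); // eager creation, as in the behaviour the issue describes
        }
    }

    private void open() throws IOException {
        out = Files.newBufferedWriter(partFile, StandardCharsets.UTF_8);
    }

    /** Writes one tab-separated record, creating the file first if needed. */
    public void write(String key, String value) throws IOException {
        if (out == null) {
            open(); // lazy creation on the first record
        }
        out.write(key + "\t" + value + "\n");
    }

    /** Closes the file if it was ever opened; otherwise leaves nothing behind. */
    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}
```

With the flag off, a task that never writes a record leaves no part file
behind; with it on, every partition still gets a file.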
> Part files on the output filesystem are created irrespective of whether the
> corresponding task has anything to write there
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4927
> URL: https://issues.apache.org/jira/browse/HADOOP-4927
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Devaraj Das
> Fix For: 0.20.0
>
>
> When OutputFormat.getRecordWriter is invoked, a part file is created on the
> output filesystem. But the created RecordWriter is not used until the
> OutputCollector.collect call is made by the task (user's code). This results
> in empty part files even if the OutputCollector.collect is never invoked by
> the corresponding tasks.