[
https://issues.apache.org/jira/browse/HADOOP-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579846#action_12579846
]
Devaraj Das commented on HADOOP-3041:
-------------------------------------
Alejandro, the reason for modifying the job's output dir is to let user apps
transparently deal with things like the creation of side files in the task's
output directory and speculative tasks creating the same output files. Another
reason is that getOutputPath can be used (and usually is) in the
OutputFormat implementation. All user code can call getOutputPath and create
task-specific files there, and the framework automatically promotes or discards
these files when the task succeeds or fails, respectively. See the JavaDoc for
JobConf.getOutputPath() for a clear explanation of what I am trying to say
(by the way, this doc needs to be fixed to mention _temporary).
You are facing the problem because you create a directory at the _same level_ as
the _actual_ output directory of the job. One way to address your problem is to
provide an additional API, such as JobConf.getConfiguredOutputPath, that would
internally do things like getOutputPath.getParent() and return the
directory that was actually configured. This would ensure that your apps don't
break when the framework changes the directory structure of the output path. It
is not the best solution, but we have to arrive at a compromise between your
requirement and what we already document and provide. Thoughts?
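To make the proposal concrete, here is a rough sketch of what such a helper might do. It is only an illustration: getConfiguredOutputPath does not exist today, the class and method names below are hypothetical, java.nio.file.Path stands in for Hadoop's own Path class so the snippet runs on its own, and the _temporary layout is taken from the example in the issue description rather than from the framework code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not Hadoop code: recovers the directory the client
// configured from the task-local path that getOutputPath() returns in a task.
final class OutputPathSketch {

    // Assumed name of the framework's temporary directory (from the issue text).
    static final String TEMP_DIR_NAME = "_temporary";

    // Walks up from the task's output path until it finds the _temporary
    // component and returns that component's parent. If there is no such
    // component, the path is returned unchanged.
    static Path configuredOutputPath(Path taskOutputPath) {
        Path current = taskOutputPath;
        while (current != null) {
            Path name = current.getFileName();
            if (name != null && name.toString().equals(TEMP_DIR_NAME)) {
                Path parent = current.getParent();
                return parent != null ? parent : taskOutputPath;
            }
            current = current.getParent();
        }
        return taskOutputPath;
    }

    public static void main(String[] args) {
        Path inTask = Paths.get("/user/foo/myoutput/_temporary/part_00000");
        System.out.println(configuredOutputPath(inTask));
    }
}
```

An application that needs a directory next to the job output would then derive it from the helper's result instead of from the task-local path, so it would not depend on where the framework puts its temporary files.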
> Within a task, the value of JobConf.getOutputPath() method is modified
> ---------------------------------------------------------------------
>
> Key: HADOOP-3041
> URL: https://issues.apache.org/jira/browse/HADOOP-3041
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.1
> Environment: all
> Reporter: Alejandro Abdelnur
> Priority: Blocker
> Fix For: 0.16.2
>
>
> Until 0.16.0 the value of the getOutputPath() method, if queried within a
> task, pointed to the part file assigned to the task.
> For example: /user/foo/myoutput/part_00000
> In 0.16.1, it now returns an internal Hadoop path for the task's temporary
> output location.
> For the above example: /user/foo/myoutput/_temporary/part_00000
> This change breaks applications that use getOutputPath() to compute other
> directories.
> IMO, this has always been broken: Hadoop should not change the values of
> properties injected by the client; instead it should use private properties
> or internal helper methods.