Nathan Marz wrote:
> Hello all,
>
> Occasionally when running jobs, Hadoop fails to clean up the
> "_temporary" directories it has left behind. This only appears to
> happen when a task is killed (e.g., a speculative task attempt that
> lost the race), and the data that task has output so far is not
> cleaned up. Is this a known issue in Hadoop?

Yes. In some corner cases, a speculative task can re-create _temporary
after the job's cleanup has already run.
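
As an aside, if the stray attempts themselves are the problem,
speculative execution can be disabled per job. A minimal sketch against
the old mapred API; the class and method names here are placeholders,
but the two property keys are the standard Hadoop 0.x ones:

    import org.apache.hadoop.mapred.JobConf;

    public class NoSpeculationConf {
        // Returns a JobConf with speculative (duplicate) attempts
        // disabled for both map and reduce tasks, so no losing attempt
        // is ever killed mid-write.
        public static JobConf withoutSpeculation(Class<?> jobClass) {
            JobConf job = new JobConf(jobClass);
            job.setBoolean("mapred.map.tasks.speculative.execution", false);
            job.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            return job;
        }
    }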

> Is the data from that task guaranteed to be a duplicate of what was
> output by another task? Is it safe to just delete this directory
> without worrying about losing data?

Yes. You are right. It is duplicate data created by the speculative
task. You can go ahead and delete it.
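
A minimal cleanup sketch using the standard FileSystem API; the class
name and the command-line argument are placeholders for whatever your
job actually uses:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TemporaryDirCleanup {
        public static void main(String[] args) throws Exception {
            // args[0] is the job's output directory, e.g. /user/foo/output
            Path outputDir = new Path(args[0]);
            FileSystem fs = outputDir.getFileSystem(new Configuration());

            // _temporary is the committer's scratch area; once the job
            // has finished, anything left in it is partial output from
            // killed (speculative) attempts and is safe to remove.
            Path tmp = new Path(outputDir, "_temporary");
            if (fs.exists(tmp)) {
                fs.delete(tmp, true); // true = recursive
                System.out.println("Deleted " + tmp);
            }
        }
    }

On older releases, the shell equivalent is:
hadoop fs -rmr <output>/_temporary
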
-Amareshwari

> Thanks,
> Nathan Marz
> Rapleaf