Hey all.
We've been running into a very annoying problem fairly frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along nicely until, all of a sudden, it freezes up. After a while, we get an error like this one:
attempt_200809261607_0003_m_000002_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_000002_0: java.io.IOException: Could not get block locations. Aborting...
attempt_200809261607_0003_m_000002_0: 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
attempt_200809261607_0003_m_000002_0: 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200809261607_0003_m_000002_0: 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
At approximately the same time, we start seeing lots of these errors
in the namenode log:
2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_000002_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
	at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952)
	at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
	at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
Eventually, the job fails because of these errors. Subsequent job runs hit the same problem and fail as well. The only way we've been able to recover is to restart the DFS. It doesn't happen every time, but it happens often enough that I'm worried.
Does anyone have any idea why this might be happening? I thought https://issues.apache.org/jira/browse/HADOOP-2669 might be the culprit, but we upgraded to Hadoop 0.18.1 today and the problem persists.
Thanks,
Bryan