Hey all.

We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along quite nicely until, all of a sudden, it freezes up. After a while, we get an error like this one:
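(For reference, the distcp invocation itself is nothing unusual; the hostnames and paths below are just placeholders, but it's roughly of this form:

    hadoop distcp hdfs://namenode:8020/path/to/source hdfs://namenode:8020/path/to/destination

so a plain HDFS-to-HDFS copy with default settings.)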

attempt_200809261607_0003_m_000002_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_000002_0: java.io.IOException: Could not get block locations. Aborting...
attempt_200809261607_0003_m_000002_0:         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
attempt_200809261607_0003_m_000002_0:         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200809261607_0003_m_000002_0:         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

At approximately the same time, we start seeing lots of these errors in the namenode log:

2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_000002_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
        at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952)
        at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
        at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

Eventually, the job fails because of these errors, and subsequent job runs hit the same problem and fail as well. The only way we've been able to recover is to restart the DFS. It doesn't happen every time, but it does happen often enough that I'm worried.

Does anyone have any ideas as to why this might be happening? I thought https://issues.apache.org/jira/browse/HADOOP-2669 might be the culprit, but today we upgraded to Hadoop 0.18.1 and the problem still happens.

Thanks,

Bryan
