I've been running up against the good old-fashioned "replicated to 0 nodes" 
gremlin quite a bit recently.  My system (a set of processes interacting with 
hadoop, and of course hadoop itself) runs for a while (a day or so) and then I 
get plagued with these errors.  This is a very simple system: a single node 
running pseudo-distributed, so the replication factor is implicitly 1 and the 
datanode is the same machine as the namenode.  None of the typical culprits 
seem to explain the situation, and I'm not sure what to do.  I'm also not sure 
how I've been getting around it so far: I fiddle desperately for a few hours 
and things start running again, but that's not really a solution.  I've tried 
stopping and restarting hdfs, but that doesn't seem to improve things.

So, to go through the common suspects one by one, as quoted on 
http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:

• No DataNode instances being up and running. Action: look at the servers, see 
if the processes are running.

I can interact with hdfs through the command line (doing directory listings for 
example).  Furthermore, I can see that the relevant java processes are all 
running (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker).
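
(For what it's worth, beyond eyeballing the process list, here's a rough little 
sketch I can run to confirm the daemons are actually listening, which isn't 
quite the same thing as showing up in jps.  The ports are just the stock 
pseudo-distributed defaults; adjust if yours are configured differently.)

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class DaemonPortCheck {
      // Stock defaults: namenode web UI on 50070, datanode web UI on 50075,
      // datanode data transfer on 50010.
      private static final int[] PORTS = { 50070, 50075, 50010 };

      public static void main(String[] args) {
        for (int port : PORTS) {
          Socket s = new Socket();
          try {
            s.connect(new InetSocketAddress("localhost", port), 2000);
            System.out.println("port " + port + ": listening");
          } catch (Exception e) {
            System.out.println("port " + port + ": not reachable (" + e + ")");
          } finally {
            try { s.close(); } catch (Exception ignored) {}
          }
        }
      }
    }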

• The DataNode instances cannot talk to the server, through networking or 
Hadoop configuration problems. Action: look at the logs of one of the DataNodes.

Obviously irrelevant in a single-node scenario.  Anyway, like I said, I can 
perform basic hdfs listings; I just can't upload new data.
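
(To be concrete about "listings work, uploads don't", this is roughly my 
minimal reproduction; the path and contents are just placeholders I use for 
testing.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListThenWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up conf/ from the classpath
        FileSystem fs = FileSystem.get(conf);

        // The metadata side is fine: listing the root always works.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
          System.out.println(status.getPath());
        }

        // This is what dies with "could only be replicated to 0 nodes, instead
        // of 1": the namenode can't find a datanode to place the block on.
        FSDataOutputStream out = fs.create(new Path("/tmp/replication-test.txt"));
        out.writeBytes("just a few bytes\n");
        out.close();
      }
    }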

• Your DataNode instances have no hard disk space in their configured data 
directories. Action: look at the dfs.data.dir list in the node configurations, 
verify that at least one of the directories exists, and is writeable by the 
user running the Hadoop processes. Then look at the logs.

There's plenty of space, at least 50GB.
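
(In case it matters, here's roughly how I sanity-checked the configured data 
directories.  Note that a plain new Configuration() only sees what's in my 
conf/ files, so if dfs.data.dir isn't set explicitly it falls back to the 
shipped default under hadoop.tmp.dir.)

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;

    public class DataDirCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        String[] dirs = conf.getStrings("dfs.data.dir");
        if (dirs == null) {
          System.out.println("dfs.data.dir not set in conf/; "
              + "the datanode is using the default under hadoop.tmp.dir");
          return;
        }
        for (String dir : dirs) {
          File f = new File(dir);
          System.out.println(dir
              + "  exists=" + f.exists()
              + "  writable=" + f.canWrite()
              + "  freeGB=" + (f.getUsableSpace() / (1024L * 1024 * 1024)));
        }
      }
    }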

• Your DataNode instances have run out of space. Look at the disk capacity via 
the Namenode web pages. Delete old files. Compress under-used files. Buy more 
disks for existing servers (if there is room), upgrade the existing servers to 
bigger drives, or add some more servers.

Nope, 50 GB free; I'm only uploading a few KB at a time, maybe a few MB.
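
(What might matter more than df, though, is what the namenode itself believes: 
as far as I understand it, block placement is driven by the space figures each 
datanode reports in its heartbeats, the same numbers the namenode web page and 
"hadoop dfsadmin -report" show.  Here's a rough sketch that dumps them; it 
just casts the FileSystem, assuming fs.default.name points at the 
pseudo-distributed hdfs:// URI.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DataNodeSpaceReport {
      public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        // Ask the namenode which datanodes it considers live, and how much
        // room it thinks each one has left.
        DatanodeInfo[] nodes = dfs.getDataNodeStats();
        System.out.println("live datanodes reported by the namenode: " + nodes.length);
        long gb = 1024L * 1024 * 1024;
        for (DatanodeInfo node : nodes) {
          System.out.println(node.getName()
              + "  capacity=" + (node.getCapacity() / gb) + "GB"
              + "  dfsUsed=" + (node.getDfsUsed() / gb) + "GB"
              + "  remaining=" + (node.getRemaining() / gb) + "GB");
        }
      }
    }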

• The reserved space for a DN (as set in dfs.datanode.du.reserved) is greater 
than the remaining free space, so the DN thinks it has no free space

I grepped all the files in the conf directory and couldn't find this parameter, 
so I don't really know anything about it.  At any rate, it seems rather 
esoteric; I doubt it's related to my problem.  Any thoughts on this?
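
(From what I can tell, the datanode subtracts dfs.datanode.du.reserved from 
each volume's free space before telling the namenode how much room it has, and 
the property defaults to 0 bytes when unset, so a reserved value larger than 
the actual free space would make the node look full even with room on disk.  
Here's a rough sketch of that arithmetic, under the assumption that it mirrors 
what the datanode computes; since I never set the property, mine should just 
report the raw free space.)

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;

    public class ReservedSpaceCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults to 0 bytes when the property isn't set anywhere.
        long reserved = conf.getLong("dfs.datanode.du.reserved", 0L);
        String[] dirs = conf.getStrings("dfs.data.dir");
        if (dirs == null) {
          System.out.println("dfs.data.dir not set in conf/; reserved=" + reserved);
          return;
        }
        for (String dir : dirs) {
          long usable = new File(dir).getUsableSpace();
          // Roughly what the datanode would advertise to the namenode:
          // free space on the volume minus the reserved headroom.
          System.out.println(dir + "  usable=" + usable
              + "  reserved=" + reserved
              + "  advertised=" + Math.max(0L, usable - reserved));
        }
      }
    }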

• You may also get this message due to permissions, eg if JT can not create 
jobtracker.info on startup.

Meh, like I said, the system basically works... and then stops working.  The 
only explanation that would really make sense in that context is running out of 
space... which isn't happening.  If this were a permission error, or a 
configuration error, or anything weird like that, then the whole system would 
never get up and running in the first place.

Why would a properly running hadoop system start exhibiting this error without 
running out of disk space?  THAT is the real question on the table here.

Any ideas?

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
                                           --  Edwin A. Abbott, Flatland
________________________________________________________________________________
