I've been running up against the good old-fashioned "replicated to 0 nodes" gremlin quite a bit recently. My system (a set of processes interacting with Hadoop, and of course Hadoop itself) runs for a while (a day or so) and then gets plagued with these errors. This is a very simple system: a single node running pseudo-distributed. Obviously, the replication factor is implicitly 1 and the datanode is the same machine as the namenode. None of the typical culprits seem to explain the situation and I'm not sure what to do. I'm also not sure how I've been getting around it so far. I fiddle desperately for a few hours and things start running again, but that's not really a solution... I've tried stopping and restarting HDFS, but that doesn't seem to improve things.
So, to go through the common suspects one by one, as quoted on http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:

• No DataNode instances being up and running. Action: look at the servers, see if the processes are running.
  I can interact with HDFS through the command line (doing directory listings, for example). Furthermore, I can see that the relevant Java processes are all running (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker).

• The DataNode instances cannot talk to the server, through networking or Hadoop configuration problems. Action: look at the logs of one of the DataNodes.
  Obviously irrelevant in a single-node scenario. Anyway, like I said, I can perform basic HDFS listings, I just can't upload new data.

• Your DataNode instances have no hard disk space in their configured data directories. Action: look at the dfs.data.dir list in the node configurations, verify that at least one of the directories exists, and is writeable by the user running the Hadoop processes. Then look at the logs.
  There's plenty of space, at least 50GB.

• Your DataNode instances have run out of space. Look at the disk capacity via the Namenode web pages. Delete old files. Compress under-used files. Buy more disks for existing servers (if there is room), upgrade the existing servers to bigger drives, or add some more servers.
  Nope, 50GB free, and I'm only uploading a few KB at a time, maybe a few MB.

• The reserved space for a DN (as set in dfs.datanode.du.reserved) is greater than the remaining free space, so the DN thinks it has no free space.
  I grepped all the files in the conf directory and couldn't find this parameter, so I don't really know anything about it. At any rate, it seems rather esoteric and I doubt it is related to my problem. Any thoughts on this?

• You may also get this message due to permissions, e.g. if JT can not create jobtracker.info on startup.
  Meh, like I said, the system basically works...and then stops working.
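For the reserved-space item, the arithmetic can be sketched roughly as follows. This is only a simplified model in Python, not Hadoop's actual accounting: the function names, the idea that usable space is simply free space minus the reserve, and the example numbers are all assumptions made for illustration. If dfs.datanode.du.reserved does not appear in any conf file, the reserve passed in here would be whatever default the installed Hadoop version uses.

```python
import shutil


def remaining_after_reserve(free_bytes: int, reserved_bytes: int) -> int:
    """Simplified model (an assumption, not Hadoop's code): space left for
    HDFS blocks once the non-DFS reserve is subtracted, never below zero."""
    return max(0, free_bytes - reserved_bytes)


def check_data_dir(path: str, reserved_bytes: int = 0) -> int:
    """Apply the same model to a real directory, e.g. one of the paths
    listed in dfs.data.dir. The reserve value is supplied by the caller."""
    free = shutil.disk_usage(path).free
    return remaining_after_reserve(free, reserved_bytes)


if __name__ == "__main__":
    GB = 1024 ** 3
    # Hypothetical numbers: 50 GB free with no reserve leaves all 50 GB usable.
    print(remaining_after_reserve(50 * GB, 0) // GB)
    # A reserve larger than the free space leaves nothing usable.
    print(remaining_after_reserve(50 * GB, 60 * GB))
```

Running check_data_dir against the actual data directory with the configured reserve would show whether this particular explanation is even plausible on a disk with 50GB free.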
The only explanation that would really make sense in that context is running out of space...which isn't happening. If this were a permission error, or a configuration error, or anything weird like that, then the whole system would never get up and running in the first place.

Why would a properly running hadoop system start exhibiting this error without running out of disk space? THAT's the real question on the table here.

Any ideas?

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy."
                                           --  Edwin A. Abbott, Flatland
________________________________________________________________________________