I have been storing log data into an HDFS cluster (just one datanode at the moment) with a 4 GB block size. It worked fine at the beginning, but now my individual file sizes have grown past 2 GB and I can no longer access those files from the HDFS cluster. This seems to happen whenever the file size is over 2 GB; all files under 2 GB work fine. There has always been enough disk space, and time doesn't seem to be a factor (for example, 2008-11-24 doesn't work, but 2008-12-05 works).
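For reference, here is roughly how such a file gets created (illustrative only: in my real setup the 4 GB block size comes from the dfs.block.size property in hadoop-site.xml rather than being passed per file, and the path, replication and buffer size below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBigBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // 4 GB blocks, matching the cluster-wide dfs.block.size setting
            long blockSize = 4L * 1024 * 1024 * 1024;
            FSDataOutputStream out = fs.create(
                new Path("/events/eventlog/eventlog-test"), // placeholder path
                true,       // overwrite
                4096,       // io buffer size
                (short) 1,  // replication: single datanode
                blockSize);
            out.writeBytes("log line\n");
            out.close();
        }
    }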
"hadoop dfs -lsr /events/eventlog" -rw-r--r-- 1 garo supergroup 2177143062 2008-11-25 04:04 /events/eventlog/eventlog-2008-11-24 (doesn't work) -rw-r--r-- 1 garo supergroup 2121109956 2008-12-06 04:04 /events/eventlog/eventlog-2008-12-05 (works) Note that 2008-12-05 filesize is less than 2^31 but 2008-11-24 is larger than 2^31 (2 GB) Example: [g...@postmetal tmp]$ hadoop dfs -get /events/eventlog/eventlog-2008-11-24 . get: null Error log: ==> hadoop-garo-datanode-postmetal.pri.log <== 2008-12-22 10:52:12,325 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1049869337-10.157.67.82-50010-1221647796455, infoPort=50075, ipcPort=50020):DataXceiver: java.lang.IndexOutOfBoundsException at java.io.DataInputStream.readFully(DataInputStream.java:175) at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1821) at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967) at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037) at java.lang.Thread.run(Thread.java:619) Datanode web interface for url: http://postmetal.pri:50075/browseBlock.jsp?blockId=-7907060692488773710&blockSize=2177143062&genstamp=6286&filename=/events/eventlog/eventlog-2008-11-24&datanodePort=50010&namenodeInfoPort=50070 displays this: Total number of blocks: 1 -7907060692488773710: 127.0.0.1:50010 Is this a known problem? Has hadoop ever been tested with block sizes over 2GB? Are my files corrupted (I do have working backups in non-hadoop system). If this is the case and hadoop doesn't support such big block sizes then there should be a clear error message when trying to add files with big block sizes. Or is the problem not in block size but in some other place? - Juho Mäkinen