I have been storing log data into an HDFS cluster (just one datanode at the moment) with a 4 GB block size. It worked fine at the beginning, but now my individual file sizes have grown past 2 GB and I can no longer access those files from the HDFS cluster. This seems to happen whenever the file size is over 2 GB; all files under 2 GB work fine. There has always been enough disk space, and time doesn't seem to be a factor (for example, 2008-11-24 doesn't work, but 2008-12-05 works).
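For reference, here is roughly how such a file gets created (illustrative only: in my real setup the 4 GB block size comes from the dfs.block.size property in hadoop-site.xml rather than being passed per file, and the path, replication and buffer size below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBigBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // 4 GB blocks, matching the cluster-wide dfs.block.size setting
            long blockSize = 4L * 1024 * 1024 * 1024;
            FSDataOutputStream out = fs.create(
                new Path("/events/eventlog/eventlog-test"), // placeholder path
                true,       // overwrite
                4096,       // io buffer size
                (short) 1,  // replication: single datanode
                blockSize);
            out.writeBytes("log line\n");
            out.close();
        }
    }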
"hadoop dfs -lsr /events/eventlog" -rw-r--r-- 1 garo supergroup 2177143062 2008-11-25 04:04 /events/eventlog/eventlog-2008-11-24 (doesn't work) -rw-r--r-- 1 garo supergroup 2121109956 2008-12-06 04:04 /events/eventlog/eventlog-2008-12-05 (works) Note that 2008-12-05 filesize is less than 2^31 but 2008-11-24 is larger than 2^31 (2 GB) Example: [g...@postmetal tmp]$ hadoop dfs -get /events/eventlog/eventlog-2008-11-24 . get: null Error log: ==> hadoop-garo-datanode-postmetal.pri.log <== 2008-12-22 10:52:12,325 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1049869337-10.157.67.82-50010-1221647796455, infoPort=50075, ipcPort=50020):DataXceiver: java.lang.IndexOutOfBoundsException at java.io.DataInputStream.readFully(DataInputStream.java:175) at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1821) at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967) at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037) at java.lang.Thread.run(Thread.java:619) Datanode web interface for url: http://postmetal.pri:50075/browseBlock.jsp?blockId=-7907060692488773710&blockSize=2177143062&genstamp=6286&filename=/events/eventlog/eventlog-2008-11-24&datanodePort=50010&namenodeInfoPort=50070 displays this: Total number of blocks: 1 -7907060692488773710: 127.0.0.1:50010 Is this a known problem? Has hadoop ever been tested with block sizes over 2GB? Are my files corrupted (I do have working backups in non-hadoop system). If this is the case and hadoop doesn't support such big block sizes then there should be a clear error message when trying to add files with big block sizes. Or is the problem not in block size but in some other place? - Juho Mäkinen