On Feb 9, 2009, at 7:50 PM, jason hadoop wrote:
The other issue you may run into, with many files in your HDFS, is that you may end up with more than a few 100k worth of blocks on each of your datanodes. At present this can lead to instability due to the way the periodic block reports to the namenode are handled. The more blocks per datanode, the larger the risk of congestion collapse in your HDFS.
Of course, if you stay below, say, 500k, you don't have much of a risk of congestion.
In our experience, 500k blocks or less is going to be fine with decent hardware. Between 500k and 750k, you will hit a wall somewhere, depending on your hardware. Good luck getting anything above 750k.
The recommendation is that you keep this number as low as possible -- and explore the limits of your system and hardware in testing before you discover them in production :)
Brian
On Mon, Feb 9, 2009 at 5:11 PM, Bryan Duxbury <br...@rapleaf.com> wrote:
Correct.
+1 to Jason's suggestion of more unix file handles. That's a must-have.
-Bryan
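For the unix file handle side, one common way to raise the limit is an entry in /etc/security/limits.conf for the user that runs the datanode. This is only a rough sketch -- the user name and value below are placeholders, not settings from this thread:

  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768

The limit currently in effect for a shell can be checked with "ulimit -n".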
On Feb 9, 2009, at 3:09 PM, Scott Whitecross wrote:
This would be an addition to the hadoop-site.xml file, to up dfs.datanode.max.xcievers?
Thanks.
On Feb 9, 2009, at 5:54 PM, Bryan Duxbury wrote:
Small files are bad for hadoop. You should avoid keeping a lot of small files if possible.
That said, that error is something I've seen a lot. It usually happens when the number of xcievers hasn't been adjusted upwards from the default of 256. We run with 8000 xcievers, and that seems to solve our problems. I think that if you have a lot of open files, this problem happens a lot faster.
-Bryan
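For reference, the setting discussed above would look something like this in hadoop-site.xml. The property name is the one given in this thread, and 8000 is the value Bryan describes; pick a value appropriate for your own cluster:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>8000</value>
  </property>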
On Feb 9, 2009, at 1:01 PM, Scott Whitecross wrote:
Hi all -
I've been running into this error the past few days:
java.io.IOException: Could not get block locations. Aborting...
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
It seems to be related to trying to write too many files to HDFS. I have a class extending org.apache.hadoop.mapred.lib.MultipleOutputFormat, and if I output to a few file names, everything works. However, if I output to thousands of small files, the above error occurs. I'm having trouble isolating the problem, as it doesn't occur in the debugger, unfortunately.
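For illustration, a minimal sketch of the kind of subclass described above -- not the original code, and using the MultipleTextOutputFormat convenience subclass with Text keys and values as an assumption; generateFileNameForKeyValue is the hook that picks the output file for each record:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  // Routes each record to a file named after its key. With thousands of
  // distinct keys this opens thousands of concurrent HDFS writers, which
  // is the pattern that runs into the xciever/file handle limits above.
  public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
      // "name" is the default part-xxxxx file name; prefixing it with the
      // key gives each distinct key its own output path.
      return key.toString() + "/" + name;
    }
  }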
Is this a memory issue, or is there an upper limit to the number of files HDFS can hold? Any settings to adjust?
Thanks.