On Feb 9, 2009, at 7:50 PM, jason hadoop wrote:

The other issue you may run into with many files in your HDFS is that you
may end up with more than a few hundred thousand blocks on each of your
datanodes. At present this can lead to instability because of the way the
periodic block reports to the namenode are handled: the more blocks per
datanode, the larger the risk of congestion collapse in your HDFS.

Of course, if you stay below, say, 500k blocks per node, you don't run much risk of congestion.

In our experience, 500k blocks or less is going to be fine with decent hardware. Between 500k and 750k, you will hit a wall somewhere depending on your hardware. Good luck getting anything above 750k.
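
To put made-up numbers on it: 10 million single-block files at the default replication of 3, spread across 50 datanodes, already works out to roughly 600k block replicas per node, so small files get you into that range surprisingly quickly.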

The recommendation is that you keep this number as low as possible -- and explore the limits of your system and hardware in testing before you discover them in production :)

Brian



On Mon, Feb 9, 2009 at 5:11 PM, Bryan Duxbury <br...@rapleaf.com> wrote:

Correct.

+1 to Jason's suggestion about more Unix file handles. That's a must-have.
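
In case it's useful to anyone: that means raising the nofile limit for whichever user runs the Hadoop daemons. On most Linux boxes something along these lines in /etc/security/limits.conf (the "hadoop" user name and the values here are just an example) plus a fresh login does it:

    # /etc/security/limits.conf -- example entries, tune for your cluster
    hadoop  soft  nofile  16384
    hadoop  hard  nofile  16384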

-Bryan


On Feb 9, 2009, at 3:09 PM, Scott Whitecross wrote:

This would be an addition to the hadoop-site.xml file, to up
dfs.datanode.max.xcievers?

Thanks.



On Feb 9, 2009, at 5:54 PM, Bryan Duxbury wrote:

Small files are bad for Hadoop. You should avoid keeping a lot of small
files if possible.

That said, that error is something I've seen a lot. It usually happens when the number of xcievers hasn't been adjusted upwards from the default of 256. We run with 8000 xcievers, and that seems to solve our problems. I think that if you have a lot of open files, this problem happens a lot
faster.
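
For reference, the entry goes in hadoop-site.xml on the datanodes (8000 is just the value we happen to use -- tune it for your cluster) and takes effect after a datanode restart:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>8000</value>
    </property>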

-Bryan

On Feb 9, 2009, at 1:01 PM, Scott Whitecross wrote:

Hi all -

I've been running into this error the past few days:
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

It seems to be related to trying to write too many files to HDFS. I have a class extending org.apache.hadoop.mapred.lib.MultipleOutputFormat, and if I output to a few file names, everything works. However, if I output to thousands of small files, the above error occurs. I'm having trouble isolating the problem, as it doesn't occur in the debugger,
unfortunately.
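
For context, the override is along these lines -- a stripped-down illustration with made-up names, not my actual code (the sketch extends the MultipleTextOutputFormat convenience subclass just to keep it short):

    // Simplified sketch: one output file per distinct key is what drives the file count up.
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class PerKeyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // The default implementation just returns "name"; returning the key instead
        // creates a separate (usually small) HDFS file for every distinct key,
        // each holding its own open output stream while the reducer runs.
        return key.toString();
      }
    }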

Is this a memory issue, or is there an upper limit to the number of
files HDFS can hold?  Any settings to adjust?

Thanks.






