So I have done more digging on this subject... There is another problem if many files are kept open at the same time: once you read some data from an HDFS file by calling FSInputStream.read(byte[] buf, int off, int len), a tcp connection is set up between HdfsBroker and the DataNode that holds the file block, and this connection is kept until you read into another block of the file (64MB per block by default) or close the file entirely. There is a timeout on the server side, but I see no corresponding timeout on the client side. So you quickly end up with a lot of idle connections between the HdfsBroker and many DataNodes.
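To make the access pattern concrete, here is a minimal sketch of the kind of streaming read I mean, against the public FileSystem API (the class name and path are made up for illustration, and it assumes Hadoop is on the classpath and an HDFS is reachable):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamingReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/hypertable/example/cs0");  // hypothetical cell store path
        FSDataInputStream in = fs.open(path);
        byte[] buf = new byte[1];
        in.seek(0);
        // Even this 1-byte read sets up a tcp connection to the DataNode that
        // owns block 0, and the connection stays up until the stream crosses
        // into the next block or is closed.
        in.read(buf, 0, buf.length);
        // ... if the scanner goes idle here, the connection just sits there ...
        in.close();  // only now is the DataNode connection torn down
      }
    }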
What's even worse, no matter how many bytes the application wants to read, the HDFS client library always asks the chosen DataNode to send all the remaining bytes of the block. This means that if you read 1 byte from the beginning of a block, the DataNode actually gets a request to send the whole block, of which only the first few bytes are read. The consequences, if the client then reads nothing for a long while, are: 1) the kernel tcp send queue on the DataNode side and the tcp receive queue on the client side quickly fill up; 2) the DataNode Xceiver thread (there is a maximum of 256 by default) stays blocked. Eventually the Xceiver times out and closes the connection, but the FIN packet cannot reach the client side because the send and receive queues are still full. Here is what I observe on one node of our test cluster:

$ netstat -ntp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (w/o servers)
Proto Recv-Q  Send-Q  Local Address        Foreign Address      State        PID/Program name
tcp        0  121937  10.65.25.150:50010   10.65.25.150:38595   FIN_WAIT1    -
[...]
tcp    74672       0  10.65.25.150:38595   10.65.25.150:50010   ESTABLISHED  32667/java
[...]
(and hundreds of other connections in the same states)

Possible solutions without modifying the hadoop client library are: 1) open-read-close the file stream every time the cell store is accessed; 2) always use positioned read, read(long position, byte[] buf, int off, int len), instead, because pread doesn't keep the tcp connection to the DataNode.

Solution 1 is not scalable because every open() operation involves an interaction with the HDFS NameNode, which immediately becomes a bottleneck: in our test cluster the NameNode can only handle hundreds of parallel open() requests per second, with an average delay of 2-3ms. I haven't tested the performance of solution 2 yet; I will put up some numbers tomorrow.
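As a rough sketch of what solution 2 looks like against the same public API (again the class name, path and offset are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PreadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/hypertable/example/cs0");  // hypothetical cell store path
        FSDataInputStream in = fs.open(path);  // the stream itself can stay cached
        byte[] buf = new byte[4096];
        // Positioned read: the client fetches just this range and does not
        // leave a streaming connection to the DataNode behind.
        int nread = in.read(128L * 1024, buf, 0, buf.length);
        System.out.println("pread returned " + nread + " bytes");
        in.close();
      }
    }

The open() still goes to the NameNode once, but after that each pread is self-contained, which as far as I can tell is the property we want for thousands of mostly idle cell stores.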
Donald

On Feb 25, 8:59 pm, "Liu Kejia (Donald)" <[email protected]> wrote:
> It turns out the hadoop-default.xml packaged in my custom
> hadoop-0.19.0-core.jar has set the "io.file.buffer.size" to 131072 (128KB),
> which means DfsBroker has to open a 128KB buffer for every open file. The
> official hadoop-0.19.0-core.jar sets this value to 4096, which is more
> reasonable for applications like Hypertable.
>
> Donald
>
> On Fri, Feb 20, 2009 at 11:55 AM, Liu Kejia (Donald) <[email protected]> wrote:
>
> > Caching might not work very well because keys are randomly generated,
> > resulting in bad locality...
> > Even if it's Java, hundreds of kilobytes per file object is still very big.
> > I'll profile HdfsBroker to see what exactly is using so much memory, and
> > post the results later.
>
> > Donald
>
> > On Fri, Feb 20, 2009 at 11:20 AM, Doug Judd <[email protected]> wrote:
>
> >> Hi Donald,
>
> >> Interesting. One possibility would be to have an open CellStore cache.
> >> Frequently accessed CellStores would remain open, while seldom used ones
> >> get closed. The effectiveness of this solution would depend on the
> >> workload. Do you think this might work for your use case?
>
> >> - Doug
>
> >> On Thu, Feb 19, 2009 at 7:09 PM, donald <[email protected]> wrote:
>
> >>> Hi all,
>
> >>> I recently ran into the problem that HdfsBroker throws an out of memory
> >>> exception because too many CellStore files in HDFS are kept open - I
> >>> have over 600 ranges per range server, with a maximum of 10 cell
> >>> stores per range; that's 6,000 open files at the same time, making
> >>> HdfsBroker take gigabytes of memory.
>
> >>> If we open the CellStore file on demand, i.e. when a scanner is
> >>> created on it, this problem is gone. However, random-read performance
> >>> may drop due to the overhead of opening a file in HDFS. Any better
> >>> solution?
>
> >>> Donald
