Hi Luke, I changed the seek-read logic in HdfsBroker to use HDFS's build-in pread API last week on a 11-node cluster. As expected the number of socket connections per node drops drastically from hundreds to about 20. Meanwhile there is no significant change in performance.
I am reading the HDFS 0.19 code and doing some tests these days, results show HDFS create/write/close fails with high probability when some nodes are under heavy load, especially when the number of nodes is small. I'll post my analysis in details later in a new thread. Donald On Mar 8, 4:25 pm, "Liu Kejia (Donald)" <[email protected]> wrote: > Get it! I was so careless not to find the problem of HdfsBroker... > I'll give it a try and post the result soon. Thanks very much. > > Donald > > On Sun, Mar 8, 2009 at 4:15 PM, Luke <[email protected]> wrote: > > > On Mar 8, 12:06 am, "Liu Kejia (Donald)" <[email protected]> wrote: > > > read(byte[] buf, int off, int len) has much better performance for large > > > scans (e.g. while doing a merging compaction), if using 128KB buffers, it > > > can make 80MB/s throughput, while read(long pos, byte[] buf, int off, int > > > len) can only achieve 40MB/s. To make best utilization of HDFS > > performance, > > > we should use positioned read for random read/small scans to achieve > > shorter > > > response time. OTOH use plain read for large scans to have better > > > throughput, and close-reopen the file after the scanner is destroyed to > > > avoid socket congestion and fd leak. > > > I then read the current CellCacheScanner code roughly and found it > > already > > > has dealt with this issue! I guess Doug's already aware of this long > > before? > > > Now I am all confused that we still have hundreds of tcp connections in > > > FIN_WAIT1 state on every DataNode, I really wonder what's the real cause. > > > Yes, Doug's already using the right Hypertable::Filesystem API. The > > problem is the implementation of PositionRead in HdfsBroker. It is > > actually using seek and plain read. Doug just made the change on > > Friday to use positioned read, he didn't see any performance > > difference in brief tests. > > > I wonder if you can do us a favor and test the change on the cluster. > > > __Luke > > > > Donald > > > > On Sat, Mar 7, 2009 at 4:01 AM, Luke <[email protected]> wrote: > > > > > Sorry, I didn't read your posts fully (was in a hurry). DfsBroker does > > > > have pread interface, which is used by range server for random reads. > > > > We just need to fix our HdfsBroker (PositionRead) to use HDFS > > > > positioned read interface. Can you check if that improve things for > > > > you? > > > > > On Mar 6, 11:37 am, Luke <[email protected]> wrote: > > > > > It appears that HDFS does have pread like interface: readFully(pos, > > > > > buf, len). Can you run the tests again using this API and see if > > > > > things improve? > > > > > > On Mar 6, 11:18 am, Luke Lu <[email protected]> wrote: > > > > > > > Great analysis Donald! Thanks for the numbers. It seems to me the > > > > > > right fix would be enhance the HDFS client library to add a pread > > like > > > > > > > interface to do the right thing for random reads. Maybe you want to > > > > > > file a Hadoop jira ticket for that? > > > > > > > __Luke > > > > > > > On Mar 6, 2009, at 3:15 AM, Liu Kejia (Donald) wrote: > > > > > > > > On Thu, Mar 5, 2009 at 11:50 PM, donald <[email protected]> > > > > wrote: > > > > > > > > So I have done more digging on this subject... > > > > > > > > There is another problem if many files are kept open at the same > > > > time: > > > > > > > once you read some data from a HDFS file by calling > > > > FSInputStream.read > > > > > > > (byte[] buf, int off, int len), a tcp connection between > > HdfsBroker > > > > > > > and the DataNode that contains the file block is set up, this > > > > > > > connection is kept until you read another block (by default 64MB > > in > > > > > > > size) of the file, or close the file entirely. There is a timeout > > on > > > > > > > the server side, but I see no clue on the client side. So you > > quickly > > > > > > > end up with a lot of idle connections between the HdfsBroker and > > many > > > > > > > DataNodes. > > > > > > > > What's even worse, no matter how many bytes the application wants > > to > > > > > > > read, the HDFS client library always requests the the chosen > > DataNode > > > > > > > to send all the remaining bytes of the block. Which means if you > > read > > > > > > > 1 byte from the beginning of a block, the DataNode actually gets > > the > > > > > > > request of sending the whole block, of which only the first few > > bytes > > > > > > > are read. Consequences are: if the client reads nothing for quite > > a > > > > > > > long while, 1) the kernel tcp send queue on the DataNode side and > > the > > > > > > > tcp receive queue on the client side are quickly fed up; 2) the > > > > > > > DataNode Xceiver thread (there is a max count of 256 by default) > > is > > > > > > > blocked. Eventually the Xceiver timeouts, and closes the > > connection. > > > > > > > However this FIN packet cannot reach client side as send&receive > > > > > > > queues are still blocked. Here is what I observe from one node of > > our > > > > > > > test cluster: > > > > > > > $ netstat -ntp > > > > > > > (Not all processes could be identified, non-owned process info > > > > > > > will not be shown, you would have to be root to see it all.) > > > > > > > Active Internet connections (w/o servers) > > > > > > > Proto Recv-Q Send-Q Local Address Foreign > > > > > > > Address State PID/Program name > > > > > > > tcp 0 121937 10.65.25.150:50010 > > > > > > > 10.65.25.150:38595 FIN_WAIT1 - > > > > > > > [...] > > > > > > > tcp 74672 0 10.65.25.150:38595 > > > > > > > 10.65.25.150:50010 ESTABLISHED 32667/java > > > > > > > [...] > > > > > > > (and hundreds of other connections in the same states) > > > > > > > > Possible solutions without modifying hadoop client library are: > > 1) > > > > > > > open-read-close the file stream every time the cell store is > > > > accessed; > > > > > > > 2) always use postioned read: read(long position, byte[] buf, int > > > > off, > > > > > > > int len) instead, because pread doesn't keep the tcp connection > > with > > > > > > > DataNodes. Solution 1 is not scalable because every open() > > operation > > > > > > > includes interaction with HDFS NameNode, which immediately > > becomes a > > > > > > > bottleneck: in our test cluster the NameNode can only handle > > hundreds > > > > > > > of parallel open() request per second, with an average delay of > > > > 2-3ms. > > > > > > > I haven't tested the performance of solution 2 yet, I will put up > > > > some > > > > > > > numbers tomorrow. > > > > > > > > Donald > > > > > > > > I've created 1000 files in a 11-node hadoop cluster, each file is > > > > > > > 20MB. Then I wrote simple java programs to do the following > > tests: > > > > > > > > Opening all 1000 files, one process: about 2.5 s (2.5ms latency) > > > > > > > Closing all 1000files, one process: 50ms > > > > > > > Opening all 1000 files, 10 processes (running distributedly on > > the > > > > > > > 10 datanodes): 15 s (15ms latency, or 700 opens/s) > > > > > > > Reading the first 1KB data from each file (plain read), one > > process: > > > > > > > > 6s (6ms latency) > > > > > > > Reading the first 1KB data from each file (positioned read), one > > > > > > > process: 2.5s (2.5ms latency) > > > > > > > Reading the first 100KB data from each file, 1KB at a time > > > > > > > (positioned read), one process: 130s (1.3ms latency, or 0.77MB/s) > > > > > > > Reading the first 100KB data from each file, 1KB at a time (plain > > > > > > > read), one process: 8.8s (11MB/s) > > > > > > > > The tests are done multiple times to make sure all blocks are > > > > > > > effectively cached in Linux page cache.The hadoop version was > > 0.19.0 > > > > > > > > with a few patches. io.file.buffer.size = 4096 > > > > > > > > Donald > > > > > > > > On Feb 25, 8:59 pm, "Liu Kejia (Donald)" <[email protected]> > > > > wrote: > > > > > > > > It turns out the hadoop-default.xml packaged in my custom > > > > > > > > hadoop-0.19.0-core.jar has set the "io.file.buffer.size" to > > 131072 > > > > > > > > (128KB), > > > > > > > > which means DfsBroker has to open a 128KB buffer for every open > > > > > > > file. The > > > > > > > > official hadoop-0.19.0-core.jar sets this value to 4096, which > > is > > > > > > > more > > > > > > > > reasonable for applications like Hypertable. > > > > > > > > Donald > > > > > > > > > On Fri, Feb 20, 2009 at 11:55 AM, Liu Kejia (Donald) > > > > > > > > <[email protected]>wrote: > > > > > > > > > > Caching might not work very well because keys are randomly > > > > > > > generated, > > > > > > > > > resulting in bad locality... > > > > > > > > > Even it's Java, hundreds of kilobytes per file object is > > still > > > > > > > very big. > > > > > > > > > I'll profile HdfsBroker to see what exactly is using so much > > > > > > > memory, and > > > > > > > > > post the results later. > > > > > > > > > > Donald > > > > > > > > > > On Fri, Feb 20, 2009 at 11:20 AM, Doug Judd > > > > > > > <[email protected]> wrote: > > > > > > > > > >> Hi Donald, > > > > > > > > > >> Interesting. One possibility would be to have an open > > > > > > > CellStore cache. > > > > > > > > >> Frequently accessed CellStores would remain open, while > > seldom > > > > > > > used ones get > > > > > > > > >> closed. The effectiveness of this solution would depend on > > the > > > > > > > > workload. > > > > > > > > >> Do you think this might work for your use case? > > > > > > > > > >> - Doug > > > > > > > > > >> On Thu, Feb 19, 2009 at 7:09 PM, donald < > > [email protected]> > > > > > > > > wrote: > > > > > > > > > >>> Hi all, > > > > > > > > > >>> I recently run into the problem that HdfsBroker throws out > > of > > > > > > > memory > > > > > > > > >>> exception, because too many CellStore files in HDFS are > > kept > > > > > > > open - I > > > > > > > > >>> have over 600 ranges per range server, with a maximum of 10 > > > > cell > > > > > > > > >>> stores per range, that'll be 6,000 open files at the same > > > > > > > time, making > > > > > > > > >>> HdfsBroker to take gigabytes of memory. > > > > > > > > > >>> If we open the CellStore file on demand, i.e. when a > > scanner is > > > > > > > > >>> created on it, this problem is gone. However random-read > > > > > > > performance > > > > > > > > >>> may drop due to the the overhead of opening a file in HDFS. > > > > > > > Any better > > ... > > read more » --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Hypertable Development" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/hypertable-dev?hl=en -~----------~----~----~----~------~----~------~--~---
