[jira] [Commented] (HDFS-1323) Pool/share file channels for HDFS read
[ https://issues.apache.org/jira/browse/HDFS-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051066#comment-13051066 ]

Jay Booth commented on HDFS-1323:

I could take a crack at it next week/weekend. Will get my build set up and report back in a week or two.

Pool/share file channels for HDFS read

Key: HDFS-1323
URL: https://issues.apache.org/jira/browse/HDFS-1323
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Jay Booth
Attachments: hdfs-1323-20100730.patch, hdfs-1323-trunk.txt

Currently, all reads in HDFS require opening and closing the underlying block/meta filechannels. We could pool these filechannels and save some system calls and other work. Since HDFS read requests can be satisfied by positioned reads and transferTos, we can even share these filechannels between concurrently executing requests. The attached patch was benchmarked as part of work on HDFS-918 and exhibited a 10% performance increase for small random reads. This does not affect client logic and involves minimal change to server logic. Patch is based on branch 20-append.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
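The sharing described above works because positioned reads (`FileChannel.read(ByteBuffer, long)`, a pread(2)-style call) neither use nor update the channel's file position, so one open channel can safely serve concurrent requests. A minimal sketch of the idea; the `BlockChannelPool` name appears in later HDFS-918 comments, but this code is illustrative, not the actual patch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative pool: one shared read-only FileChannel per block file.
// Positioned reads don't touch the channel's own position, so concurrent
// readers on the same channel don't race each other.
public class BlockChannelPool {
    private final Map<Path, FileChannel> channels = new ConcurrentHashMap<>();

    public FileChannel get(Path blockFile) {
        return channels.computeIfAbsent(blockFile, p -> {
            try {
                return FileChannel.open(p, StandardOpenOption.READ);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }

    // Positioned read: safe to call from many threads on the same channel.
    public int read(Path blockFile, ByteBuffer dst, long offset) throws IOException {
        return get(blockFile).read(dst, offset);
    }
}
```

A real pool would also bound its size and evict idle channels (the actual patch exposes cleanup-related config values for this); the sketch omits that.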
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977544#action_12977544 ]

Jay Booth commented on HDFS-918:

Hey all, sorry for the slow response, been swamped with the new year and all.

RE: unit tests, at one point it was passing all tests; not sure if the tests changed or this changed, but I can take a look at it.

RE: 0.23, I can look at forward porting this again, but a lot of changes have gone in since then. @stack, were you testing the pooling-only patch or the full multiplexing patch? Pooling only would be much simpler to forward port, although I do think that the full multiplexing patch is pretty worthwhile. Aside from the small-but-significant performance gain, it was IMO much better factoring to have the DN-side logic all encapsulated in a Connection object which has sendPacket() repeatedly called, rather than a giant procedural loop that goes down and back up through several classes. The architecture also made keepalive pretty straightforward: just throw that connection back into a listening pool when done, and make corresponding changes on the client side. But I guess that logic's been revised now anyways, so it'd be a significant piece of work to bring it all back up to date.
Use single Selector and small thread pool to replace many instances of BlockSender for reads

Key: HDFS-918
URL: https://issues.apache.org/jira/browse/HDFS-918
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Jay Booth
Assignee: Jay Booth
Fix For: 0.22.0
Attachments: hbase-hdfs-benchmarks.ods, hdfs-918-20100201.patch, hdfs-918-20100203.patch, hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, hdfs-918-branch20-append.patch, hdfs-918-branch20.2.patch, hdfs-918-pool.patch, hdfs-918-TRUNK.patch, hdfs-multiplex.patch

Currently, on read requests, the DataXCeiver server allocates a new thread per request, which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage by the sending threads. If we had a single selector and a small threadpool to multiplex request packets, we could theoretically achieve higher performance while taking up fewer resources and leaving more CPU on datanodes available for mapred, hbase or whatever. This can be done without changing any wire protocols.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1323) Pool/share file channels for HDFS read
Pool/share file channels for HDFS read

Key: HDFS-1323
URL: https://issues.apache.org/jira/browse/HDFS-1323
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Jay Booth
Fix For: 0.20-append, 0.22.0

Currently, all reads in HDFS require opening and closing the underlying block/meta filechannels. We could pool these filechannels and save some system calls and other work. Since HDFS read requests can be satisfied by positioned reads and transferTos, we can even share these filechannels between concurrently executing requests. The attached patch was benchmarked as part of work on HDFS-918 and exhibited a 10% performance increase for small random reads. This does not affect client logic and involves minimal change to server logic. Patch is based on branch 20-append.
[jira] Updated: (HDFS-1323) Pool/share file channels for HDFS read
[ https://issues.apache.org/jira/browse/HDFS-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-1323:

Attachment: hdfs-1323-20100730.patch
[jira] Commented: (HDFS-1323) Pool/share file channels for HDFS read
[ https://issues.apache.org/jira/browse/HDFS-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894072#action_12894072 ]

Jay Booth commented on HDFS-1323:

Correction: the patch created a 10% performance increase for HBase random GETs. It was probably a larger percentage of the read operation itself, if you don't include other work done by HBase.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hbase-hdfs-benchmarks.ods

Benchmarked on EC2 this weekend. I set up 0.20.2-append clean, a copy with my multiplex patch applied, and a third copy which only ports filechannel pooling to the current architecture (can submit that patch later, it's at home). All runs were with HBase block caching disabled to highlight the difference in filesystem access speeds. This is running across a fairly small dataset (a little less than 1GB), so all files are presumably in memory for the majority of the test duration. Each run involved 6 clients reading 1,000,000 rows each, divided over 10 mappers. Cluster setup was 3x EC2 High-CPU XL: 1 NN/JT/ZK/Master and 2x DN/TT/RS. Ran in 3 batches of 3 runs each; the cluster was restarted between batches for each run type because we're changing the DN implementation.

Topline numbers (the rest are in the document):

Total run averages (ms):

Test        clean        pool         multiplex
random      21159050.44  19448216.89  16806247
scan        436106.89    442452.54    443262.56
sequential  19298239.78  17871047.67  14987028.44

Pool is a 7.5% gain, multiplex is more like 20% for random reads.

Batches 2+3 only (batch 1 was a little messed up and doesn't track with the others):

Test        clean        pool         multiplex
random      20555308.67  18425017     16987643.33
scan        426849       427277.98    448031
sequential  18665323.67  16969885.83  15102404

Pool is a 10% gain, multiplex is 17% or so for random reads.

Per row for random read (batches 2+3 only): clean: 3.42ms, pool: 3.07ms, multiplex: 2.83ms
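The benchmark arithmetic above is internally consistent if the totals are milliseconds summed across 6 clients x 1,000,000 rows = 6,000,000 rows (an inference from the per-row figures, not stated in the comment). A quick sanity check:

```java
// Sanity-check of the random-read benchmark numbers (batches 2+3).
// Assumption (mine, not the comment's): totals are ms over 6,000,000 rows.
public class BenchMath {
    public static void main(String[] args) {
        double rows = 6_000_000;
        double clean = 20555308.67, pool = 18425017, multiplex = 16987643.33;
        System.out.printf("clean ms/row: %.2f%n", clean / rows);                          // ~3.43
        System.out.printf("pool gain: %.1f%%%n", 100 * (clean - pool) / clean);           // ~10.4
        System.out.printf("multiplex gain: %.1f%%%n", 100 * (clean - multiplex) / clean); // ~17.4
    }
}
```

These match the quoted "10% gain" for pooling and "17% or so" for multiplexing, and the ~3.42ms/row figure for clean.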
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-branch20-append.patch

Managed to get back to this. Rebased on branch-20-append. Fixed the resource leak issue that apurtell identified. Runs through HBase PerformanceEvaluation on my workstation completely with the default ulimit of 1024, no crashes. I'm going to try to benchmark this on a real cluster this weekend and report results. Happy Friday, everyone.
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852610#action_12852610 ]

Jay Booth commented on HDFS-918:

Am I ever! This one should be good to go, but I'm on my way out right now and won't be around to help if it breaks again. If you wanna give it a whirl, be my guest :)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: (was: hdfs-918-branch20.2.patch)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-branch20.2.patch

Straightened out the "block not found" issue with Andrew (that was on his end), but then he found a resource leak that's fixed here. I'll post a trunk patch which incorporates this fix and the previous fix shortly.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-TRUNK.patch

Trunk patch with previous fixes.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-branch20.2.patch

Cleaned up a bug in the BlockChannelPool.cleanup() code, added a new unit test, and improved descriptions of the new config values (useMultiplex, packetSize, maxOpenBlockChannels, minOpenBlockChannels (the number to cleanup() down to)). This patch is for branch 20; I'll post a new one against trunk tonight.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: (was: hdfs-200+826+918-branch20.patch)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-200+826+918-branch20.patch

I heard all the cool kids are running HDFS-200 and HDFS-826 on their 0.20.2 installations these days, so I merged HDFS-918 with them. Also, nobody should use the existing 0.20.2 patch; I'll delete it now and post a new one tonight -- it happens to be missing a very important Thread.start() invocation.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: (was: hdfs-918-0.20.2.patch)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-0.20.2.patch

0.20.2-compatible patch! A couple of people mentioned that it would be much easier for them to benchmark if I produced an 0.20.2-compatible patch. So here it is: it works, it seems to pass all the unit tests I ran against it, and I even did a hadoop fs -put and hadoop fs -cat. But that's the entire extent of the testing -- unit tests and a super-simple pseudo-distributed operation. So anyone who wants to try this on some I/O-bound jobs on a test 0.20.2 cluster and see if they get speedups, please feel free and report results.
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844596#action_12844596 ]

Jay Booth commented on HDFS-918:

bq. I think it is very important to have separate pools for each partition. Otherwise, each disk will be accessed only as much as the slowest disk (when DN has enough load).

This would be the case if I were using a fixed-size thread pool and a LinkedBlockingQueue -- but I'm not, see Executors.newCachedThreadPool(). It's actually bounded at Integer.MAX_VALUE threads and uses a SynchronousQueue. If a new thread is needed in order to start work on a task immediately, it's created; otherwise, an existing waiting thread is re-used. (Threads are purged if they've been idle for 60 seconds.) Either way, the underlying I/O request is dispatched pretty much immediately after the connection is writable. So I don't see why separate pools per partition would help anything -- the operating system will handle I/O requests as it can and put threads into runnable state as it can, regardless of which pool they're in.

RE: Netty, I'm not very knowledgeable about it beyond the Cliff's Notes version, but my code dealing with the Selector is pretty small -- the main loop is under 75 lines, and java.util.concurrent does most of the heavy lifting. Most of the code is dealing with application and protocol specifics. So my instinct in general is that adding a framework may actually increase the amount of code, especially if there are any mismatches between what we're doing and what it wants us to do (the packet-header, sums data, main data format is pretty specific to us). Plus, as Todd said, we can't really change the blocking I/O nature of the main accept() loop in DataXceiverServer without this becoming a much bigger patch, although I agree that we should go there in general.

That being said, better is better, so if a Netty implementation took up fewer lines of code and performed better, then that speaks for itself.
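The dispatch behavior being claimed (no task ever waits behind another in a queue) follows from how `Executors.newCachedThreadPool()` is built: a `SynchronousQueue` hand-off plus an effectively unbounded thread count. A small demonstration, with names of my own choosing:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Shows that a cached thread pool starts every submitted task immediately,
// creating threads on demand, rather than queueing behind a fixed pool size.
public class PoolDemo {
    // Submit n tasks that all block; return true iff all n run concurrently.
    static boolean allStartImmediately(int n) throws InterruptedException {
        ExecutorService cached = Executors.newCachedThreadPool();
        CountDownLatch started = new CountDownLatch(n);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < n; i++) {
            cached.submit(() -> {
                started.countDown(); // this task is now running on its own thread
                try { release.await(); } catch (InterruptedException ignored) {}
            });
        }
        // With a fixed pool of size < n this would time out, because the
        // first tasks never finish and the rest would sit in the queue.
        boolean ok = started.await(5, TimeUnit.SECONDS);
        release.countDown();
        cached.shutdown();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(allStartImmediately(8)); // prints true
    }
}
```

This is exactly the property the comment relies on: every pending I/O request gets a thread (and hence a syscall) immediately, leaving scheduling across disks to the OS.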
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844658#action_12844658 ]

Jay Booth commented on HDFS-918:

bq. > I think it is very important to have separate pools for each partition.
bq. > This would be the case if I were using a fixed-size thread pool and a LinkedBlockingQueue - but I'm not, see Executors.newCachedThreadPool()
bq. hmm.. does it mean that if you have a thousand clients and the load is disk bound, we end up with 1000 threads?

Yeah, although it'll likely turn out to be less than 1000 in practice. If the requests are all short-lived, it could be significantly less than 1000 threads when you consider re-use; if it's 1000 long reads, it'll probably wind up being only a little less, if at all. The threads themselves are really lightweight -- the only resource attached to each is a ThreadLocal ByteBuffer(8096). (8k seemed ok for the ByteBuffer because the header+checksums portion is always significantly less than that, and the main block file transfers are done using transferTo.)

I chose this approach after initially experimenting with a fixed-size threadpool and LinkedBlockingQueue because the handoff is faster and every pending I/O request is guaranteed to become an actual disk-read syscall waiting on the operating system as fast as possible. This way, the operating system decides which disk request to fulfill first, taking advantage of the lower-level optimizations around disk I/O. Since the threads are pretty lightweight and the lower-level calls do a better job of optimal fulfillment, I think this will work better than a fixed-size threadpool, where, for example, 2 adjacent reads from separate threads could be separated from each other in time, whereas the disk controller might fulfill both simultaneously and faster. This becomes even more important, I think, with the higher 512kb packet size -- those are big chunks of work per syscall that can be optimized by the underlying OS.

Regarding the extra resource allocation for the threads -- if we're disk-bound, then generally speaking a few extra memory resources shouldn't be a huge deal; the gains from dispatching more disk requests in parallel should outweigh the memory allocation and context switch costs. The above is all in theory -- I haven't benchmarked parallel implementations head-to-head. But certainly for random reads, and likely for longer reads, this approach should get the syscall invoked as fast as possible. Switching between the two models would be pretty simple: just change the parameters we pass to the ThreadPoolExecutor constructor.
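The per-thread buffer arrangement described above (one small ByteBuffer per worker, lazily created and reused across tasks) can be sketched with ThreadLocal. The 8096-byte size is the figure quoted in the comment; the class and method names are illustrative, not from the patch:

```java
import java.nio.ByteBuffer;

// Sketch: each worker thread lazily allocates one small buffer for packet
// headers and checksums and reuses it for every task it executes, so an
// idle-purged cached pool carries almost no per-thread memory cost.
public class PacketBuffers {
    // Size from the JIRA comment; headers + checksums always fit well under this.
    private static final int HEADER_AND_SUMS_SIZE = 8096;

    private static final ThreadLocal<ByteBuffer> BUF =
            ThreadLocal.withInitial(() -> ByteBuffer.allocate(HEADER_AND_SUMS_SIZE));

    // Repeated calls from the same thread return the same buffer, cleared.
    public static ByteBuffer get() {
        ByteBuffer b = BUF.get();
        b.clear();
        return b;
    }
}
```

The main block data never passes through this buffer at all; as the comment says, it goes via transferTo(), which hands bytes from the file channel to the socket in the kernel.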
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844174#action_12844174 ] Jay Booth commented on HDFS-918: Yeah, it only uses nonblocking pread ops on the block and block.meta files. It sends the packet header and checksums in one packet (maybe just part of the checksums if the TCP buffer was full), then repeatedly makes requests to send PACKET_LENGTH (default 512kb) bytes until they're sent. When I had some trace logging enabled, I could see the TCP window scale up -- the first request sent 96k in a packet, then it scaled up to 512k per packet after a few. Here's a (simplified) breakdown of the control structures and main loop:

{{DataXCeiverServer}} -- accepts conns, creates thread per conn. From thread: read OP, blocking; if we're a read request and multiplex is enabled, delegate to MultiplexedBlockSender and die; otherwise instantiate DataXCeiver (which now takes op as an arg) and call xceiver.run().

{{MultiplexedBlockSender}} -- maintains ExecutorService, SelectorThread and exposes a public register() method. register(Socket conn) configures nonblocking, sets up a Connection object, dispatches the first packet-send as an optimization, then puts the Future<Connection> in an inbox for the selector thread.

{{SelectorThread}} -- maintains Selector, BlockingQueue<Future<Connection>> inbox, LinkedList<Future<Connection>> processingQueue. Main loop: 1) pull all futures from inbox, add to processingQueue 2) iterate/poll/remove Futures from processingQueue, then close/re-register those that finished sending a packet as appropriate (linear time, but pretty fast) 3) select 4) dispatch selected connections, add their Futures to processingQueue.

{{Connection.sendPacket(BlockChannelPool, ByteBuffer)}} -- workhorse method, invoked via Callable. Maintains a bunch of internal state variables per connection. Fetches the BlockChannel object from the BlockChannelPool -- BlockChannel only exposes p-read methods for the underlying channels. Buffers the packet header and sums, sends, records how much was successfully sent -- if less than 100%, return and wait for writable. Tries to send PACKET_LENGTH bytes from the main file via transferTo; if less than 100%, return and wait for writable. Marks itself as either FINISHED or READY, depending on whether that was the last packet.

Regarding file IO, I don't know if it's faster to send the packet header as its own 13-byte packet and use transferTo for the meta file, or to do what I'm doing now and buffer them into one packet. I feel like it'll be a wash -- or at any rate a minor difference, because the checksums are so much smaller than the main data. What do people think about a test regime for this? It's a really big set of changes, but it opens up a lot of doors (particularly connection re-use with that register() paradigm), seems to perform equal/better depending on the case, gets a big win on open file descriptors and factors all of the server-side protocol logic into one method, instead of spread out across several classes. I certainly understand being hesitant to commit such a big change without some pretty extensive testing, but if anyone has direction as to what they'd like to see tested, that'd be awesome. I'm already planning on setting up some disk-bound benchmarks now that I've tested network-bound ones.. anything else that people want to see? It seems to pass all unit tests; my last run had a couple of seemingly pre-existing failures, but 99% of them passed. I guess I should do another full run and account for any that don't pass while I'm at it. 
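A rough skeleton of the SelectorThread main loop described above might look like the following; Conn, its sendPacket() stand-in, and the executor wiring are simplified placeholders for the patch's classes, not the actual implementation:

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.concurrent.*;

class SelectorLoop implements Runnable {
    static class Conn {
        final SocketChannel channel;
        volatile boolean finished;            // FINISHED vs READY after a packet
        Conn(SocketChannel c) { channel = c; }
        boolean sendPacket() { return finished; }  // stand-in workhorse method
    }

    final Selector selector;
    final ExecutorService workers = Executors.newCachedThreadPool();
    final BlockingQueue<Future<Conn>> inbox = new LinkedBlockingQueue<>();
    final LinkedList<Future<Conn>> processing = new LinkedList<>();

    SelectorLoop() throws IOException { selector = Selector.open(); }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                inbox.drainTo(processing);                        // 1) pull inbox
                Iterator<Future<Conn>> it = processing.iterator();
                while (it.hasNext()) {                            // 2) poll futures
                    Future<Conn> f = it.next();
                    if (!f.isDone()) continue;
                    it.remove();
                    Conn c = f.get();
                    if (c.finished) c.channel.close();            // last packet sent
                    else c.channel.register(selector, SelectionKey.OP_WRITE, c);
                }
                selector.select(100);                             // 3) select
                for (SelectionKey k : selector.selectedKeys()) {  // 4) dispatch
                    k.interestOps(0);               // park key while a worker owns it
                    Conn c = (Conn) k.attachment();
                    processing.add(workers.submit(() -> { c.sendPacket(); return c; }));
                }
                selector.selectedKeys().clear();
            } catch (Exception e) {
                break;                              // simplified error handling
            }
        }
    }
}
```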
[jira] Commented: (HDFS-1034) Enhance datanode to read data and checksum file in parallel
[ https://issues.apache.org/jira/browse/HDFS-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844186#action_12844186 ] Jay Booth commented on HDFS-1034: - Wouldn't this preclude the use of transferTo for the transfer from the main block file? The packet header and sums need to be sent before transferTo is invoked; otherwise things would all be jumbled up together. Enhance datanode to read data and checksum file in parallel --- Key: HDFS-1034 URL: https://issues.apache.org/jira/browse/HDFS-1034 Project: Hadoop HDFS Issue Type: Improvement Reporter: dhruba borthakur Assignee: dhruba borthakur In the current HDFS implementation, a read of a block issued to the datanode results in a disk access to the checksum file followed by a disk access to the data file. It would be nice to be able to do these two IOs in parallel to reduce read latency.
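The ordering constraint behind this objection can be seen in a minimal sketch: the header and checksums must be fully written from a user-space buffer before transferTo streams the block data, because transferTo bypasses user-space entirely. PacketWriter and writePacket are illustrative names, not HDFS code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class PacketWriter {
    /** Write header+sums from a heap buffer, then zero-copy the data range. */
    static long writePacket(ByteBuffer headerAndSums, FileChannel blockFile,
                            long offset, long length, WritableByteChannel out)
            throws IOException {
        while (headerAndSums.hasRemaining()) {
            out.write(headerAndSums);         // ordinary buffered write goes first
        }
        long sent = 0;
        while (sent < length) {               // then zero-copy from the block file
            long n = blockFile.transferTo(offset + sent, length - sent, out);
            if (n <= 0) break;                // would-block: caller waits for writable
            sent += n;
        }
        return sent;
    }
}
```

If checksum reads happened in parallel with the data transfer, the two writes could interleave on the socket, which is the jumbling the comment refers to.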
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844191#action_12844191 ] Jay Booth commented on HDFS-918: I'll enthusiastically cheerlead? :) More seriously, I'm willing to put in the work to get everyone to a comfort level with this patch so it gets committed in some form and we get some wins out of it, but the write path is more complicated, I don't understand it as well, and I'm honestly going to need a bit of a break over the summer. I'd love to help in whatever way I can, but I'm probably not going to have the bandwidth to be the main driver of it in the short term. I don't fully grok the write pipeline and the intricacies of append yet, so at the least someone else would have to be involved.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100309.patch New patch and better benchmarks.

Environment: 8x2GHz, 7GB RAM namenode and DFS client; 8x2GHz, 7GB RAM datanode.

Streaming, single-threaded: 60 runs over a 100MB file, presumed in memory, so the network is the chokepoint.
* Current DFS: 92 MB/s over 60 runs
* Multiplex: 97 MB/s over 60 runs (either random variation, or maybe the larger packet size helps)

Streaming, multi-threaded: 32 threads reading the 100MB file, 60x each.
* Both around 3.25 MB/s/thread, 104 MB/s aggregate -- network saturation.

Random reads: the multiplexed implementation saves about 1.5 ms, probably by avoiding extra file-opens and buffer allocation.
* 5 iterations of 2000 reads each, 32kb, front of file, single-threaded
* Splits for current DFS: 5.3, 4.6, 5.0, 4.4, 6.4
* Splits for multiplex: 3.2, 3.0, 4.6, 3.3, 3.2
* Multithreaded concurrent read speeds on a single host converged with more threads -- probably client-side delay negotiating lots of new TCP connections.

File handle consumption: both rest at 401 open files (mostly jars). When doing random reads across 128 threads, BlockSender spikes to 1150, opening a blockfile, metafile, selector and socket for each concurrent connection. MultiplexedBlockSender only jumps to 530, with just the socket as a per-connection resource; blockfiles, metafiles and the single selector are shared.

I'll post a comment later with an updated description of the patch, and when I get a chance I'll run some more disk-bound benchmarks; I think the asynchronous approach will pay some dividends there by letting the operating system do more of the work. 
Super brief patch notes:
* eliminated silly add'l dependency on commons-math, now has no new dependencies
* incorporated Zlatin's suggestions upthread to do asynchronous I/O, 1 shared selector
* BlockChannelPool is shared across threads
* buffers are threadlocal so they'll tend to be re-used rather than re-allocated
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100228.patch New patch -- took Zlatin's advice and utilized selectionKey.interestOps(0) to avoid busy waits, so we're back to a single selector and an ExecutorService. The ExecutorService reuses threads if possible, destroying threads that haven't been used in 60 seconds. Analyzed logs, and the selector thread doesn't seem to busy-wait ever. Buffers are now stored in threadlocals and allocated per thread (they're now HeapByteBuffers, since we might have some churn and most of our transfer uses transferTo anyway). Still uses the shared BlockChannelPool implemented via ReadWriteLock. I think this will be pretty good, will benchmark tonight.
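A shared, read-write-locked channel pool of the kind mentioned here could look roughly like this sketch -- lookups take the read lock, a miss upgrades to the write lock to open and cache the channel; class and method names are assumptions, not the patch's code. Positional reads and transferTo are what make a cached channel safe to share between concurrent requests:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BlockChannelPool {
    private final Map<String, FileChannel> channels = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    FileChannel get(File blockFile) throws IOException {
        String key = blockFile.getAbsolutePath();
        lock.readLock().lock();                   // common case: already open
        try {
            FileChannel c = channels.get(key);
            if (c != null) return c;
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();                  // miss: open and cache
        try {
            FileChannel c = channels.get(key);    // re-check under write lock
            if (c == null) {
                c = new FileInputStream(blockFile).getChannel();
                channels.put(key, c);
            }
            return c;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

A real pool would also need the cap-and-evict behavior described elsewhere in the thread; this sketch omits it.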
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834252#action_12834252 ] Jay Booth commented on HDFS-918: Yeah, I'll do my best to get benchmarks by the end of the weekend; kind of a crazy week this week, and I moved this past weekend, so I don't have a ton of time. Todd, if you feel like blasting a couple of stream-of-consciousness comments to me via email, go right ahead; otherwise I'll run the benchmarks this weekend and wait for the well-written version :). Zlatin, I originally had a similar architecture to what you're describing, using a BlockingQueue to funnel the actual work to a threadpool, but I had some issues getting the locking quite right: either I wasn't getting things into the queue as fast as possible, or I was burning a lot of empty cycles in the selector thread. Specifically, I can't cancel a SelectionKey and then re-register with the same selector afterwards -- it leads to exceptions -- so my Selector thread was spinning in a tight loop verifying that, yes, all of these writable SelectionKeys are currently enqueued for work, whenever anything was being processed. But that was a couple of iterations ago; maybe I'll have better luck trying it now. What we really need is a libevent-like framework; I'll spend a little time reviewing the outward-facing API for that and scratching my noggin. Ultimately, only so much I/O can actually happen at a time before the disk is swamped, so it might be that a set of, say, 32 selector threads gets the same performance as 1024 threads. In that case, we'd be taking up fewer resources for the same performance. At any rate, I need to benchmark before speculating further. 
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834321#action_12834321 ] Jay Booth commented on HDFS-918: Thanks Zlatin, I think you're right. I'll look at finding a way to remove writable interest without cancelling the key -- that could fix the busy-looping issue. Then I could use a condition to ensure wakeup when something is newly writable-interested (via completed packet or new request) and refactor back to a single selector thread and several executing threads. I'll make a copy of the patch and try benchmarking both methods.
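The idiom being proposed -- keep the key but zero its interest set while a worker owns the connection, instead of cancelling and re-registering -- can be sketched in a few lines (helper names are illustrative):

```java
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class InterestToggle {
    // Park: the key stays registered, but select() no longer reports it,
    // avoiding both the busy loop and the cancelled-key re-register exceptions.
    static void park(SelectionKey key) {
        key.interestOps(0);
    }

    // Resume: restore writable interest and kick a select() that may be blocked.
    static void resume(SelectionKey key, Selector selector) {
        key.interestOps(SelectionKey.OP_WRITE);
        selector.wakeup();
    }
}
```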
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100203.patch New patch. Streamlined MultiplexedBlockSender; we now have one selector per worker thread and no BlockingQueues -- writable connections are handled inline by each thread as they become available. Includes a utility class to read a file with a bunch of threads and time them. Ran some ad hoc jobs on my laptop and got similar performance to the existing BlockSender, slightly faster for a single file and slightly slower for 15 competing localhost threads.. which is exactly the opposite of what I boldly predicted. I read somewhere that linux thread scheduling for Java is disabled because it requires root, so it ignores priority -- if that's the case, maybe running more threads is actually an advantage when all the readers are local and you're directly competing with them for CPU: you compete more effectively for limited resources with more threads. I'm gonna try and write an MR job to run some different scenarios on a cluster soon (thundering herd, steady medium, large number of idles, individual read).. I think the architecture here is more suited to large numbers of connections, so if it did ok under a small number, then great. I'll be pretty busy for the next month or so but will try to get this running in a cluster at some point and report some more interesting numbers. 
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100201.patch New patch.. * new configuration params: dfs.datanode.multiplexBlockSender=true, dfs.datanode.multiplex.packetSize=32k, dfs.datanode.multiplex.numWorkers=3 * Packet size is tuneable, possibly allowing better performance with larger TCP buffers enabled * Workers only wake up when a connection is writable * 3 new class files, minor changes to DataXceiverServer and DataXceiver, 2 utility classes added to DataTransferProtocol (one stolen from HDFS-881) * Passes tests from earlier comment plus a new one for files with lengths that don't match up to checksum chunk size, as well as holding up to some load on TestDFSIO * Still fails all tests relying on SimulatedFSDataset * Has a large amount of TRACE level debugging going on in MultiplexedBlockSender in case anybody wants to watch the output * Adds dependencies for commons-pool and commons-math (for benchmarking code) * Doesn't yet have benchmarks, but those should be easy now that the configuration is all in place
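For reference, the parameters listed above would presumably be set in hdfs-site.xml in the standard Hadoop property format; the property names come from the comment, but the exact value encodings (e.g. bytes for packet size) are assumptions:

```xml
<!-- hdfs-site.xml fragment (sketch) enabling the multiplexed sender -->
<property>
  <name>dfs.datanode.multiplexBlockSender</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.multiplex.packetSize</name>
  <value>32768</value> <!-- 32k default per the patch notes -->
</property>
<property>
  <name>dfs.datanode.multiplex.numWorkers</name>
  <value>3</value>
</property>
```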
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828009#action_12828009 ] Jay Booth commented on HDFS-918: I haven't had a chance to run benchmarks yet, but I think that under lots of connections, the thread-per-connection model will spend more time swapping compared to getting work done; plus it has a few places where it hot-blocks by doing while (buff.hasRemaining()) { write() }. Only selecting the currently writeable connections and scheduling them sidesteps both issues while being less of a resource footprint - assuming it delivers on the performance. As soon as I get a chance, I'll write some benchmarks. If anyone wants to take a look at the code in the meantime, I think this patch is pretty easy to set up -- just enable MultiplexBlockSender.LOG for TRACE and run tests, and you can see how each packet is built and sent. 'ant compile eclipse-files' will set up the extra dependencies on commons-pool and commons-math.
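The hot-blocking pattern called out here, versus the multiplexed alternative, might be contrasted like this (a sketch; class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

class WriteStrategies {
    // Thread-per-connection style: spins (burning a thread) while the
    // TCP buffer is full -- write() may return 0 repeatedly.
    static void blockingDrain(WritableByteChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }

    // Multiplexed style: one attempt, report progress; if the buffer
    // didn't drain, the selector re-schedules us on the next OP_WRITE.
    static boolean writeOnce(WritableByteChannel ch, ByteBuffer buf) throws IOException {
        ch.write(buf);
        return !buf.hasRemaining();   // false => wait for writable, retry later
    }
}
```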
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806460#action_12806460 ] Jay Booth commented on HDFS-918: I'll definitely perform lots of benchmarking before asking for a more formal review for commit. To be clear, this patch by itself doesn't open up any additional ports/services and attempts minimal change to existing code -- the only major change is in opRead on DataXCeiver, replacing creation/execution of a BlockSender with a quick call to register() on my new readserver class (which should probably be renamed to MultiplexedBlockSender or something, since it's not actually a server and doesn't expose any new services). The other changes are just the addition of 3 new classes. My newest patch is still at home and I'll upload it this weekend when I have some time; the current patch here doesn't completely work. Right now I expect this to yield some pretty decent savings in CPU time, but maybe not in wall-clock time. I'm thinking about 2 approaches: running the datanode with time while performing a bunch of IO tasks and checking CPU usage afterwards, and measuring total throughput in TestDFSIO on an oversubscribed cluster so that we get CPU-bound and can demonstrate improvement that way. Any preference? Other suggestions? I'm a little worried that simply timing the datanode will measure things like how long I waited after start to launch a job, or background replication.. but I suppose those problems exist with any measurement. Also, I'm planning to add a parameter to configure usage of BlockSender vs MultiplexedBlockSender -- does dfs.datanode.multiplexReads=true sound good? 
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806476#action_12806476 ] Jay Booth commented on HDFS-918: Good call Todd, that's definitely the canonical case for multiplexing, plus it should show some benefit because I recycle file handles across requests by pooling them, rather than opening a new file per request. I'll see if I can get something along those lines set up.
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804561#action_12804561 ] Jay Booth commented on HDFS-918: Current patch is having some issues in terms of actual use -- seems to pass a lot of tests, but I'm having problems with ChecksumExceptions running actual MR jobs over it, so clearly it's a work in progress still :) I have a better patch that fixes some of the issues but it's still not 100%, so I'll upload with some new tests once I resolve the remaining issues. Currently passing TestPread, TestDistributedFileSystem, TestDataTransferProtocol and TestClientBlockVerification, but still getting issues when actually running the thing -- if somebody has recommendations for other tests to debug with, that'd be very welcome.

ARCHITECTURE
1) On client connect to the DataXCeiver server, dispatch a thread as per usual, except now the thread is extremely short-lived. It simply registers the connection with server.datanode.ReadServer and dies. (It would be more efficient to have some sort of accept loop that didn't spawn a thread here, but I went with lowest-impact integration.)
2) On register of a connection, ReadServer creates a Connection object and registers the channel with a selector inside of ReadServer.ResponseWriter. ResponseWriter maintains an ArrayBlockingQueue<Connection> workQueue and polls the selector, cancelling keys and adding connections which are ready for write to the work queue. ReadServer also maintains a BlockChannelPool, which is a pool of BlockChannel objects -- each BlockChannel represents the file and meta-file for a given block.
3) A small Handler pool takes items off of this work queue and calls connection.sendPacket(buffer, channelPool). Each handler maintains a DirectByteBuffer, instantiated at startup time, which it uses for all requests. 
4) Connection.sendPacket(buffer, channelPool) consults internal state about what needs to be sent next (response headers, packet headers, checksums, bytes) and sends what it can, updating internal state variables. Uses the provided buffer and channelPool to do its work. Uses transferTo unless the config property for transferTo is disabled. Right now it actually sends 2 packets per packet (header+sums and then bytes), once I resolve all correctness bugs it may be worth combining the two into one packet for small reads. 5) After work has been done and internal state updated (even if only 1 byte was sent), Handler re-registers the Connection with ResponseWriter for further writes, or closes it if we're done. Once I have this fully working, I'd expect CPU savings from fewer long-running threads and less garbage collection of buffers, perhaps a small performance boost from the select-based architecture and using DirectByteBuffers instead of HeapByteBuffers, and a slight reduction in IOWAIT time under some circumstances because we're pooling file channels rather than re-opening for every request. It should also consume far fewer xceiver threads and open file handles while running -- the pool is capped, so if we start getting crazy numbers of requests, we'll close/re-open files as necessary to stay under the cap. INTEGRATION As I said, I made the DataXCeiver thread for opReadBlock register the channel and then die. This is probably the best way to go even though it's not optimal from a performance standpoint. Unfortunately, since DataXCeiver threads close their sockets when they die, I had to put a special boolean case 'skipClose' to avoid that if op == Op.READ, which is kind of ugly -- recommendations are welcome for what to do here. Also, as I noted earlier, the BlockChannelPool requires an instance of FSDataset to function, rather than FSDatasetInterface, because Interface doesn't supply any getFile() methods, just getInputStream() methods. 
Probably the best way to handle this for tests would be to have SimulatedFSDataset write actual files somewhere under /tmp and provide handles to those files when running tests. Any thoughts from anyone?

FUTURE

Once I get this working, it might be worth exploring using this as a mechanism for repeated reads over the same datanode connection, which could give some pretty big performance gains to certain applications. Upon completion of a request, the ReadServer could simply put the connection in a pool that's polled every so often for isReadable() -- if it's readable, read the request and re-register with the ResponseWriter. Along those lines, once we get there it might wind up being simpler to do the initial read-request through that mechanism as well, which would mean we could get rid of some of the messy integration with DataXceiverServer -- however, it would require opening up another port just for reads. What are people's thoughts on that? I won't make that a goal of this patch but I'd be
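The capped BlockChannelPool behavior mentioned above (close/re-open files to stay under the cap) can be sketched as an access-ordered LRU map. This is an assumption-laden illustration, not the pool from the patch; String stands in for a block id and for an open channel pair:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a capped, LRU-evicting channel pool.
public class ChannelPoolSketch {
    static class LruPool<K, V> extends LinkedHashMap<K, V> {
        private final int cap;
        LruPool(int cap) {
            super(16, 0.75f, true); // access-order: get() refreshes an entry's recency
            this.cap = cap;
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // In a real pool this is where the eldest block's channels get closed.
            return size() > cap;
        }
    }

    public static void main(String[] args) {
        LruPool<String, String> pool = new LruPool<>(2);
        pool.put("blk_1", "channels-1");
        pool.put("blk_2", "channels-2");
        pool.get("blk_1");                // touch blk_1, so blk_2 becomes eldest
        pool.put("blk_3", "channels-3");  // over the cap: blk_2 is evicted
        System.out.println(pool.keySet());
    }
}
```

Under load beyond the cap, the pool degrades to the old open-per-request behavior rather than exhausting file handles.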
[jira] Created: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
Use single Selector and small thread pool to replace many instances of BlockSender for reads Key: HDFS-918 URL: https://issues.apache.org/jira/browse/HDFS-918 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Reporter: Jay Booth Fix For: 0.22.0 Currently, on read requests, the DataXCeiver server allocates a new thread per request, which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage by the sending threads. If we had a single selector and a small threadpool to multiplex request packets, we could theoretically achieve higher performance while taking up fewer resources and leaving more CPU on datanodes available for mapred, hbase or whatever. This can be done without changing any wire protocols. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-multiplex.patch Here's a first implementation -- it works, passes TestDistributedFileSystem, TestDataTransferProtocol and TestPread. However, it has a direct dependency on FSDataset (not FSDatasetInterface) because it needs to get ahold of files directly to open FileChannels. This leads to ClassCastExceptions in all tests relying on SimulatedFSDataset. Would love to hear feedback about a way to resolve this. Have not benchmarked yet, I'll post another comment with an architectural description.
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Resolution: Won't Fix Status: Resolved (was: Patch Available) This work didn't really go anywhere. I tried multithreading on the client side to make a request-per-packet approach viable, but it wound up being even slower than single-threaded request-per-packet. If there's interest in the client-side byte cache, I could create another issue with just that improvement. Low Latency distributed reads - Key: HDFS-516 URL: https://issues.apache.org/jira/browse/HDFS-516 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jay Booth Priority: Minor Attachments: hdfs-516-20090912.patch, radfs.odp Original Estimate: 168h Remaining Estimate: 168h I created a method for low latency random reads using NIO on the server side and simulated OS paging with LRU caching and lookahead on the client side. Some applications could include lucene searching (term-doc and doc-offset mappings are likely to be in local cache, thus much faster than nutch's current FsDirectory impl) and binary search through record files (bytes at 1/2, 1/4, 1/8 marks are likely to be cached).
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755823#action_12755823 ] Jay Booth commented on HDFS-516: Yeah, I was puzzled by the performance too. I dug through the DFS code and I'm saving a bit on new socket and object creation, maybe a couple instructions here and there, but that shouldn't add up to 100 seconds for a gigabyte (approx 20 blocks). I'm calling read() a bajillion times in a row so it's conceivable (although unlikely) that I'm pegging the CPU and that's the limiting factor. I'm busy for a couple days but will get back to you with some figures from netstat, top and whatever else I can think of, along with another streaming case that works with read(b, off, len) to see if that changes things. I'll do a little more digging into DFS as well to see if I can isolate the cause. I definitely did run them several times on the same machine and another time on a different cluster with similar results, so it wasn't simply bad luck on the rack placement on EC2 (well maybe but unlikely). Will report back when I have more numbers. After I get those, my roadmap for this is to add checksum support and better DatanodeInfo caching. User groups would come after that.
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755120#action_12755120 ] Jay Booth commented on HDFS-516: Hey Todd, in short, I agree, we should be looking at moving performance improvements over to the main FS implementation. Right now, my version doesn't support user permissions or checksumming. I'd say it makes sense to keep it in contrib as a sandbox for now, and work towards full compatibility with the main DFS implementation, at which point we could consider swapping in the new reading subsystem? User permissioning would require some model changes but should be workable; checksumming probably won't be too bad if I read the code right. So, I suppose keep it in contrib as a sandbox initially with an explicit goal of moving it over to DFS when it reaches compatibility? It doesn't really lend itself to moving over piecemeal, as it has several components which all pretty much need each other. However, it's pretty well integrated with the DFS API and only replaces one method on the filesystem class.
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: hdfs-516-20090831.patch)
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: radfs.patch)
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12750246#action_12750246 ] Jay Booth commented on HDFS-516: I did some benchmarking, here are the results. Each test ran 1000 searches to warm, then 5000 searches to benchmark: binary search of a 20GB sorted sequence file of 20 million 1kb records. Tests were run from the namenode in a 4-node EC2 medium cluster, 1.7 GB of ram each: 1 namenode and 3 datanodes. From HDFS to a 512MB-cached RadFS there was a 4X average improvement in search times, from 102ms to 24ms. Each search was, theoretically, 24.25 reads (log 2 of 20 million) -- not actually measured. I only ran each set once. The 90th percentile line trends the right way, although the max line is a little spikey. I'll add a 99th percentile in future benchmarks.

HDFS, baseline (org.apache.hadoop.hdfs.DistributedFileSystem):
Mean: 102.178415 Variance: 5939.660105461091 Median: 97.0 Max: 3095.0 Min: 33.0 90th pct: 130.0

Rad, no cache (org.apache.hadoop.hdfs.rad.RadFileSystem):
Mean: 68.556402 Variance: 233.8335857571515 Median: 67.0 Max: 379.0 Min: 26.0 90th pct: 79.0

Rad, 16MB cache:
Mean: 42.0397985 Variance: 237.83818359671966 Median: 40.0 Max: 203.0 Min: 5.0 90th pct: 59.0

Rad, 128MB cache:
Mean: 29.8506007 Variance: 202.08189601920367 Median: 27.0 Max: 203.0 Min: 1.0 90th pct: 45.0

Rad, 512MB cache:
Mean: 24.2746014 Variance: 250.3052558911758 Median: 22.0 Max: 687.0 Min: 0.0 90th pct: 36.0

I could still shave a point or two by cleaning up my caching system to be more graceful with its lookahead mechanism, but not bad for now. I'll pretty it up and post a first attempt at a final patch soon.
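The "24.25 reads per search" figure above is just log2 of the record count -- a binary search over N sorted records touches about log2(N) of them. A quick check:

```java
// Sanity-check the expected-reads-per-search figure quoted above.
public class ExpectedReads {
    public static void main(String[] args) {
        double n = 20_000_000d; // 20 million 1kb records
        double expectedReads = Math.log(n) / Math.log(2); // log2(n)
        System.out.println("log2(20 million) = " + Math.round(expectedReads * 100) / 100.0);
    }
}
```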
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: hdfs-516-20090831.patch New patch. The IPC server was too slow for IO operations (like 40 times slower than DFS without caching), so I wrote a custom ByteServer that's streamlined to avoid object creation and byte copying whenever possible and defaults to TCP nodelay. Client connections are pooled using commons-pool. Uses static methods in hdfs.rad.ByteServiceProtocol for all serialization, faster than reflection. On the laptop in pseudodistributed mode, I'm seeing 5X faster than DFS for random searches. Refactored a bunch on the client side and eliminated a few redundant classes; still need to make lookahead happen via a separate thread in the caching byte service and tweak a couple things in ByteServer for performance, then this thing will be pretty fast. I'm gonna run some numbers on EC2 tonight/tomorrow and see what I come up with. Also cleaned up unit tests to JUnit 4 and added some javadoc -- probably missed a bunch of places and could certainly expand on all of it. Haven't added the license header to every file yet; license explicitly granted here, will get to that for the next patch.
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: hdfs-516-20090824.patch Changed caching system to ehcache, which gives us a configurable API with better caching options. Fixed all known correctness bugs and added lots of test coverage. Got SequenceFileSearcher working, but it depends on HADOOP-6196 -- I could fold the searcher code and searcher test into HADOOP-6196 if people are interested in the functionality. Added benchmarking utils: .sh scripts intended to be run from HADOOP_HOME/bin. Note, if anyone installs, run ant bin-package from project root before trying to run radfs unit tests. TODO: benchmark, javadoc, pretty up codebase, change unit tests to JUnit 4 style, add filesystem contract unit test. I'll get a fork going on github soon if anyone's interested in trying it out.
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738651#action_12738651 ] Jay Booth commented on HDFS-516: Wow, thanks Raghu, that's awesome and will save me a ton of time. A couple points for discussion:

* The random 4k byte grabber is awesome and I will be using it as part of my benchmarking at the first opportunity. However, I think it's worth also testing some likely applications to really show the strength of client-side caching. 10MB or so of properly warmed cache could mean your first 20 lookups in a binary search are almost free, and having the frontmost 10% of a lucene index in cache will mean that almost all of the scoring portion of the search is computed against local memory. Meanwhile, for truly random reads, having a cache that's, say, 5-10% of the size of the data will only get you a small improvement. So I'd like to get some numbers for use cases that really thrive on caching in addition to truly random access. But that will be extremely useful for tuning the IO layer and establishing a baseline for cache-miss performance, so thanks for the heads up.

* I have a feeling that my implementation is significantly slower than the default when it comes to streaming, since it relies on successive small positioned reads and a heavy memory footprint rather than a simple stream of bytes. Watching my unit tests run on my laptop with a ton of confounding factors sure seemed that way, although that's not a scientific measurement (one more item to benchmark). So while I agree with the urge for simplicity, I feel like we need to make that performance tradeoff clear. Otherwise, we could have a lot of very slow mapreduce jobs happening. Given that MapReduce is the primary use case for Hadoop, my instinct was to make RadFileSystem a non-default implementation.
Point very well taken about the BlockLocations and CRC verification. Maybe the best way to handle future integration with DataNode would be to develop separately, reuse as much code as possible, and then when RadFileSystem is mature and benchmarked we can revisit a merge with DistributedFileSystem? Thanks again, I'll try and write a post later tonight with an explicit plan for benchmarking and then maybe people can comment and poke holes in it as they see fit?
[jira] Created: (HDFS-516) Low Latency distributed reads
Low Latency distributed reads - Key: HDFS-516 URL: https://issues.apache.org/jira/browse/HDFS-516 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jay Booth Priority: Minor I created a method for low latency random reads using NIO on the server side and simulated OS paging with LRU caching and lookahead on the client side. Some applications could include lucene searching (term-doc and doc-offset mappings are likely to be in local cache, thus much faster than nutch's current FsDirectory impl) and binary search through record files (bytes at 1/2, 1/4, 1/8 marks are likely to be cached).
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Status: Patch Available (was: Open) Here's the initial patch. I have some pretty decent byte-consistency and integration tests wrapping it but no actual applications built using it. I started on a SequenceFileSearcher that implemented binary search, but dealing with the sync points was sticky so it's not working yet (and commented out, with associated failing test). I'm hoping to get that working in the next couple weeks when I have some time, as an example, along with doing a performance comparison between nutch FsDirectory using DistributedFileSystem and RadFileSystem to see if this shows the gains that I'm thinking it will. My only change to core HDFS was to add getFile() to FSDatasetInterface so that the RadNode (plugged into datanode via ServicePlugin) can open FileChannels.
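Why getFile() matters: with an actual File in hand you can open a FileChannel and serve concurrent requests with positioned reads, which never touch the channel's shared position, so one open channel can be safely shared. A minimal illustration against a temp file (not code from the patch):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demonstrates positioned reads on a shared FileChannel, as a block server might use.
public class PositionedReadSketch {
    public static void main(String[] args) throws IOException {
        Path blockFile = Files.createTempFile("blk_", ".data");
        try {
            Files.write(blockFile, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(blockFile, StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate(4);
                // Read 4 bytes starting at offset 3; the channel's position is unchanged,
                // so other requests can read the same channel concurrently.
                ch.read(buf, 3);
                System.out.println(new String(buf.array(), StandardCharsets.US_ASCII));
            }
        } finally {
            Files.deleteIfExists(blockFile);
        }
    }
}
```

A stream-based API (getInputStream()) hides the file, which is why the patch needs the concrete FSDataset.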
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: radfs.patch
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737713#action_12737713 ] Jay Booth commented on HDFS-516: Here's some architectural overview and a general request for comments on the matter. I'll be away and busy the next few days but should be able to get back to this in the middle of next week. The basic workflow: I created a RadFileSystem (RandomAccessDistributed FS) which wraps DistributedFileSystem and delegates to it for everything except getFSDataInputStream. That returns a custom FSDataInputStream which wraps a CachingByteService, which itself wraps a RadFSByteService. The caching byte services share a cache which is managed by the RadFSClient class (could maybe factor that away and put it in RadFileSystem instead). They try to hit the cache, and if they miss, they call the underlying RadFSClientByteService to get the requested page plus a few pages of lookahead. The RadFSClientByteService calls the namenode to get the appropriate block locations (TODO: cache these effectively) and then calls the RadNode, which is embedded in the DataNode via ServicePlugin and maintains an IPC server and a set of FileChannels to the local blocks. On repeated requests for the same data, the RadFSClient tends to favor going to the same host, figuring that the benefit of hitting the DataNode's OS cache for the given bytes outweighs the penalty of hopping a rack in terms of reducing latency (untested assumption). The intended use case is pretty different from MapReduce, so I think this should be a contrib module that has to be explicitly invoked by clients. It really underperforms DFS in terms of streaming but should (haven't tested extensively outside of localhost) significantly outperform it in terms of random reads.
In terms of files with 'hot paths', such as lucene indices or binary search over a normal file, cache hit percentage is likely to be pretty high, so it should probably perform pretty well. Currently, it makes a fresh request to the NameNode for every read, which is inefficient but more likely to be correct. Going forward, I'd like to tighten this up, make sure it plays nice with append and get it into a future Hadoop release.
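The decorator chain described above (CachingByteService wrapping a remote byte service, with page-granular caching and lookahead) can be sketched as follows. All names here are simplified stand-ins for the classes in the patch, and the "remote" service just fabricates pages:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the caching decorator over a remote page-read service.
public class CachingByteServiceSketch {
    interface ByteService {
        byte[] readPage(long pageNo);
    }

    // Stand-in for the client byte service that fetches a page from a datanode.
    static class RemoteByteService implements ByteService {
        int fetches = 0;
        public byte[] readPage(long pageNo) {
            fetches++;
            return new byte[4096]; // pretend page contents
        }
    }

    static class CachingByteService implements ByteService {
        private final ByteService delegate;
        private final int lookahead;
        private final Map<Long, byte[]> cache = new HashMap<>(); // a real impl would be LRU-bounded

        CachingByteService(ByteService delegate, int lookahead) {
            this.delegate = delegate;
            this.lookahead = lookahead;
        }
        public byte[] readPage(long pageNo) {
            byte[] page = cache.get(pageNo);
            if (page == null) {
                // Miss: fetch the requested page plus a few pages of lookahead.
                for (long p = pageNo; p <= pageNo + lookahead; p++) {
                    cache.put(p, delegate.readPage(p));
                }
                page = cache.get(pageNo);
            }
            return page;
        }
    }

    public static void main(String[] args) {
        RemoteByteService remote = new RemoteByteService();
        CachingByteService caching = new CachingByteService(remote, 2);
        caching.readPage(0); // miss: fetches pages 0..2
        caching.readPage(1); // hit, served from cache
        caching.readPage(2); // hit
        System.out.println("remote fetches: " + remote.fetches);
    }
}
```

Three sequential page reads cost one round trip's worth of fetches, which is exactly the hot-path case (lucene indices, binary search) the comment describes.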
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: radfs.tgz Oops. Shows what I get for posting at the end of the day, svn diff misses new files, duh. Here's a tarball of src/contrib/radfs. build.xml should play well with the rest of hadoop.
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737799#action_12737799 ] Jay Booth commented on HDFS-516: Ok, thanks, is one big patch preferred to the tiny patch + tarball?
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: radfs.patch)
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: radfs.patch)
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: radfs.patch