[jira] [Commented] (HDFS-1323) Pool/share file channels for HDFS read
[ https://issues.apache.org/jira/browse/HDFS-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051066#comment-13051066 ]

Jay Booth commented on HDFS-1323:

I could take a crack at it next week/weekend. Will get my build set up and report back in a week or two.

Pool/share file channels for HDFS read

Key: HDFS-1323
URL: https://issues.apache.org/jira/browse/HDFS-1323
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Jay Booth
Attachments: hdfs-1323-20100730.patch, hdfs-1323-trunk.txt

Currently, all reads in HDFS require opening and closing the underlying block/meta filechannels. We could pool these filechannels and save some system calls and other work. Since HDFS read requests can be satisfied by positioned reads and transferTos, we can even share these filechannels between concurrently executing requests. The attached patch was benchmarked as part of work on HDFS-918 and exhibited a 10% performance increase for small random reads. This does not affect client logic and involves minimal change to server logic. Patch is based on branch 20-append.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
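The sharing described above works because positioned reads (`FileChannel.read(ByteBuffer, long)`, a pread(2)-style call) neither use nor update the channel's file position, so one open channel can safely serve concurrent requests. A minimal sketch of the idea; the `BlockChannelPool` name appears in later HDFS-918 comments, but this code is illustrative, not the actual patch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative pool: one shared read-only FileChannel per block file.
// Positioned reads don't touch the channel's own position, so concurrent
// readers on the same channel don't race each other.
public class BlockChannelPool {
    private final Map<Path, FileChannel> channels = new ConcurrentHashMap<>();

    public FileChannel get(Path blockFile) {
        return channels.computeIfAbsent(blockFile, p -> {
            try {
                return FileChannel.open(p, StandardOpenOption.READ);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }

    // Positioned read: safe to call from many threads on the same channel.
    public int read(Path blockFile, ByteBuffer dst, long offset) throws IOException {
        return get(blockFile).read(dst, offset);
    }
}
```

A real pool would also bound its size and evict idle channels (the actual patch exposes cleanup-related config values for this); the sketch omits that.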
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977544#action_12977544 ]

Jay Booth commented on HDFS-918:

Hey all, sorry for the slow response, been swamped with the new year and all.

RE: unit tests, at one point it was passing all tests; not sure if the tests changed or this changed, but I can take a look at it.

RE: 0.23, I can look at forward porting this again, but a lot of changes have gone in since then. @stack, were you testing the pooling-only patch or the full multiplexing patch? Pooling only would be much simpler to forward port, although I do think that the full multiplexing patch is pretty worthwhile. Aside from the small-but-significant performance gain, it was IMO much better factoring to have the DN-side logic all encapsulated in a Connection object which has sendPacket() repeatedly called, rather than a giant procedural loop that goes down and back up through several classes. The architecture also made keepalive pretty straightforward: just throw that connection back into a listening pool when done, and make corresponding changes on the client side. But I guess that logic's been revised now anyways, so it'd be a significant piece of work to bring it all back up to date.
Use single Selector and small thread pool to replace many instances of BlockSender for reads

Key: HDFS-918
URL: https://issues.apache.org/jira/browse/HDFS-918
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Jay Booth
Assignee: Jay Booth
Fix For: 0.22.0
Attachments: hbase-hdfs-benchmarks.ods, hdfs-918-20100201.patch, hdfs-918-20100203.patch, hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, hdfs-918-branch20-append.patch, hdfs-918-branch20.2.patch, hdfs-918-pool.patch, hdfs-918-TRUNK.patch, hdfs-multiplex.patch

Currently, on read requests, the DataXCeiver server allocates a new thread per request, which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage by the sending threads. If we had a single selector and a small threadpool to multiplex request packets, we could theoretically achieve higher performance while taking up fewer resources and leaving more CPU on datanodes available for mapred, hbase or whatever. This can be done without changing any wire protocols.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1323) Pool/share file channels for HDFS read
Pool/share file channels for HDFS read

Key: HDFS-1323
URL: https://issues.apache.org/jira/browse/HDFS-1323
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Reporter: Jay Booth
Fix For: 0.20-append, 0.22.0

Currently, all reads in HDFS require opening and closing the underlying block/meta filechannels. We could pool these filechannels and save some system calls and other work. Since HDFS read requests can be satisfied by positioned reads and transferTos, we can even share these filechannels between concurrently executing requests. The attached patch was benchmarked as part of work on HDFS-918 and exhibited a 10% performance increase for small random reads. This does not affect client logic and involves minimal change to server logic. Patch is based on branch 20-append.
[jira] Updated: (HDFS-1323) Pool/share file channels for HDFS read
[ https://issues.apache.org/jira/browse/HDFS-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-1323:

Attachment: hdfs-1323-20100730.patch
[jira] Commented: (HDFS-1323) Pool/share file channels for HDFS read
[ https://issues.apache.org/jira/browse/HDFS-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894072#action_12894072 ]

Jay Booth commented on HDFS-1323:

Correction: the patch created a 10% performance increase for HBase random GETs. It was probably a larger percentage of the read operation itself, if you don't include other work done by HBase.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hbase-hdfs-benchmarks.ods

Benchmarked on EC2 this weekend. I set up 0.20.2-append clean, a copy with my multiplex patch applied, and a third copy which only ports filechannel pooling to the current architecture (can submit that patch later, it's at home). All runs were with HBase block caching disabled to highlight the difference in filesystem access speeds. This is running across a fairly small dataset (a little less than 1GB), so all files are presumably in memory for the majority of the test duration. Each run involved 6 clients reading 1,000,000 rows each, divided over 10 mappers. Cluster setup was 3x EC2 High-CPU XL: 1 NN/JT/ZK/Master and 2x DN/TT/RS. Ran in 3 batches of 3 runs each; the cluster was restarted between batches for each run type because we're changing the DN implementation.

Topline numbers (the rest are in the document):

Total run averages (ms):

Test        clean        pool         multiplex
random      21159050.44  19448216.89  16806247
scan        436106.89    442452.54    443262.56
sequential  19298239.78  17871047.67  14987028.44

Pool is a 7.5% gain, multiplex is more like 20% for random reads.

Batches 2+3 only (batch 1 was a little messed up and doesn't track with the others):

Test        clean        pool         multiplex
random      20555308.67  18425017     16987643.33
scan        426849       427277.98    448031
sequential  18665323.67  16969885.83  15102404

Pool is a 10% gain, multiplex is 17% or so for random reads.

Per row for random read (batches 2+3 only): clean: 3.42ms, pool: 3.07ms, multiplex: 2.83ms
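The benchmark arithmetic above is internally consistent if the totals are milliseconds summed across 6 clients x 1,000,000 rows = 6,000,000 rows (an inference from the per-row figures, not stated in the comment). A quick sanity check:

```java
// Sanity-check of the random-read benchmark numbers (batches 2+3).
// Assumption (mine, not the comment's): totals are ms over 6,000,000 rows.
public class BenchMath {
    public static void main(String[] args) {
        double rows = 6_000_000;
        double clean = 20555308.67, pool = 18425017, multiplex = 16987643.33;
        System.out.printf("clean ms/row: %.2f%n", clean / rows);                          // ~3.43
        System.out.printf("pool gain: %.1f%%%n", 100 * (clean - pool) / clean);           // ~10.4
        System.out.printf("multiplex gain: %.1f%%%n", 100 * (clean - multiplex) / clean); // ~17.4
    }
}
```

These match the quoted "10% gain" for pooling and "17% or so" for multiplexing, and the ~3.42ms/row figure for clean.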
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-branch20-append.patch

Managed to get back to this. Rebased on branch-20-append. Fixed the resource leak issue that apurtell identified. Runs through HBase PerformanceEvaluation on my workstation completely with the default ulimit of 1024, no crashes. I'm going to try to benchmark this on a real cluster this weekend and report results. Happy Friday, everyone.
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852610#action_12852610 ]

Jay Booth commented on HDFS-918:

Am I ever! This one should be good to go, but I'm on my way out right now and won't be around to help if it breaks again. If you wanna give it a whirl, be my guest :)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: (was: hdfs-918-branch20.2.patch)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-branch20.2.patch

Straightened out the "block not found" issue with Andrew (that was on his end), but then he found a resource leak that's fixed here. I'll post a trunk patch which incorporates this fix and the previous fix shortly.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-TRUNK.patch

Trunk patch with previous fixes.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-branch20.2.patch

Cleaned up a bug in the BlockChannelPool.cleanup() code, added a new unit test, and improved descriptions of the new config values (useMultiplex, packetSize, maxOpenBlockChannels, minOpenBlockChannels (the number to cleanup() down to)). This patch is for branch 20; I'll post a new one against trunk tonight.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: (was: hdfs-200+826+918-branch20.patch)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-200+826+918-branch20.patch

I heard all the cool kids are running HDFS-200 and HDFS-826 on their 0.20.2 installations these days, so I merged HDFS-918 with them. Also, nobody should use the existing 0.20.2 patch; I'll delete it now and post a new one tonight -- it happens to be missing a very important Thread.start() invocation.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: (was: hdfs-918-0.20.2.patch)
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:

Attachment: hdfs-918-0.20.2.patch

0.20.2-compatible patch! A couple of people mentioned that it would be much easier for them to benchmark if I produced an 0.20.2-compatible patch. So here it is: it works, it seems to pass all the unit tests I ran against it, and I even did a hadoop fs -put and hadoop fs -cat. But that's the entire extent of the testing -- unit tests and a super-simple pseudo-distributed operation. So anyone who wants to try this on some I/O-bound jobs on a test 0.20.2 cluster and see if they get speedups, please feel free and report results.
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844596#action_12844596 ]

Jay Booth commented on HDFS-918:

bq. I think it is very important to have separate pools for each partition. Otherwise, each disk will be accessed only as much as the slowest disk (when DN has enough load).

This would be the case if I were using a fixed-size thread pool and a LinkedBlockingQueue -- but I'm not, see Executors.newCachedThreadPool(). It's actually bounded at Integer.MAX_VALUE threads and uses a SynchronousQueue. If a new thread is needed in order to start work on a task immediately, it's created; otherwise, an existing waiting thread is re-used. (Threads are purged if they've been idle for 60 seconds.) Either way, the underlying I/O request is dispatched pretty much immediately after the connection is writable. So I don't see why separate pools per partition would help anything -- the operating system will handle I/O requests as it can and put threads into runnable state as it can, regardless of which pool they're in.

RE: Netty, I'm not very knowledgeable about it beyond the Cliff's Notes version, but my code dealing with the Selector is pretty small -- the main loop is under 75 lines, and java.util.concurrent does most of the heavy lifting. Most of the code is dealing with application and protocol specifics. So my instinct in general is that adding a framework may actually increase the amount of code, especially if there are any mismatches between what we're doing and what it wants us to do (the packet-header, sums data, main data format is pretty specific to us). Plus, as Todd said, we can't really change the blocking I/O nature of the main accept() loop in DataXceiverServer without this becoming a much bigger patch, although I agree that we should go there in general.

That being said, better is better, so if a Netty implementation took up fewer lines of code and performed better, then that speaks for itself.
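The dispatch behavior being claimed (no task ever waits behind another in a queue) follows from how `Executors.newCachedThreadPool()` is built: a `SynchronousQueue` hand-off plus an effectively unbounded thread count. A small demonstration, with names of my own choosing:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Shows that a cached thread pool starts every submitted task immediately,
// creating threads on demand, rather than queueing behind a fixed pool size.
public class PoolDemo {
    // Submit n tasks that all block; return true iff all n run concurrently.
    static boolean allStartImmediately(int n) throws InterruptedException {
        ExecutorService cached = Executors.newCachedThreadPool();
        CountDownLatch started = new CountDownLatch(n);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < n; i++) {
            cached.submit(() -> {
                started.countDown(); // this task is now running on its own thread
                try { release.await(); } catch (InterruptedException ignored) {}
            });
        }
        // With a fixed pool of size < n this would time out, because the
        // first tasks never finish and the rest would sit in the queue.
        boolean ok = started.await(5, TimeUnit.SECONDS);
        release.countDown();
        cached.shutdown();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(allStartImmediately(8)); // prints true
    }
}
```

This is exactly the property the comment relies on: every pending I/O request gets a thread (and hence a syscall) immediately, leaving scheduling across disks to the OS.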
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844658#action_12844658 ]

Jay Booth commented on HDFS-918:

bq. > I think it is very important to have separate pools for each partition.
bq. > This would be the case if I were using a fixed-size thread pool and a LinkedBlockingQueue - but I'm not, see Executors.newCachedThreadPool()
bq. hmm.. does it mean that if you have a thousand clients and the load is disk bound, we end up with 1000 threads?

Yeah, although it'll likely turn out to be less than 1000 in practice. If the requests are all short-lived, it could be significantly less than 1000 threads when you consider re-use; if it's 1000 long reads, it'll probably wind up being only a little less, if at all. The threads themselves are really lightweight -- the only resource attached to each is a ThreadLocal ByteBuffer(8096). (8k seemed ok for the ByteBuffer because the header+checksums portion is always significantly less than that, and the main block file transfers are done using transferTo.)

I chose this approach after initially experimenting with a fixed-size threadpool and LinkedBlockingQueue because the handoff is faster and every pending I/O request is guaranteed to become an actual disk-read syscall waiting on the operating system as fast as possible. This way, the operating system decides which disk request to fulfill first, taking advantage of the lower-level optimizations around disk I/O. Since the threads are pretty lightweight and the lower-level calls do a better job of optimal fulfillment, I think this will work better than a fixed-size threadpool, where, for example, 2 adjacent reads from separate threads could be separated from each other in time, whereas the disk controller might fulfill both simultaneously and faster. This becomes even more important, I think, with the higher 512kb packet size -- those are big chunks of work per syscall that can be optimized by the underlying OS.

Regarding the extra resource allocation for the threads -- if we're disk-bound, then generally speaking a few extra memory resources shouldn't be a huge deal; the gains from dispatching more disk requests in parallel should outweigh the memory allocation and context switch costs. The above is all in theory -- I haven't benchmarked parallel implementations head-to-head. But certainly for random reads, and likely for longer reads, this approach should get the syscall invoked as fast as possible. Switching between the two models would be pretty simple: just change the parameters we pass to the ThreadPoolExecutor constructor.
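The per-thread buffer arrangement described above (one small ByteBuffer per worker, lazily created and reused across tasks) can be sketched with ThreadLocal. The 8096-byte size is the figure quoted in the comment; the class and method names are illustrative, not from the patch:

```java
import java.nio.ByteBuffer;

// Sketch: each worker thread lazily allocates one small buffer for packet
// headers and checksums and reuses it for every task it executes, so an
// idle-purged cached pool carries almost no per-thread memory cost.
public class PacketBuffers {
    // Size from the JIRA comment; headers + checksums always fit well under this.
    private static final int HEADER_AND_SUMS_SIZE = 8096;

    private static final ThreadLocal<ByteBuffer> BUF =
            ThreadLocal.withInitial(() -> ByteBuffer.allocate(HEADER_AND_SUMS_SIZE));

    // Repeated calls from the same thread return the same buffer, cleared.
    public static ByteBuffer get() {
        ByteBuffer b = BUF.get();
        b.clear();
        return b;
    }
}
```

The main block data never passes through this buffer at all; as the comment says, it goes via transferTo(), which hands bytes from the file channel to the socket in the kernel.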
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844174#action_12844174 ] Jay Booth commented on HDFS-918: Yeah, it only uses nonblocking pread ops on the block and block.meta files. It sends the packet header and checksums in one packet (maybe just part of the checksums if the TCP buffer was full), then repeatedly makes requests to send PACKET_LENGTH (default 512kb) bytes until they're sent. When I had some trace logging enabled, I could see the TCP window scale up -- the first request sent 96k in a packet, then it scaled up to 512k per packet after a few. Here's a (simplified) breakdown of the control structures and main loop:

{{DataXCeiverServer}} -- accepts conns, creates thread per conn. From thread: read OP, blocking; if we're a read request and multiplex is enabled, delegate to MultiplexedBlockSender and die; otherwise instantiate DataXCeiver (which now takes op as an arg) and call xceiver.run().

{{MultiplexedBlockSender}} -- maintains ExecutorService, SelectorThread and exposes a public register() method. register(Socket conn) configures nonblocking, sets up a Connection object, dispatches the first packet-send as an optimization, then puts the Future<Connection> in an inbox for the selector thread.

{{SelectorThread}} -- maintains Selector, BlockingQueue<Future<Connection>> inbox, LinkedList<Future<Connection>> processingQueue. Main loop: 1) pull all futures from inbox, add to processingQueue 2) iterate/poll/remove Futures from processingQueue, then close/re-register those that finished sending a packet as appropriate (linear time, but pretty fast) 3) select 4) dispatch selected connections, add their Futures to processingQueue.

{{Connection.sendPacket(BlockChannelPool, ByteBuffer)}} -- workhorse method, invoked via Callable. Maintains a bunch of internal state variables per connection. Fetches the BlockChannel object from the BlockChannelPool -- BlockChannel only exposes p-read methods for the underlying channels. Buffers the packet header and sums, sends, records how much was successfully sent -- if less than 100%, return and wait for writable. Tries to send PACKET_LENGTH bytes from the main file via transferTo; if less than 100%, return and wait for writable. Marks itself as either FINISHED or READY, depending on whether that was the last packet.

Regarding file IO, I don't know if it's faster to send the packet header as its own 13-byte packet and use transferTo for the meta file, or to do what I'm doing now and buffer them into one packet. I feel like it'll be a wash -- or at any rate a minor difference, because the checksums are so much smaller than the main data. What do people think about a test regime for this? It's a really big set of changes, but it opens up a lot of doors (particularly connection re-use with that register() paradigm), seems to perform equal/better depending on the case, gets a big win on open file descriptors and factors all of the server-side protocol logic into one method, instead of spread out across several classes. I certainly understand being hesitant to commit such a big change without some pretty extensive testing, but if anyone has direction as to what they'd like to see tested, that'd be awesome. I'm already planning on setting up some disk-bound benchmarks now that I've tested network-bound ones.. anything else that people want to see? It seems to pass all unit tests; my last run had a couple of seemingly pre-existing failures, but 99% of them passed. I guess I should do another full run and account for any that don't pass while I'm at it. 
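A rough skeleton of the SelectorThread main loop described above might look like the following; Conn, its sendPacket() stand-in, and the executor wiring are simplified placeholders for the patch's classes, not the actual implementation:

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.concurrent.*;

class SelectorLoop implements Runnable {
    static class Conn {
        final SocketChannel channel;
        volatile boolean finished;            // FINISHED vs READY after a packet
        Conn(SocketChannel c) { channel = c; }
        boolean sendPacket() { return finished; }  // stand-in workhorse method
    }

    final Selector selector;
    final ExecutorService workers = Executors.newCachedThreadPool();
    final BlockingQueue<Future<Conn>> inbox = new LinkedBlockingQueue<>();
    final LinkedList<Future<Conn>> processing = new LinkedList<>();

    SelectorLoop() throws IOException { selector = Selector.open(); }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                inbox.drainTo(processing);                        // 1) pull inbox
                Iterator<Future<Conn>> it = processing.iterator();
                while (it.hasNext()) {                            // 2) poll futures
                    Future<Conn> f = it.next();
                    if (!f.isDone()) continue;
                    it.remove();
                    Conn c = f.get();
                    if (c.finished) c.channel.close();            // last packet sent
                    else c.channel.register(selector, SelectionKey.OP_WRITE, c);
                }
                selector.select(100);                             // 3) select
                for (SelectionKey k : selector.selectedKeys()) {  // 4) dispatch
                    k.interestOps(0);               // park key while a worker owns it
                    Conn c = (Conn) k.attachment();
                    processing.add(workers.submit(() -> { c.sendPacket(); return c; }));
                }
                selector.selectedKeys().clear();
            } catch (Exception e) {
                break;                              // simplified error handling
            }
        }
    }
}
```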
[jira] Commented: (HDFS-1034) Enhance datanode to read data and checksum file in parallel
[ https://issues.apache.org/jira/browse/HDFS-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844186#action_12844186 ] Jay Booth commented on HDFS-1034: - Wouldn't this preclude the use of transferTo for the transfer from the main block file? The packet header and sums need to be sent before transferTo is invoked; otherwise things would all be jumbled up together. Enhance datanode to read data and checksum file in parallel --- Key: HDFS-1034 URL: https://issues.apache.org/jira/browse/HDFS-1034 Project: Hadoop HDFS Issue Type: Improvement Reporter: dhruba borthakur Assignee: dhruba borthakur In the current HDFS implementation, a read of a block issued to the datanode results in a disk access to the checksum file followed by a disk access to the data file. It would be nice to be able to do these two IOs in parallel to reduce read latency.
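The ordering constraint behind this objection can be seen in a minimal sketch: the header and checksums must be fully written from a user-space buffer before transferTo streams the block data, because transferTo bypasses user-space entirely. PacketWriter and writePacket are illustrative names, not HDFS code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class PacketWriter {
    /** Write header+sums from a heap buffer, then zero-copy the data range. */
    static long writePacket(ByteBuffer headerAndSums, FileChannel blockFile,
                            long offset, long length, WritableByteChannel out)
            throws IOException {
        while (headerAndSums.hasRemaining()) {
            out.write(headerAndSums);         // ordinary buffered write goes first
        }
        long sent = 0;
        while (sent < length) {               // then zero-copy from the block file
            long n = blockFile.transferTo(offset + sent, length - sent, out);
            if (n <= 0) break;                // would-block: caller waits for writable
            sent += n;
        }
        return sent;
    }
}
```

If checksum reads happened in parallel with the data transfer, the two writes could interleave on the socket, which is the jumbling the comment refers to.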
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844191#action_12844191 ] Jay Booth commented on HDFS-918: I'll enthusiastically cheerlead? :) More seriously, I'm willing to put in the work to get everyone to a comfort level with this patch so it gets committed in some form and we get some wins out of it, but the write path is more complicated, I don't understand it as well, and I'm honestly going to need a bit of a break over the summer. I'd love to help in whatever way I can, but I'm probably not going to have the bandwidth to be the main driver of it in the short term. I don't fully grok the write pipeline and the intricacies of append yet, so at the least someone else would have to be involved.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100309.patch New patch and better benchmarks.

Environment: 8x2GHz, 7GB RAM namenode and DFS client; 8x2GHz, 7GB RAM datanode.

Streaming, single-threaded: 60 runs over a 100MB file, presumed in memory, so the network is the chokepoint.
* Current DFS: 92 MB/s over 60 runs
* Multiplex: 97 MB/s over 60 runs (either random variation, or maybe the larger packet size helps)

Streaming, multi-threaded: 32 threads reading the 100MB file, 60x each.
* Both around 3.25 MB/s/thread, 104 MB/s aggregate -- network saturation.

Random reads: the multiplexed implementation saves about 1.5 ms, probably by avoiding extra file-opens and buffer allocation.
* 5 iterations of 2000 reads each, 32kb, front of file, single-threaded
* Splits for current DFS: 5.3, 4.6, 5.0, 4.4, 6.4
* Splits for multiplex: 3.2, 3.0, 4.6, 3.3, 3.2
* Multithreaded concurrent read speeds on a single host converged with more threads -- probably client-side delay negotiating lots of new TCP connections.

File handle consumption: both rest at 401 open files (mostly jars). When doing random reads across 128 threads, BlockSender spikes to 1150, opening a blockfile, metafile, selector and socket for each concurrent connection. MultiplexedBlockSender only jumps to 530, with just the socket as a per-connection resource; blockfiles, metafiles and the single selector are shared.

I'll post a comment later with an updated description of the patch, and when I get a chance I'll run some more disk-bound benchmarks; I think the asynchronous approach will pay some dividends there by letting the operating system do more of the work. 
Super brief patch notes:
* eliminated silly add'l dependency on commons-math, now has no new dependencies
* incorporated Zlatin's suggestions upthread to do asynchronous I/O, 1 shared selector
* BlockChannelPool is shared across threads
* buffers are threadlocal so they'll tend to be re-used rather than re-allocated
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100228.patch New patch -- took Zlatin's advice and utilized selectionKey.interestOps(0) to avoid busy waits, so we're back to a single selector and an ExecutorService. The ExecutorService reuses threads if possible, destroying threads that haven't been used in 60 seconds. Analyzed logs, and the selector thread doesn't seem to busy-wait ever. Buffers are now stored in threadlocals and allocated per thread (they're now HeapByteBuffers, since we might have some churn and most of our transfer uses transferTo anyway). Still uses the shared BlockChannelPool implemented via ReadWriteLock. I think this will be pretty good, will benchmark tonight.
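A shared, read-write-locked channel pool of the kind mentioned here could look roughly like this sketch -- lookups take the read lock, a miss upgrades to the write lock to open and cache the channel; class and method names are assumptions, not the patch's code. Positional reads and transferTo are what make a cached channel safe to share between concurrent requests:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BlockChannelPool {
    private final Map<String, FileChannel> channels = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    FileChannel get(File blockFile) throws IOException {
        String key = blockFile.getAbsolutePath();
        lock.readLock().lock();                   // common case: already open
        try {
            FileChannel c = channels.get(key);
            if (c != null) return c;
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();                  // miss: open and cache
        try {
            FileChannel c = channels.get(key);    // re-check under write lock
            if (c == null) {
                c = new FileInputStream(blockFile).getChannel();
                channels.put(key, c);
            }
            return c;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

A real pool would also need the cap-and-evict behavior described elsewhere in the thread; this sketch omits it.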
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834252#action_12834252 ] Jay Booth commented on HDFS-918: Yeah, I'll do my best to get benchmarks by the end of the weekend; kind of a crazy week this week, and I moved this past weekend, so I don't have a ton of time. Todd, if you feel like blasting a couple of stream-of-consciousness comments to me via email, go right ahead; otherwise I'll run the benchmarks this weekend and wait for the well-written version :). Zlatin, I originally had a similar architecture to what you're describing, using a BlockingQueue to funnel the actual work to a threadpool, but I had some issues getting the locking quite right: either I wasn't getting things into the queue as fast as possible, or I was burning a lot of empty cycles in the selector thread. Specifically, I can't cancel a SelectionKey and then re-register with the same selector afterwards -- it leads to exceptions -- so my Selector thread was spinning in a tight loop verifying that, yes, all of these writable SelectionKeys are currently enqueued for work, whenever anything was being processed. But that was a couple of iterations ago; maybe I'll have better luck trying it now. What we really need is a libevent-like framework; I'll spend a little time reviewing the outward-facing API for that and scratching my noggin. Ultimately, only so much I/O can actually happen at a time before the disk is swamped, so it might be that a set of, say, 32 selector threads gets the same performance as 1024 threads. In that case, we'd be taking up fewer resources for the same performance. At any rate, I need to benchmark before speculating further. 
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834321#action_12834321 ] Jay Booth commented on HDFS-918: Thanks Zlatin, I think you're right. I'll look at finding a way to remove writable interest without cancelling the key -- that could fix the busy-looping issue. Then I could use a condition to ensure wakeup when something is newly writable-interested (via completed packet or new request) and refactor back to a single selector thread and several executing threads. I'll make a copy of the patch and try benchmarking both methods.
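The idiom being proposed -- keep the key but zero its interest set while a worker owns the connection, instead of cancelling and re-registering -- can be sketched in a few lines (helper names are illustrative):

```java
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class InterestToggle {
    // Park: the key stays registered, but select() no longer reports it,
    // avoiding both the busy loop and the cancelled-key re-register exceptions.
    static void park(SelectionKey key) {
        key.interestOps(0);
    }

    // Resume: restore writable interest and kick a select() that may be blocked.
    static void resume(SelectionKey key, Selector selector) {
        key.interestOps(SelectionKey.OP_WRITE);
        selector.wakeup();
    }
}
```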
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100203.patch New patch. Streamlined MultiplexedBlockSender; we now have one selector per worker thread and no BlockingQueues -- writable connections are handled inline by each thread as they become available. Includes a utility class to read a file with a bunch of threads and time them. Ran some ad hoc jobs on my laptop and got similar performance to the existing BlockSender, slightly faster for a single file and slightly slower for 15 competing localhost threads.. which is exactly the opposite of what I boldly predicted. I read somewhere that linux thread scheduling for Java is disabled because it requires root, so it ignores priority -- if that's the case, maybe running more threads is actually an advantage when all the readers are local and you're directly competing with them for CPU: you compete more effectively for limited resources with more threads. I'm gonna try and write an MR job to run some different scenarios on a cluster soon (thundering herd, steady medium, large number of idles, individual read).. I think the architecture here is more suited to large numbers of connections, so if it did ok under a small number, then great. I'll be pretty busy for the next month or so but will try to get this running in a cluster at some point and report some more interesting numbers. 
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-918-20100201.patch New patch.. * new configuration params: dfs.datanode.multiplexBlockSender=true, dfs.datanode.multiplex.packetSize=32k, dfs.datanode.multiplex.numWorkers=3 * Packet size is tuneable, possibly allowing better performance with larger TCP buffers enabled * Workers only wake up when a connection is writable * 3 new class files, minor changes to DataXceiverServer and DataXceiver, 2 utility classes added to DataTransferProtocol (one stolen from HDFS-881) * Passes tests from earlier comment plus a new one for files with lengths that don't match up to checksum chunk size, as well as holding up to some load on TestDFSIO * Still fails all tests relying on SimulatedFSDataset * Has a large amount of TRACE level debugging going on in MultiplexedBlockSender in case anybody wants to watch the output * Adds dependencies for commons-pool and commons-math (for benchmarking code) * Doesn't yet have benchmarks, but those should be easy now that the configuration is all in place
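For reference, the parameters listed above would presumably be set in hdfs-site.xml in the standard Hadoop property format; the property names come from the comment, but the exact value encodings (e.g. bytes for packet size) are assumptions:

```xml
<!-- hdfs-site.xml fragment (sketch) enabling the multiplexed sender -->
<property>
  <name>dfs.datanode.multiplexBlockSender</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.multiplex.packetSize</name>
  <value>32768</value> <!-- 32k default per the patch notes -->
</property>
<property>
  <name>dfs.datanode.multiplex.numWorkers</name>
  <value>3</value>
</property>
```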
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828009#action_12828009 ] Jay Booth commented on HDFS-918: I haven't had a chance to run benchmarks yet, but I think that under lots of connections, the thread-per-connection model will spend more time swapping compared to getting work done; plus it has a few places where it hot-blocks by doing while (buff.hasRemaining()) { write() }. Only selecting the currently writeable connections and scheduling them sidesteps both issues while being less of a resource footprint - assuming it delivers on the performance. As soon as I get a chance, I'll write some benchmarks. If anyone wants to take a look at the code in the meantime, I think this patch is pretty easy to set up -- just enable MultiplexBlockSender.LOG for TRACE and run tests, and you can see how each packet is built and sent. 'ant compile eclipse-files' will set up the extra dependencies on commons-pool and commons-math.
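The hot-blocking pattern called out here, versus the multiplexed alternative, might be contrasted like this (a sketch; class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

class WriteStrategies {
    // Thread-per-connection style: spins (burning a thread) while the
    // TCP buffer is full -- write() may return 0 repeatedly.
    static void blockingDrain(WritableByteChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }

    // Multiplexed style: one attempt, report progress; if the buffer
    // didn't drain, the selector re-schedules us on the next OP_WRITE.
    static boolean writeOnce(WritableByteChannel ch, ByteBuffer buf) throws IOException {
        ch.write(buf);
        return !buf.hasRemaining();   // false => wait for writable, retry later
    }
}
```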
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806460#action_12806460 ] Jay Booth commented on HDFS-918: I'll definitely perform lots of benchmarking before asking for a more formal review for commit. To be clear, this patch by itself doesn't open up any additional ports/services and attempts minimal change to existing code -- the only major change is in opRead on DataXCeiver, replacing creation/execution of a BlockSender with a quick call to register() on my new readserver class (which should probably be renamed to MultiplexedBlockSender or something, since it's not actually a server and doesn't expose any new services). The other changes are just the addition of 3 new classes. My newest patch is still at home and I'll upload it this weekend when I have some time; the current patch here doesn't completely work. Right now I expect this to yield some pretty decent savings in CPU time, but maybe not in wall-clock time. I'm thinking about 2 approaches: running the datanode with time while performing a bunch of IO tasks and checking CPU usage afterwards, and measuring total throughput in TestDFSIO on an oversubscribed cluster so that we get CPU-bound and can demonstrate improvement that way. Any preference? Other suggestions? I'm a little worried that simply timing the datanode will measure things like how long I waited after start to launch a job, or background replication.. but I suppose those problems exist with any measurement. Also, I'm planning to add a parameter to configure usage of BlockSender vs MultiplexedBlockSender -- does dfs.datanode.multiplexReads=true sound good? 
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806476#action_12806476 ] Jay Booth commented on HDFS-918: Good call Todd, that's definitely the canonical case for multiplexing, plus it should show some benefit because I recycle file handles across requests by pooling them, rather than opening a new file per request. I'll see if I can get something along those lines set up.
[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804561#action_12804561 ] Jay Booth commented on HDFS-918: Current patch is having some issues in terms of actual use -- seems to pass a lot of tests, but I'm having problems with ChecksumExceptions running actual MR jobs over it, so clearly it's a work in progress still :) I have a better patch that fixes some of the issues but it's still not 100%, so I'll upload with some new tests once I resolve the remaining issues. Currently passing TestPread, TestDistributedFileSystem, TestDataTransferProtocol and TestClientBlockVerification, but still getting issues when actually running the thing -- if somebody has recommendations for other tests to debug with, that'd be very welcome.

ARCHITECTURE
1) On client connect to the DataXCeiver server, dispatch a thread as per usual, except now the thread is extremely short-lived. It simply registers the connection with server.datanode.ReadServer and dies. (It would be more efficient to have some sort of accept loop that didn't spawn a thread here, but I went with lowest-impact integration.)
2) On register of a connection, ReadServer creates a Connection object and registers the channel with a selector inside of ReadServer.ResponseWriter. ResponseWriter maintains an ArrayBlockingQueue<Connection> workQueue and polls the selector, cancelling keys and adding connections which are ready for write to the work queue. ReadServer also maintains a BlockChannelPool, which is a pool of BlockChannel objects -- each BlockChannel represents the file and meta-file for a given block.
3) A small Handler pool takes items off of this work queue and calls connection.sendPacket(buffer, channelPool). Each handler maintains a DirectByteBuffer, instantiated at startup time, which it uses for all requests. 
4) Connection.sendPacket(buffer, channelPool) consults internal state about what needs to be sent next (response headers, packet headers, checksums, bytes) and sends what it can, updating internal state variables. Uses the provided buffer and channelPool to do its work. Uses transferTo unless the config property for transferTo is disabled. Right now it actually sends 2 packets per packet (header+sums and then bytes), once I resolve all correctness bugs it may be worth combining the two into one packet for small reads. 5) After work has been done and internal state updated (even if only 1 byte was sent), Handler re-registers the Connection with ResponseWriter for further writes, or closes it if we're done. Once I have this fully working, I'd expect CPU savings from fewer long-running threads and less garbage collection of buffers, perhaps a small performance boost from the select-based architecture and using DirectByteBuffers instead of HeapByteBuffers, and a slight reduction in IOWAIT time under some circumstances because we're pooling file channels rather than re-opening for every request. It should also consume far fewer xceiver threads and open file handles while running -- the pool is capped, so if we start getting crazy numbers of requests, we'll close/re-open files as necessary to stay under the cap. INTEGRATION As I said, I made the DataXCeiver thread for opReadBlock register the channel and then die. This is probably the best way to go even though it's not optimal from a performance standpoint. Unfortunately, since DataXCeiver threads close their sockets when they die, I had to put a special boolean case 'skipClose' to avoid that if op == Op.READ, which is kind of ugly -- recommendations are welcome for what to do here. Also, as I noted earlier, the BlockChannelPool requires an instance of FSDataset to function, rather than FSDatasetInterface, because Interface doesn't supply any getFile() methods, just getInputStream() methods. 
Probably the best way to handle this for tests would be to have SimulatedFSDataset write actual files somewhere under /tmp and provide handles to those files when running tests. Any thoughts from anyone?

FUTURE

Once I get this working, it might be worth exploring using this as a mechanism for repeated reads over the same datanode connection, which could give some pretty big performance gains to certain applications. Upon completion of a request, the ReadServer could simply put the connection in a pool that's polled every so often for isReadable() -- if it's readable, read the request and re-register with the ResponseWriter. Along those lines, once we get there it might wind up being simpler to do the initial read-request through that mechanism as well, which would mean we could get rid of some of the messy integration with DataXceiverServer -- however, it would require opening up another port just for reads. What are people's thoughts on that? I won't make that a goal of this patch but I'd be
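The capped BlockChannelPool behavior mentioned above (close/re-open files to stay under the cap) can be sketched as an access-ordered LRU map. This is an assumption-laden illustration, not the pool from the patch; String stands in for a block id and for an open channel pair:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a capped, LRU-evicting channel pool.
public class ChannelPoolSketch {
    static class LruPool<K, V> extends LinkedHashMap<K, V> {
        private final int cap;
        LruPool(int cap) {
            super(16, 0.75f, true); // access-order: get() refreshes an entry's recency
            this.cap = cap;
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // In a real pool this is where the eldest block's channels get closed.
            return size() > cap;
        }
    }

    public static void main(String[] args) {
        LruPool<String, String> pool = new LruPool<>(2);
        pool.put("blk_1", "channels-1");
        pool.put("blk_2", "channels-2");
        pool.get("blk_1");                // touch blk_1, so blk_2 becomes eldest
        pool.put("blk_3", "channels-3");  // over the cap: blk_2 is evicted
        System.out.println(pool.keySet());
    }
}
```

Under load beyond the cap, the pool degrades to the old open-per-request behavior rather than exhausting file handles.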
[jira] Created: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
Use single Selector and small thread pool to replace many instances of BlockSender for reads Key: HDFS-918 URL: https://issues.apache.org/jira/browse/HDFS-918 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Reporter: Jay Booth Fix For: 0.22.0 Currently, on read requests, the DataXCeiver server allocates a new thread per request, which must allocate its own buffers and leads to higher-than-optimal CPU and memory usage by the sending threads. If we had a single selector and a small threadpool to multiplex request packets, we could theoretically achieve higher performance while taking up fewer resources and leaving more CPU on datanodes available for mapred, hbase or whatever. This can be done without changing any wire protocols. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads
[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-918: --- Attachment: hdfs-multiplex.patch Here's a first implementation -- it works, passes TestDistributedFileSystem, TestDataTransferProtocol and TestPread. However, it has a direct dependency on FSDataset (not FSDatasetInterface) because it needs to get ahold of files directly to open FileChannels. This leads to ClassCastExceptions in all tests relying on SimulatedFSDataset. Would love to hear feedback about a way to resolve this. Have not benchmarked yet, I'll post another comment with an architectural description.
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Resolution: Won't Fix Status: Resolved (was: Patch Available) This work didn't really go anywhere. I tried multithreading on the client side to make a request-per-packet approach viable, but it wound up being even slower than single-threaded request-per-packet. If there's interest in the client-side byte cache, I could create another issue with just that improvement. Low Latency distributed reads - Key: HDFS-516 URL: https://issues.apache.org/jira/browse/HDFS-516 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jay Booth Priority: Minor Attachments: hdfs-516-20090912.patch, radfs.odp Original Estimate: 168h Remaining Estimate: 168h I created a method for low latency random reads using NIO on the server side and simulated OS paging with LRU caching and lookahead on the client side. Some applications could include lucene searching (term-doc and doc-offset mappings are likely to be in local cache, thus much faster than nutch's current FsDirectory impl) and binary search through record files (bytes at 1/2, 1/4, 1/8 marks are likely to be cached).
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755823#action_12755823 ] Jay Booth commented on HDFS-516: Yeah, I was puzzled by the performance too. I dug through the DFS code and I'm saving a bit on new socket and object creation, maybe a couple instructions here and there, but that shouldn't add up to 100 seconds for a gigabyte (approx 20 blocks). I'm calling read() a bajillion times in a row so it's conceivable (although unlikely) that I'm pegging the CPU and that's the limiting factor. I'm busy for a couple days but will get back to you with some figures from netstat, top and whatever else I can think of, along with another streaming case that works with read(b, off, len) to see if that changes things. I'll do a little more digging into DFS as well to see if I can isolate the cause. I definitely did run them several times on the same machine and another time on a different cluster with similar results, so it wasn't simply bad luck on the rack placement on EC2 (well maybe but unlikely). Will report back when I have more numbers. After I get those, my roadmap for this is to add checksum support and better DatanodeInfo caching. User groups would come after that.
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755120#action_12755120 ] Jay Booth commented on HDFS-516: Hey Todd, in short, I agree, we should be looking at moving performance improvements over to the main FS implementation. Right now, my version doesn't support user permissions or checksumming. I'd say it makes sense to keep it in contrib as a sandbox for now, and work towards full compatibility with the main DFS implementation, at which point we could consider swapping in the new reading subsystem? User permissioning would require some model changes but should be workable; checksumming probably won't be too bad if I read the code right. So, I suppose keep it in contrib as a sandbox initially with an explicit goal of moving it over to DFS when it reaches compatibility? It doesn't really lend itself to moving over piecemeal, as it has several components which all pretty much need each other. However, it's pretty well integrated with the DFS API and only replaces one method on the filesystem class.
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: hdfs-516-20090831.patch)
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: radfs.patch)
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12750246#action_12750246 ] Jay Booth commented on HDFS-516: I did some benchmarking, here are the results. Each test ran 1000 searches to warm, then 5000 searches to benchmark: binary search of a 20GB sorted sequence file of 20 million 1kb records. Tests were run from the namenode in a 4-node EC2 medium cluster, 1.7 GB of ram each: 1 namenode and 3 datanodes. From HDFS to a 512MB-cached RadFS there was a 4X average improvement in search times, from 102ms to 24ms. Each search was, theoretically, 24.25 reads (log 2 of 20 million) -- not actually measured. I only ran each set once. The 90th percentile line trends the right way, although the max line is a little spikey. I'll add a 99th percentile in future benchmarks.

HDFS, baseline (org.apache.hadoop.hdfs.DistributedFileSystem):
Mean: 102.178415 Variance: 5939.660105461091 Median: 97.0 Max: 3095.0 Min: 33.0 90th pct: 130.0

Rad, no cache (org.apache.hadoop.hdfs.rad.RadFileSystem):
Mean: 68.556402 Variance: 233.8335857571515 Median: 67.0 Max: 379.0 Min: 26.0 90th pct: 79.0

Rad, 16MB cache:
Mean: 42.0397985 Variance: 237.83818359671966 Median: 40.0 Max: 203.0 Min: 5.0 90th pct: 59.0

Rad, 128MB cache:
Mean: 29.8506007 Variance: 202.08189601920367 Median: 27.0 Max: 203.0 Min: 1.0 90th pct: 45.0

Rad, 512MB cache:
Mean: 24.2746014 Variance: 250.3052558911758 Median: 22.0 Max: 687.0 Min: 0.0 90th pct: 36.0

I could still shave a point or two by cleaning up my caching system to be more graceful with its lookahead mechanism, but not bad for now. I'll pretty it up and post a first attempt at a final patch soon.
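The "24.25 reads per search" figure above is just log2 of the record count -- a binary search over N sorted records touches about log2(N) of them. A quick check:

```java
// Sanity-check the expected-reads-per-search figure quoted above.
public class ExpectedReads {
    public static void main(String[] args) {
        double n = 20_000_000d; // 20 million 1kb records
        double expectedReads = Math.log(n) / Math.log(2); // log2(n)
        System.out.println("log2(20 million) = " + Math.round(expectedReads * 100) / 100.0);
    }
}
```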
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: hdfs-516-20090831.patch New patch. The IPC server was too slow for IO operations (like 40 times slower than DFS without caching), so I wrote a custom ByteServer that's streamlined to avoid object creation and byte copying whenever possible and defaults to TCP nodelay. Client connections are pooled using commons-pool. Uses static methods in hdfs.rad.ByteServiceProtocol for all serialization, faster than reflection. On the laptop in pseudodistributed mode, I'm seeing 5X faster than DFS for random searches. Refactored a bunch on the client side and eliminated a few redundant classes; still need to make lookahead happen via a separate thread in the caching byte service and tweak a couple things in ByteServer for performance, then this thing will be pretty fast. I'm gonna run some numbers on EC2 tonight/tomorrow and see what I come up with. Also cleaned up unit tests to JUnit 4 and added some javadoc -- probably missed a bunch of places and could certainly expand on all of it. Haven't added the license header to every file yet; license explicitly granted here, will get to that for the next patch.
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: hdfs-516-20090824.patch Changed caching system to ehcache, which gives us a configurable API with better caching options. Fixed all known correctness bugs and added lots of test coverage. Got SequenceFileSearcher working, but it depends on HADOOP-6196 -- I could fold the searcher code and searcher test into HADOOP-6196 if people are interested in the functionality. Added benchmarking utils: .sh scripts intended to be run from HADOOP_HOME/bin. Note, if anyone installs, run ant bin-package from project root before trying to run radfs unit tests. TODO: benchmark, javadoc, pretty up codebase, change unit tests to JUnit 4 style, add filesystem contract unit test. I'll get a fork going on github soon if anyone's interested in trying it out.
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738651#action_12738651 ] Jay Booth commented on HDFS-516: Wow, thanks Raghu, that's awesome and will save me a ton of time. A couple points for discussion:

* The random 4k byte grabber is awesome and I will be using it as part of my benchmarking at the first opportunity. However, I think it's worth also testing some likely applications to really show the strength of client-side caching. 10MB or so of properly warmed cache could mean your first 20 lookups in a binary search are almost free, and having the frontmost 10% of a lucene index in cache will mean that almost all of the scoring portion of the search is computed against local memory. Meanwhile, for truly random reads, having a cache that's, say, 5-10% of the size of the data will only get you a small improvement. So I'd like to get some numbers for use cases that really thrive on caching in addition to truly random access. But that will be extremely useful for tuning the IO layer and establishing a baseline for cache-miss performance, so thanks for the heads up.

* I have a feeling that my implementation is significantly slower than the default when it comes to streaming, since it relies on successive small positioned reads and a heavy memory footprint rather than a simple stream of bytes. Watching my unit tests run on my laptop with a ton of confounding factors sure seemed that way, although that's not a scientific measurement (one more item to benchmark). So while I agree with the urge for simplicity, I feel like we need to make that performance tradeoff clear. Otherwise, we could have a lot of very slow mapreduce jobs happening. Given that MapReduce is the primary use case for Hadoop, my instinct was to make RadFileSystem a non-default implementation.
Point very well taken about the BlockLocations and CRC verification. Maybe the best way to handle future integration with DataNode would be to develop separately, reuse as much code as possible, and then when RadFileSystem is mature and benchmarked we can revisit a merge with DistributedFileSystem? Thanks again, I'll try and write a post later tonight with an explicit plan for benchmarking and then maybe people can comment and poke holes in it as they see fit?
[jira] Created: (HDFS-516) Low Latency distributed reads
Low Latency distributed reads - Key: HDFS-516 URL: https://issues.apache.org/jira/browse/HDFS-516 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jay Booth Priority: Minor I created a method for low latency random reads using NIO on the server side and simulated OS paging with LRU caching and lookahead on the client side. Some applications could include lucene searching (term-doc and doc-offset mappings are likely to be in local cache, thus much faster than nutch's current FsDirectory impl) and binary search through record files (bytes at 1/2, 1/4, 1/8 marks are likely to be cached).
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Status: Patch Available (was: Open) Here's the initial patch. I have some pretty decent byte-consistency and integration tests wrapping it but no actual applications built using it. I started on a SequenceFileSearcher that implemented binary search, but dealing with the sync points was sticky so it's not working yet (and commented out, with associated failing test). I'm hoping to get that working in the next couple weeks when I have some time, as an example, along with doing a performance comparison between nutch FsDirectory using DistributedFileSystem and RadFileSystem to see if this shows the gains that I'm thinking it will. My only change to core HDFS was to add getFile() to FSDatasetInterface so that the RadNode (plugged into datanode via ServicePlugin) can open FileChannels.
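Why getFile() matters: with an actual File in hand you can open a FileChannel and serve concurrent requests with positioned reads, which never touch the channel's shared position, so one open channel can be safely shared. A minimal illustration against a temp file (not code from the patch):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demonstrates positioned reads on a shared FileChannel, as a block server might use.
public class PositionedReadSketch {
    public static void main(String[] args) throws IOException {
        Path blockFile = Files.createTempFile("blk_", ".data");
        try {
            Files.write(blockFile, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(blockFile, StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate(4);
                // Read 4 bytes starting at offset 3; the channel's position is unchanged,
                // so other requests can read the same channel concurrently.
                ch.read(buf, 3);
                System.out.println(new String(buf.array(), StandardCharsets.US_ASCII));
            }
        } finally {
            Files.deleteIfExists(blockFile);
        }
    }
}
```

A stream-based API (getInputStream()) hides the file, which is why the patch needs the concrete FSDataset.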
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: radfs.patch
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737713#action_12737713 ] Jay Booth commented on HDFS-516: Here's some architectural overview and a general request for comments on the matter. I'll be away and busy the next few days but should be able to get back to this in the middle of next week. The basic workflow: I created a RadFileSystem (RandomAccessDistributed FS) which wraps DistributedFileSystem and delegates to it for everything except getFSDataInputStream. That returns a custom FSDataInputStream which wraps a CachingByteService, which itself wraps a RadFSByteService. The caching byte services share a cache which is managed by the RadFSClient class (could maybe factor that away and put it in RadFileSystem instead). They try to hit the cache, and if they miss, they call the underlying RadFSClientByteService to get the requested page plus a few pages of lookahead. The RadFSClientByteService calls the namenode to get the appropriate block locations (TODO: cache these effectively) and then calls the RadNode, which is embedded in the DataNode via ServicePlugin and maintains an IPC server and a set of FileChannels to the local blocks. On repeated requests for the same data, the RadFSClient tends to favor going to the same host, figuring that the benefit of hitting the DataNode's OS cache for the given bytes outweighs the penalty of hopping a rack in terms of reducing latency (untested assumption). The intended use case is pretty different from MapReduce, so I think this should be a contrib module that has to be explicitly invoked by clients. It really underperforms DFS in terms of streaming but should (haven't tested extensively outside of localhost) significantly outperform it in terms of random reads.
In terms of files with 'hot paths', such as lucene indices or binary search over a normal file, cache hit percentage is likely to be pretty high, so it should probably perform pretty well. Currently, it makes a fresh request to the NameNode for every read, which is inefficient but more likely to be correct. Going forward, I'd like to tighten this up, make sure it plays nice with append and get it into a future Hadoop release.
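The decorator chain described above (CachingByteService wrapping a remote byte service, with page-granular caching and lookahead) can be sketched as follows. All names here are simplified stand-ins for the classes in the patch, and the "remote" service just fabricates pages:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the caching decorator over a remote page-read service.
public class CachingByteServiceSketch {
    interface ByteService {
        byte[] readPage(long pageNo);
    }

    // Stand-in for the client byte service that fetches a page from a datanode.
    static class RemoteByteService implements ByteService {
        int fetches = 0;
        public byte[] readPage(long pageNo) {
            fetches++;
            return new byte[4096]; // pretend page contents
        }
    }

    static class CachingByteService implements ByteService {
        private final ByteService delegate;
        private final int lookahead;
        private final Map<Long, byte[]> cache = new HashMap<>(); // a real impl would be LRU-bounded

        CachingByteService(ByteService delegate, int lookahead) {
            this.delegate = delegate;
            this.lookahead = lookahead;
        }
        public byte[] readPage(long pageNo) {
            byte[] page = cache.get(pageNo);
            if (page == null) {
                // Miss: fetch the requested page plus a few pages of lookahead.
                for (long p = pageNo; p <= pageNo + lookahead; p++) {
                    cache.put(p, delegate.readPage(p));
                }
                page = cache.get(pageNo);
            }
            return page;
        }
    }

    public static void main(String[] args) {
        RemoteByteService remote = new RemoteByteService();
        CachingByteService caching = new CachingByteService(remote, 2);
        caching.readPage(0); // miss: fetches pages 0..2
        caching.readPage(1); // hit, served from cache
        caching.readPage(2); // hit
        System.out.println("remote fetches: " + remote.fetches);
    }
}
```

Three sequential page reads cost one round trip's worth of fetches, which is exactly the hot-path case (lucene indices, binary search) the comment describes.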
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: radfs.tgz Oops. Shows what I get for posting at the end of the day, svn diff misses new files, duh. Here's a tarball of src/contrib/radfs. build.xml should play well with the rest of hadoop.
[jira] Commented: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737799#action_12737799 ] Jay Booth commented on HDFS-516: Ok, thanks, is one big patch preferred to the tiny patch + tarball?
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: radfs.patch)
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: (was: radfs.patch)
[jira] Updated: (HDFS-516) Low Latency distributed reads
[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Booth updated HDFS-516: --- Attachment: radfs.patch