[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-16 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756214#action_12756214
 ] 

Raghu Angadi commented on HDFS-516:
---


When you get a chance, please point me to the streaming test/benchmark.

bq. After I get those, my roadmap for this is to add checksum support and 
better DatanodeInfo caching. User groups would come after that.

Unless you want to add checksums for a better comparison, I don't think it is 
essential. You need not spend much time on getting feature parity with HDFS.

For more users to benefit from your work, I think it is better to extract the 
features that are complementary to HDFS, and we can work on getting those into 
HDFS.

 Low Latency distributed reads
 -

 Key: HDFS-516
 URL: https://issues.apache.org/jira/browse/HDFS-516
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Jay Booth
Priority: Minor
 Attachments: hdfs-516-20090912.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 I created a method for low latency random reads using NIO on the server side 
 and simulated OS paging with LRU caching and lookahead on the client side. 
 Some applications could include Lucene searching (term-doc and doc-offset 
 mappings are likely to be in the local cache, thus much faster than Nutch's 
 current FsDirectory impl) and binary search through record files (bytes at 
 the 1/2, 1/4, 1/8 marks are likely to be cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-15 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755627#action_12755627
 ] 

Raghu Angadi commented on HDFS-516:
---


bq. somehow, from 213 seconds to 112 seconds to stream 1GB from a remote HDFS 
file.

This works out to about 5 MB/s for HDFS and 9 MB/s for RadFS. Assuming 9 MB/s is 
roughly the 100 Mbps network limit (is it?), 5 MB/s is too low for any FS. Since 
both reads are from the same physical files, this may not be hardware related. 
Could you check what is causing this delay? It might be affecting the other 
benchmarks as well. Checking netstat on the client while this read is going on 
might help.
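
For reference, the arithmetic behind those figures (taking 1 GB as 1024 MB):

    1024 MB / 213 s  =  ~4.8 MB/s
    1024 MB / 112 s  =  ~9.1 MB/s
    100 Mbps link    =  12.5 MB/s theoretical, on the order of 11 MB/s in practice

So ~9 MB/s is consistent with saturating a 100 Mbps link, while ~5 MB/s points to 
a bottleneck other than raw bandwidth.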

Regarding reads in RadFS, does the client fetch 32KB each time (a single RPC) or 
does it pipeline (multiple outstanding requests for a single client's stream)?

@Todd, I essentially see this as a POC of what could/should be improved in HDFS 
to address latency issues. Contrib makes sense, but I would not expect this to 
go to production in this form, and it should be marked 'Experimental'. The 
benchmarks also help greatly in setting priorities for features. I don't think 
this needs a branch since it does not touch core at all.




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-15 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755640#action_12755640
 ] 

Todd Lipcon commented on HDFS-516:
--

bq. I essentially see this as a POC of what could/should be improved in HDFS to 
address latency issues. Contrib makes sense, but I would not expect this to go 
to production in this form, and it should be marked 'Experimental'.

+1




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-15 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755823#action_12755823
 ] 

Jay Booth commented on HDFS-516:


Yeah, I was puzzled by the performance too. I dug through the DFS code and I'm 
saving a bit on new socket and object creation, maybe a couple of instructions 
here and there, but that shouldn't add up to 100 seconds for a gigabyte 
(approximately 20 blocks). I'm calling read() a bajillion times in a row, so 
it's conceivable (although unlikely) that I'm pegging the CPU and that's the 
limiting factor.

I'm busy for a couple of days but will get back to you with some figures from 
netstat, top, and whatever else I can think of, along with another streaming 
case that works with read(b, off, len) to see if that changes things. I'll do 
a little more digging into DFS as well to see if I can isolate the cause. I 
definitely did run them several times on the same machine, and another time on 
a different cluster with similar results, so it wasn't simply bad luck with 
rack placement on EC2 (well, maybe, but unlikely).
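
For clarity, the two streaming call patterns being compared look roughly like 
this; the buffer size is an arbitrary placeholder and the class is made up for 
the example, not taken from the patch or the benchmark:

    // Rough sketch of the two read patterns; illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);           // whatever file is being streamed

        // Pattern 1: byte-at-a-time read() -- one call per byte, so per-call
        // overhead dominates and the CPU can become the bottleneck.
        FSDataInputStream in = fs.open(p);
        long total = 0;
        while (in.read() != -1) {
          total++;
        }
        in.close();

        // Pattern 2: bulk read(b, off, len) -- per-call overhead is amortized
        // over a whole buffer (64KB here, chosen arbitrarily).
        byte[] buf = new byte[64 * 1024];
        in = fs.open(p);
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
          total += n;
        }
        in.close();
        System.out.println("bytes streamed (both passes): " + total);
      }
    }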

Will report back when I have more numbers.  After I get those, my roadmap for 
this is to add checksum support and better DatanodeInfo caching.  User groups 
would come after that.




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755048#action_12755048
 ] 

Todd Lipcon commented on HDFS-516:
--

I haven't had a chance to look over the patch as of yet, but I have one concern:

Is there a plan for deprecation in the event that HDFS itself achieves similar 
performance? I think having an entirely separate FS implementation that only 
differs in performance is not a good idea long term. Using this contrib project 
as an experimentation ground sounds great, but I think long term we should 
focus on improving DistributedFileSystem's performance itself, and not 
bifurcate the code into a fast version that we don't really support because 
it's contrib and a slow version that we do.

I'll try to find a chance to look over the patch soon, but in the meantime do 
you have any thoughts on the above?




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-14 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755033#action_12755033
 ] 

Raghu Angadi commented on HDFS-516:
---

Hi Jay,

I will go through the patch; I hope a few others get a chance to look at it as 
well.

Since it is in contrib, it is certainly easier to include in trunk. I am not 
sure about the 0.21 timeline.




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-14 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755120#action_12755120
 ] 

Jay Booth commented on HDFS-516:


Hey Todd, in short, I agree: we should be looking at moving performance 
improvements over to the main FS implementation. Right now, my version doesn't 
support user permissions or checksumming. I'd say it makes sense to keep it in 
contrib as a sandbox for now and work towards full compatibility with the main 
DFS implementation, at which point we could consider swapping in the new 
reading subsystem. User permissioning would require some model changes but 
should be workable, and checksumming probably won't be too bad if I read the 
code right.

So, I suppose: keep it in contrib as a sandbox initially, with an explicit goal 
of moving it over to DFS when it reaches compatibility? It doesn't really lend 
itself to moving over piecemeal, as it has several components which all pretty 
much need each other. However, it's pretty well integrated with the DFS API 
and only replaces one method on the filesystem class.




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755121#action_12755121
 ] 

Todd Lipcon commented on HDFS-516:
--

Jay: that sounds good to me. Let's explicitly mark the API as "Experimental" 
and "likely to disappear in a future release with little or no warning; use at 
your own risk" :) I think especially as security and authentication begin to 
take form in the next couple of months, it will be a headache to try to 
maintain this code.

If branching in SVN weren't such a pain, I'd suggest we maintain this in a 
separate branch that didn't go into releases... c'est la vie ;-)




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-09-01 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750246#action_12750246
 ] 

Jay Booth commented on HDFS-516:


I did some benchmarking; here are the results.

Each test ran 1000 searches to warm the cache, then 5000 searches to benchmark: 
a binary search of a 20GB sorted sequence file of 20 million 1KB records. 
Tests were run from the namenode of a 4-node EC2 medium cluster (1.7 GB of RAM 
each): 1 namenode and 3 datanodes.

From HDFS to RadFS with a 512MB cache there was a 4x average improvement in 
search times, from 102ms to 24ms.
Each search was, theoretically, 24.25 reads (log2 of 20 million); not actually 
measured.
I only ran each set once. The 90th percentile trends the right way, although 
the max is a little spiky. I'll add a 99th percentile in future benchmarks.
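
For context, each search has roughly the following shape: a simplified sketch 
with a hypothetical fixed-width key and a made-up class name, not the benchmark 
code from the attached patch (which reads a sorted sequence file):

    // Simplified sketch of one binary search over a sorted file of fixed-size
    // records via positioned reads. Illustrative only; the record size matches
    // the benchmark (1KB), but the key width and names are assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BinarySearchSketch {
      static final int RECORD_SIZE = 1024;   // 1KB records, as in the benchmark
      static final int KEY_SIZE = 16;        // hypothetical fixed-width key prefix

      // ~log2(20 million) ~= 24 positioned reads per search
      static boolean search(FSDataInputStream in, long numRecords, byte[] target)
          throws Exception {
        byte[] key = new byte[KEY_SIZE];
        long lo = 0, hi = numRecords - 1;
        while (lo <= hi) {
          long mid = (lo + hi) >>> 1;
          in.readFully(mid * RECORD_SIZE, key);  // read just the middle record's key
          int cmp = compare(key, target);
          if (cmp == 0) return true;
          if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
      }

      static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
          int d = (a[i] & 0xff) - (b[i] & 0xff);
          if (d != 0) return d;
        }
        return a.length - b.length;
      }

      public static void main(String[] args) throws Exception {
        // which FileSystem implementation is used (DFS or RadFS) comes from the config
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path(args[0]));
        System.out.println(search(in, 20000000L, new byte[KEY_SIZE]));
        in.close();
      }
    }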

Search times in ms over 5000 random searches (after 1000 warm-up searches). 
The baseline used FS class org.apache.hadoop.hdfs.DistributedFileSystem; the 
RadFS rows used org.apache.hadoop.hdfs.rad.RadFileSystem.

Configuration         Mean     Median   90th pct   Max    Min   Variance
HDFS, baseline        102.18   97       130        3095   33    5939.66
RadFS, no cache        68.56   67        79         379   26     233.83
RadFS, 16MB cache      42.04   40        59         203    5     237.84
RadFS, 128MB cache     29.85   27        45         203    1     202.08
RadFS, 512MB cache     24.27   22        36         687    0     250.31


I could still shave a point or two by cleaning up my caching system to be more 
graceful with its lookahead mechanism, but not bad for now.  I'll pretty it up 
and post a first attempt at a final patch soon.




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-08-03 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738565#action_12738565
 ] 

Raghu Angadi commented on HDFS-516:
---

Jay, random read is an increasingly important feature for HDFS to support; 
currently, latency is the biggest drawback. See HDFS-236. It is good to see 
your work on this. You could also run the simple benchmark in HDFS-236 that 
does random reads on a file and does not depend on a sequence file.

From your architecture description, this reduces latency through the following 
improvements:

   * Connection caching (through RPC)
   * FileChannel caching on the server (a rough sketch follows below)
   * Local cache on the client

These are complementary to the existing datanode. It might be a lot simpler to 
add these features to the existing implementation rather than requiring a user 
to choose an implementation based on the access pattern. As it stands, you will 
have to re-implement many features (BlockLocations on the client, CRC 
verification, efficient bulk transfers AVRO-24, etc.).
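
As a rough illustration of the second item above, server-side FileChannel 
caching boils down to something like the following; the class name, cache size, 
and locking are assumptions for the sketch, not code from the patch:

    // Minimal LRU cache of open FileChannels keyed by block file; evicted
    // channels are closed. Purely illustrative.
    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class FileChannelCacheSketch {
      private final LinkedHashMap<File, FileChannel> cache;

      public FileChannelCacheSketch(final int maxOpen) {
        // access-order LinkedHashMap gives LRU eviction
        this.cache = new LinkedHashMap<File, FileChannel>(16, 0.75f, true) {
          protected boolean removeEldestEntry(Map.Entry<File, FileChannel> eldest) {
            if (size() > maxOpen) {
              try { eldest.getValue().close(); } catch (IOException ignored) {}
              return true;
            }
            return false;
          }
        };
      }

      // Returns an open, cached channel for the block file, opening it on a miss.
      public synchronized FileChannel get(File blockFile) throws IOException {
        FileChannel ch = cache.get(blockFile);
        if (ch == null || !ch.isOpen()) {
          ch = new RandomAccessFile(blockFile, "r").getChannel();
          cache.put(blockFile, ch);
        }
        return ch;
      }
    }

A read handler on the server side would then serve a request with something 
like cache.get(blockFile).read(buffer, offsetInBlock) instead of opening the 
block file on every request.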







[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-08-03 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738651#action_12738651
 ] 

Jay Booth commented on HDFS-516:


Wow, thanks Raghu, that's awesome and will save me a ton of time. A couple of 
points for discussion:

* The random 4KB byte grabber is awesome and I will be using it as part of my 
benchmarking at the first opportunity. However, I think it's worth also testing 
some likely applications to really show the strength of client-side caching. 
10MB or so worth of properly warmed cache could mean your first 20 lookups in a 
binary search are almost free, and having the frontmost 10% of a Lucene index 
in cache will mean that almost all of the scoring portion of the search is 
computed against local memory. Meanwhile, for truly random reads, having a 
cache that's, say, 5-10% of the size of the data will only get you a small 
improvement. So I'd like to get some numbers for use cases that really thrive 
on caching, in addition to truly random access. But that benchmark will be 
extremely useful for tuning the IO layer and establishing a baseline for 
cache-miss performance, so thanks for the heads up.

* I have a feeling that my implementation is significantly slower than the 
default when it comes to streaming, since it relies on successive small 
positioned reads and a heavy memory footprint rather than a simple stream of 
bytes. Watching my unit tests run on my laptop with a ton of confounding 
factors sure seemed that way, although that's not a scientific measurement (one 
more item to benchmark). So while I agree with the urge for simplicity, I feel 
like we need to make that performance tradeoff clear; otherwise, we could have 
a lot of very slow MapReduce jobs happening. Given that MapReduce is the 
primary use case for Hadoop, my instinct was to make RadFileSystem a 
non-default implementation. Point very well taken about the BlockLocations and 
CRC verification; maybe the best way to handle future integration with DataNode 
would be to develop separately, reuse as much code as possible, and then, when 
RadFileSystem is mature and benchmarked, revisit a merge with 
DistributedFileSystem?

Thanks again. I'll try to write a post later tonight with an explicit plan for 
benchmarking, and then maybe people can comment and poke holes in it as they 
see fit?




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-07-31 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737713#action_12737713
 ] 

Jay Booth commented on HDFS-516:


Here's an architectural overview and a general request for comments. I'll be 
away and busy the next few days but should be able to get back to this in the 
middle of next week.

The basic workflow: I created a RadFileSystem (RandomAccessDistributed FS) 
which wraps DistributedFileSystem and delegates to it for everything except 
getFSDataInputStream. That returns a custom FSDataInputStream which wraps a 
CachingByteService, which itself wraps a RadFSByteService. The caching byte 
services share a cache which is managed by the RadFSClient class (I could maybe 
factor that away and put it in RadFileSystem instead). They try to hit the 
cache, and if they miss, they call the underlying RadFSClientByteService to get 
the requested page plus a few pages of lookahead. The RadFSClientByteService 
calls the namenode to get the appropriate block locations (TODO: cache these 
effectively) and then calls RadNode, which is embedded in DataNode via 
ServicePlugin and maintains an IPC server and a set of FileChannels to the 
local blocks. On repeated requests for the same data, the RadFSClient tends to 
favor going to the same host, figuring that the benefit of hitting the 
DataNode's OS cache for the given bytes outweighs the penalty of hopping a rack 
in terms of reducing latency (untested assumption).
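
To make the cache-miss path above concrete, here is a heavily condensed sketch 
of the shape it describes; the interface, page size, and lookahead depth are 
illustrative stand-ins, not the actual classes in the patch:

    // Condensed sketch of the client-side read path: a shared LRU page cache
    // that, on a miss, fetches the requested page plus a few pages of lookahead
    // from the remote byte service. Illustrative only; EOF handling omitted.
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    interface ByteService {
      // One fixed-size page of a file; the page size is an assumption of the sketch.
      byte[] readPage(String path, long pageId) throws IOException;
    }

    class CachingByteServiceSketch implements ByteService {
      static final int LOOKAHEAD_PAGES = 4;   // lookahead depth: assumption

      private final ByteService remote;       // e.g. the RPC client talking to the datanode plugin
      private final Map<String, byte[]> lru;  // shared LRU cache keyed by path + page id

      CachingByteServiceSketch(ByteService remote, final int maxPages) {
        this.remote = remote;
        this.lru = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
          protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
            return size() > maxPages;         // plain LRU eviction
          }
        };
      }

      public synchronized byte[] readPage(String path, long pageId) throws IOException {
        byte[] page = lru.get(path + "#" + pageId);
        if (page == null) {                                  // cache miss:
          page = remote.readPage(path, pageId);              // fetch the requested page...
          lru.put(path + "#" + pageId, page);
          for (int i = 1; i <= LOOKAHEAD_PAGES; i++) {       // ...plus a few pages of lookahead
            String key = path + "#" + (pageId + i);
            if (!lru.containsKey(key)) {
              lru.put(key, remote.readPage(path, pageId + i));
            }
          }
        }
        return page;
      }
    }

The custom FSDataInputStream would then translate (position, length) requests 
into page reads against this cache, and the remote ByteService would be the 
piece that looks up block locations and calls into the datanode-side plugin.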

The intended use case is pretty different from MapReduce, so I think this 
should be a contrib module that has to be explicitly invoked by clients. It 
really underperforms DFS in terms of streaming but should (I haven't tested 
extensively outside of localhost) significantly outperform it in terms of 
random reads. For files with 'hot paths', such as Lucene indices or binary 
search over a normal file, the cache hit percentage is likely to be pretty 
high, so it should probably perform pretty well. Currently, it makes a fresh 
request to the NameNode for every read, which is inefficient but more likely to 
be correct. Going forward, I'd like to tighten this up, make sure it plays nice 
with append, and get it into a future Hadoop release.




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-07-31 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737797#action_12737797
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-516:
-

bq. ..., svn diff misses new files, ...

For new files, run "svn add /path/to/new/files" before "svn diff".




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-07-31 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737799#action_12737799
 ] 

Jay Booth commented on HDFS-516:


Ok, thanks, is one big patch preferred to the tiny patch + tarball?  




[jira] Commented: (HDFS-516) Low Latency distributed reads

2009-07-31 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737806#action_12737806
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-516:
-

bq. Ok, thanks, is one big patch preferred to the tiny patch + tarball?

Yes. Otherwise, the automated build won't work.
