All the documentation for HDFS says that it's for large streaming
jobs, but I couldn't find an explicit answer to this, so I'll try
asking here.  How is HDFS's random seek performance within an
FSDataInputStream?  I use Lucene with a lot of indices (potentially
thousands), so I was thinking of putting them into HDFS and
reimplementing my search as a Hadoop map-reduce.  I've noticed that
Lucene tends to do a fair amount of random seeking when searching,
though, and I don't believe it guarantees that all seeks are to
increasing file positions either.

Would HDFS be a bad fit for an access pattern that involves seeks to
random positions within a stream?
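For concreteness, here's a sketch of the access pattern I mean, written
against a local RandomAccessFile as a stand-in (as I understand it,
FSDataInputStream exposes the same seek(long) call via the Seekable
interface, so the shape of the calls should carry over; the class and
method names here are just illustrative):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative only: non-monotonic absolute seeks within one open stream,
// the kind of access Lucene does while searching an index.
public class RandomSeekSketch {
    // Seek to an absolute position and read len bytes.  Note the position
    // may be LOWER than that of the previous read.
    static byte[] readAt(RandomAccessFile f, long pos, int len) throws IOException {
        byte[] buf = new byte[len];
        f.seek(pos);
        f.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("seek", ".dat");
        tmp.deleteOnExit();
        try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
            f.write("0123456789".getBytes("US-ASCII"));
            // Seeks are deliberately not in increasing order.
            System.out.println(new String(readAt(f, 7, 2), "US-ASCII")); // 78
            System.out.println(new String(readAt(f, 1, 3), "US-ASCII")); // 123
        }
    }
}
```

My worry is whether each such backwards seek forces HDFS to re-open or
re-buffer the block, which would make this pattern expensive.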

Also, is getFileStatus the typical way of getting the length of a file
in HDFS, or is there some method on FSDataInputStream that I'm not
seeing?

Please cc: me on any reply; I'm not on the hadoop list.  Thanks!
