[ 
https://issues.apache.org/jira/browse/SOLR-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743215#comment-13743215
 ] 

Aaron McCurry commented on SOLR-5150:
-------------------------------------

First off I'm really happy to see other people trying to improve performance of 
the HDFSDirectory.  So I will offer some reasons as to why I have landed on the 
current implementation in Blur.

Why Blur doesn't clone the HDFS file handle for clones in Lucene:
 - Mainly because, since Lucene 4, cloned file handles don't always seem to 
get closed.  So I didn't want all those objects hanging around for long 
periods of time without being closed.  Related, for those who are interested: 
Blur has a Directory reference counter so that files deleted by Lucene stick 
around long enough for running queries to finish.
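The reference-counting idea could be sketched roughly like this (class and method names here are hypothetical, not Blur's actual code): deletes requested by Lucene are deferred while any running query still holds a reference to the file.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a per-file reference counter: a delete requested
// while readers are active is deferred until the last reader releases.
public class RefCountingDeleter {
    private final Map<String, AtomicInteger> refs = new ConcurrentHashMap<>();
    private final Map<String, Boolean> pendingDelete = new ConcurrentHashMap<>();

    // Stand-in for the real HDFS delete; records what was actually removed.
    public final Set<String> deleted = ConcurrentHashMap.newKeySet();

    // Called when a query opens the file.
    public void incRef(String file) {
        refs.computeIfAbsent(file, f -> new AtomicInteger()).incrementAndGet();
    }

    // Called when a query finishes with the file; fires any deferred delete
    // once the count drops to zero.
    public void decRef(String file) {
        AtomicInteger c = refs.get(file);
        if (c != null && c.decrementAndGet() == 0
                && pendingDelete.remove(file) != null) {
            reallyDelete(file);
        }
    }

    // Called when Lucene asks to delete the file.
    public void delete(String file) {
        AtomicInteger c = refs.get(file);
        if (c == null || c.get() == 0) {
            reallyDelete(file);                   // no readers: delete now
        } else {
            pendingDelete.put(file, Boolean.TRUE); // defer until readers finish
        }
    }

    private void reallyDelete(String file) {
        deleted.add(file);                        // real code would hit HDFS here
    }
}
```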

Why Blur doesn't use the read[Fully](position,buf,off,len) method instead of 
seek plus read[Fully](buf,off,len):
 - When accessing the local file system, the positioned call would take a huge 
amount of time because of some internal setup Hadoop was doing on every call.  
This didn't seem to be an issue when using HDFS, but if you start using 
short-circuit reads it might become a problem.  I have not tested this in 6 
months, so it may have been improved in newer versions of Hadoop.
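To make the trade-off concrete, here is a minimal sketch of the two access patterns using plain JDK I/O (Hadoop's FSDataInputStream exposes the analogous pair: seek() followed by read(), versus the positioned read(position, buf, off, len)). The key difference is that seek + read mutates shared stream state, so clones must synchronize across both calls, while the positioned variant carries the offset with the call and needs no lock.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Models the two access patterns discussed above with java.nio.
public class ReadPatterns {

    // Pattern 1: stateful seek + read.  The channel position is shared
    // mutable state, so the lock must span BOTH calls to keep clones safe.
    public static int seekThenRead(FileChannel ch, long pos, ByteBuffer buf)
            throws IOException {
        synchronized (ch) {
            ch.position(pos);
            return ch.read(buf);
        }
    }

    // Pattern 2: positioned read.  The offset travels with the call, no
    // shared state is mutated, and no synchronization is required.
    public static int positionedRead(FileChannel ch, long pos, ByteBuffer buf)
            throws IOException {
        return ch.read(buf, pos);
    }
}
```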

Why Blur uses readFully versus read:
 - Laziness?  Not sure.  I believe I thought that a single seek + readFully 
call against the filesystem would be better (even if it did more work 
internally) than multiple read calls with multiple seeks.  Though perhaps, as 
you all are discussing, it would be better not to use readFully because of the 
synchronization it requires.

How would I really like to implement it?
 - I would like to implement the file access system as a pool of file handles 
per file.  Each file would have up to N file handles open (configured, with a 
default of 10 or so), and all accesses from the base file objects and clones 
would check out a handle and release it when finished.  That way there is a 
limit on the number of open handles, but parallel accesses are still allowed.
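The pooling idea described above could be sketched like this (a hypothetical outline, not an existing Blur or Solr class): a bounded queue of handles per file, where checkout blocks once all N are in use.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical per-file handle pool: up to N handles are opened, readers
// check one out for the duration of an access and return it afterwards.
public class FileHandlePool<H> {
    private final BlockingQueue<H> idle;

    public FileHandlePool(int maxHandles, Supplier<H> opener) {
        idle = new ArrayBlockingQueue<>(maxHandles);
        for (int i = 0; i < maxHandles; i++) {
            idle.add(opener.get());   // eagerly open N handles for this file
        }
    }

    // Blocks when all N handles are checked out, bounding the number of
    // open handles while still allowing parallel reads.
    public H checkout() throws InterruptedException {
        return idle.take();
    }

    public void release(H handle) {
        idle.add(handle);
    }
}
```

A production version would likely open handles lazily and close the pool when the file's reference count hits zero, but the bounded-queue core is the same.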

Hope this helps to explain why Blur has the implementation that it does.

Aaron
                
> HdfsIndexInput may not fully read requested bytes.
> --------------------------------------------------
>
>                 Key: SOLR-5150
>                 URL: https://issues.apache.org/jira/browse/SOLR-5150
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.4
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 4.5, 5.0
>
>         Attachments: SOLR-5150.patch
>
>
> Patrick Hunt noticed that our HdfsDirectory code was a bit behind Blur here - 
> the read call we are using may not read all of the requested bytes - it 
> returns the number of bytes actually read - which we ignore.
> Blur moved to using a seek and then a readFully call - synchronizing across 
> the two calls to deal with clones.
> We have seen that this really kills performance, and using the readFully 
> call that lets you pass the position rather than first doing a seek performs 
> much better and does not require the synchronization.
> I also noticed that the seekInternal impl should not seek but be a no-op, 
> since we are seeking on the read.
