[ https://issues.apache.org/jira/browse/SOLR-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743215#comment-13743215 ]
Aaron McCurry commented on SOLR-5150:
-------------------------------------

First off, I'm really happy to see other people trying to improve the performance of HdfsDirectory. So I will offer some reasons why I have landed on the current implementation in Blur.

Why Blur doesn't clone the HDFS file handle for clone in Lucene.
- Mainly because, since Lucene 4, cloned file handles don't seem to get closed all the time. So I didn't want all those objects hanging around for long periods of time without being closed.
- Related, for those who are interested: Blur has a Directory reference counter so that files deleted by Lucene stick around long enough for running queries to finish.

Why Blur doesn't use the readFully(position, buf, off, len) method instead of seek plus readFully(buf, off, len).
- When accessing the local file system, that call would take a huge amount of time because of some internal setup Hadoop was doing on every call. This didn't seem to be an issue when using HDFS, but if you start using short-circuit reads it might become a problem. I have not tested this for 6 months, so it may have been improved in newer versions of Hadoop.

Why Blur uses readFully versus read.
- Laziness? Not sure. I think I assumed that a single call to seek + read from the filesystem would be better (even if it was more operations) than multiple calls with multiple seeks + reads. Perhaps, though, it would be better not to use readFully, as you all are discussing, because of the sync call.

How would I really like to implement it?
- I would like to implement the file access system as a pool of file handles for each file, so that each file would have up to N (configured by default to 10 or something like that) file handles open, and all accesses from the base file object and its clones would check out a handle and release it when finished. That way there is some limit on the number of handles, but some parallel access is allowed.
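The reference-counting idea mentioned above (deleted files staying alive until in-flight queries finish) could be sketched roughly like this. The class name and structure are illustrative only, not Blur's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a deleted file's resources are only released
// once every in-flight query has dropped its reference.
class RefCountedFile {
    private final AtomicInteger refs = new AtomicInteger(1); // creator holds one ref
    private volatile boolean closed = false;

    void incRef() {
        refs.incrementAndGet(); // a query starting against this file
    }

    // Returns true when the last reference is dropped and the file is freed.
    boolean decRef() {
        if (refs.decrementAndGet() == 0) {
            closed = true; // real code would close the underlying handle here
            return true;
        }
        return false;
    }

    boolean isClosed() {
        return closed;
    }
}
```

So a Lucene delete only drops the Directory's reference; the underlying file survives until the last running query releases its own.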
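The pooled-handle design described above could look something like the following sketch. `HandlePool` is a hypothetical name, and real code would open and close actual HDFS input streams rather than generic objects:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch of a per-file handle pool: up to N handles are opened
// for each file; the base file object and its clones check one out per
// access and release it afterwards, bounding handles while allowing
// parallel reads.
class HandlePool<T> {
    private final BlockingQueue<T> handles;

    HandlePool(int maxHandles, Supplier<T> opener) {
        handles = new ArrayBlockingQueue<>(maxHandles);
        for (int i = 0; i < maxHandles; i++) {
            handles.add(opener.get()); // eagerly open N handles for this file
        }
    }

    T checkout() {
        try {
            return handles.take(); // blocks if all N handles are in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    void release(T handle) {
        handles.add(handle); // only previously checked-out handles come back
    }
}
```

A blocking queue gives the two properties the comment asks for cheaply: a hard cap on open handles per file, and no synchronization between readers once each holds its own handle.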
Hope this helps to explain why Blur has the implementation that it does.

Aaron

> HdfsIndexInput may not fully read requested bytes.
> --------------------------------------------------
>
> Key: SOLR-5150
> URL: https://issues.apache.org/jira/browse/SOLR-5150
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.4
> Reporter: Mark Miller
> Assignee: Mark Miller
> Fix For: 4.5, 5.0
>
> Attachments: SOLR-5150.patch
>
> Patrick Hunt noticed that our HdfsDirectory code was a bit behind Blur here -
> the read call we are using may not read all of the requested bytes - it
> returns the number of bytes actually read - which we ignore.
> Blur moved to using a seek and then readFully call - synchronizing across the
> two calls to deal with clones.
> We have seen that really kills performance, and using the readFully call that
> lets you pass the position rather than first doing a seek performs much
> better and does not require the synchronization.
> I also noticed that the seekInternal impl should not seek but be a no-op,
> since we are seeking on the read.
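The positional read the issue advocates can be sketched with java.nio, where FileChannel's read(ByteBuffer, position) plays the role of Hadoop's PositionedReadable read(position, buf, off, len). Because the offset travels with each call, clones sharing one file need no shared seek state and no synchronization; the loop below also handles the short-read behavior the issue describes (a plain read may return fewer bytes than requested):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative only: java.nio stands in for Hadoop's FSDataInputStream.
class PositionalReadDemo {
    // Read exactly len bytes starting at pos, looping because a single
    // positional read may return fewer bytes than requested.
    static byte[] readFullyAt(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position()); // stateless positional read
            if (n < 0) {
                throw new EOFException("hit EOF before reading " + len + " bytes");
            }
        }
        return buf.array();
    }
}
```

Contrast this with seek-then-read on a shared stream: there the seek mutates the stream's position, so two clones must synchronize across the seek + read pair or they will corrupt each other's reads.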