[ https://issues.apache.org/jira/browse/SOLR-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743168#comment-13743168 ]
Patrick Hunt commented on SOLR-5150:
------------------------------------

Hi [~markrmil...@gmail.com], thanks for filing this while I was out. I was trying to track down another issue and happened across it while reviewing code (then noticed that Blur had changed from the original). I realized the seekInternal change was needed while on vacation and was going to mention it, but it looks like you fixed it already. ;-)

I reviewed the HDFS client code for readInternal with a member of our HDFS team before generating the original patch. Based on the feedback I got, my understanding was that doing the seek followed by the readFully should have given the highest performance. It's interesting that the query performance was so negatively impacted. We should follow up with those folks again. Perhaps you could provide more insight (than I can) into how Lucene accesses the underlying filesystem for query-based reads vs. other access patterns? That might help get more insight from the HDFS devs. Perhaps there is some way to trace those accesses...

We have not yet tried "short circuit local HDFS client reads" (see 12.11.2 here http://hbase.apache.org/book/perf.hdfs.html) but we should at some point (soon), and that will further complicate things. Based on the results other clients have seen, we should see significant performance benefits from that (at least when the blocks are indeed local).

> HdfsIndexInput may not fully read requested bytes.
> --------------------------------------------------
>
>                 Key: SOLR-5150
>                 URL: https://issues.apache.org/jira/browse/SOLR-5150
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.4
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 4.5, 5.0
>
>         Attachments: SOLR-5150.patch
>
>
> Patrick Hunt noticed that our HdfsDirectory code was a bit behind Blur here -
> the read call we are using may not read all of the requested bytes - it
> returns the number of bytes actually read - which we ignore.
> Blur moved to using a seek and then readFully call - synchronizing across the
> two calls to deal with clones.
> We have seen that really kills performance, and using the readFully call that
> lets you pass the position rather than first doing a seek, performs much
> better and does not require the synchronization.
> I also noticed that the seekInternal impl should not seek but be a no op
> since we are seeking on the read.
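
As a rough illustration of the problem described in the issue: a plain positional read may return fewer bytes than requested, so the caller has to loop (or use a readFully variant) instead of ignoring the return value. The sketch below uses java.nio's FileChannel as a stand-in for the HDFS stream. It is not the SOLR-5150 patch, and the names PositionalReadSketch and readFullyAt are made up for illustration. On HDFS, the positional FSDataInputStream.readFully(position, buffer, offset, length) plays the same role.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration only; FileChannel stands in for the HDFS input stream.
class PositionalReadSketch {

    /**
     * Reads exactly len bytes starting at the absolute file position,
     * looping because a single read may return fewer bytes than requested.
     * The position is passed on every call, so no shared file pointer is moved.
     */
    static void readFullyAt(FileChannel ch, long position, byte[] dst, int off, int len)
            throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(dst, off, len);
        long pos = position;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos); // may be a short read; returns -1 at end of file
            if (n < 0) {
                throw new EOFException(
                    "end of file with " + buf.remaining() + " bytes still to read");
            }
            pos += n;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("positional-read", ".bin");
        try {
            Files.write(tmp, new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                byte[] out = new byte[4];
                readFullyAt(ch, 3, out, 0, out.length);
                System.out.println(Arrays.toString(out)); // prints [3, 4, 5, 6]
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the position travels with each read, two clones sharing one underlying stream do not disturb each other, which is why this style needs neither a separate seek nor a lock around a seek/read pair.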