[GitHub] [lucene-solr] mikemccand commented on pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

GitBox Fri, 20 Nov 2020 07:25:57 -0800


mikemccand commented on pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#issuecomment-731234052



   > > Second, it is extremely experimental and not clear when it provides 
benefits / what risks there are / etc. We need to learn much more about it, in 
diverse usage, to help here. I'd love to hear from Elasticsearch or Solr users 
if this helps, since those applications do simultaneous indexing (merging) and 
searching on the same box.
   > 
   > I am sure you at Amazon will test it extensively. But I agree: I would 
not make this the default; I am still in favour of using plain MMapDirectory. 
The risk of making things worse by using direct IO is too high.
   
   LOL, actually, no, we at Amazon are not really planning on testing this 
extensively!  Amazon (well, specifically our customer-facing product search 
built directly on Lucene) uses [Lucene's fast segment replication 
feature](http://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html),
 which is much more efficient than Elasticsearch/Solr document replication when 
you need deep replicas because you have high peak QPS.  So, at Amazon, at least 
for product search, we never index and search on the same JVM/hardware.  
Instead we have a few dedicated boxes for pure indexing, then replicate 
segments via S3 out to many boxes dedicated to searching.
   
   Lucene's segment replication feature allows us to use much less hardware to 
simultaneously handle high indexing throughput and high query throughput.
   
   But, since Elasticsearch/Solr do concurrent indexing (merging) and searching 
on a single box, by design, I think this Directory would be very interesting to 
test.  It is likely a massive improvement in long-pole query latencies when 
heavy merges are running, since the merges would now bypass the OS's buffer 
(IO) cache entirely, using direct IO.
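   
   For anyone who has not used the JDK's direct IO flag, here is a minimal sketch of what "bypassing the buffer cache" looks like at the `FileChannel` level. It is not the PR's code: the class `DirectIoSketch`, its method names, and the 4096-byte fallback block size are made up for illustration. It assumes Java 10+ (for `ExtendedOpenOption.DIRECT` and `FileStore.getBlockSize()`), and it falls back to ordinary buffered writes when the file system rejects direct IO (some tmpfs mounts do).

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class DirectIoSketch {

  /** Block size of the file store holding dir; 4096 is an assumed fallback. */
  static int blockSize(Path dir) {
    try {
      return Math.toIntExact(Files.getFileStore(dir).getBlockSize());
    } catch (IOException | UnsupportedOperationException e) {
      return 4096;
    }
  }

  /** Writes zero-filled blocks to path; returns true if direct IO was used. */
  static boolean writeBlocks(Path path, int blocks) {
    try {
      write(path, blocks, true);
      return true;
    } catch (IOException | UnsupportedOperationException e) {
      // Direct IO not available here: retry through the OS buffer cache.
      try {
        write(path, blocks, false);
        return false;
      } catch (IOException e2) {
        throw new UncheckedIOException(e2);
      }
    }
  }

  private static void write(Path path, int blocks, boolean direct) throws IOException {
    int bs = blockSize(path.toAbsolutePath().getParent());
    // Direct IO needs the buffer address, file position, and transfer size
    // to all be multiples of the block size, hence the aligned slice.
    ByteBuffer buf = ByteBuffer.allocateDirect(2 * bs).alignedSlice(bs);
    Set<OpenOption> opts = new HashSet<>();
    opts.add(StandardOpenOption.CREATE);
    opts.add(StandardOpenOption.WRITE);
    opts.add(StandardOpenOption.TRUNCATE_EXISTING);
    if (direct) {
      opts.add(ExtendedOpenOption.DIRECT); // O_DIRECT on Linux
    }
    try (FileChannel ch = FileChannel.open(path, opts)) {
      for (int i = 0; i < blocks; i++) {
        buf.position(0).limit(bs); // exactly one aligned block per write
        while (buf.hasRemaining()) {
          ch.write(buf);
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("direct-io-sketch", ".bin");
    boolean direct = writeBlocks(p, 4);
    System.out.println("direct IO used: " + direct + ", bytes: " + Files.size(p));
    Files.delete(p);
  }
}
```

   The alignment bookkeeping is the fiddly part, and it is the kind of detail a Directory wrapper can hide from the application.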
   
   > > Third, users are able to choose to use this when they instantiate the 
Directory implementation for their search application, so it is straightforward 
to adopt and play with, even if Lucene's core does not do so by default.
   > 
   > +1
   > 
   > Elasticsearch may play with it and may also improve the parts where it is 
actually used. We do not know yet if it is a good idea to use it when you merge 
stuff that needs heavy random access to index (like you have a 
FilterCodecReader during merging, transform an index, resort it,...). Also it 
depends on codecs and how they are implemented. Once we know that it works 
well for merging all parts of Lucene's core codecs, we may make a recommendation.
   > 
   > If we decide to make it part of Lucene core, we can just move it. It will 
compile and work out of box with current Java versions and most file systems.
   
   Yeah, this is an awesome improvement thanks to this PR -- it becomes pure 
Java, yay!  But we need more data from actual usage to decide whether this is 
worth moving to core, let alone making it the default.
   
   Thanks @zacharymorn!
   
   

