[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027006#comment-17027006 ]
David Smiley commented on LUCENE-8962:
--------------------------------------

Woah; I defer to your expertise [~mikemccand]. IndexWriter has a huge bus factor and I haven't delved into it. Still... I want to confirm what I think you are telling me. Based on my understanding of when merges are triggered and observed (by a reader/searcher), I wrote the following test on {{TestIndexWriterMergePolicy}}:

{code:java}
  public void testMergeOnCommitIsSearchable() throws IOException {
    try (
        Directory dir = newDirectory();
        IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))
            .setMaxBufferedDocs(10)
            .setMergePolicy(new LogDocMergePolicy())
            .setMergeScheduler(new SerialMergeScheduler()))
    ) {
      for (int i = 0; i < 99; i++) {
        addDoc(writer);
        checkInvariants(writer);
      }
      // Some documents have already been flushed to segments; the rest are still buffered.
      assertEquals(9, writer.getSegmentCount());
      assertEquals(9, writer.getNumBufferedDocuments());

      // The commit flushes the buffer as the 10th segment, which triggers the merge.
      writer.commit();

      // Open an NRT reader from the writer and count the segments it sees.
      try (DirectoryReader reader = DirectoryReader.open(writer)) {
        assertEquals(1, reader.getSequentialSubReaders().size());
      }
    }
  }
{code}

Generally speaking, and as seen here, after a commit a search application will open a new reader to be able to search over the recently committed documents. In the scenario above, I index a bunch of documents, some of which have already been flushed and some of which are still pending. Also notice the SerialMergeScheduler, so that the writing thread merges in-process / synchronously. Then see that I open an NRT reader from the writer and count the segments. I get 1, because the flushed buffer produces the 10th segment and the configured LogDocMergePolicy merges them all together at 10. _The test passes_.

So; am I missing something?

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory
> segments to disk and open an {{IndexReader}} to search them, and this is
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}}
> will write many small segments during {{refresh}}, and this then
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if
> given a little time ... so, could we somehow improve {{IndexWriter}}'s
> refresh to optionally kick off the merge policy to merge segments below some
> threshold before opening the near-real-time reader? It'd be a bit tricky
> because while we are waiting for merges, indexing may continue, and new
> segments may be flushed, but those new segments shouldn't be included in the
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy,
> and some hackity logic to have the merge policy target small segments just
> written by refresh, but it's tricky to then open a near-real-time reader,
> excluding newly flushed but including newly merged segments since the refresh
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for
> discussion!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)