That said... if we generate the global DocMap up front, there's no reason to not execute the merge of the segments more efficiently, i.e. without wrapping them in a SlowCompositeReaderWrapper.
But that's not work for SortingMergePolicy, it's either a special SortingAtomicReader which wraps a group of readers + a global DocMap, and then merge-sorts them more efficiently than how it's done now. Or we tap into SegmentMerger .. which is way more complicated. Perhaps it would be worth to explore a SortingMultiSortedAtomicReader which merge-sorts the postings and other data that way ... I look at e.g how doc-values are merged .. not sure it will improve performance. But if you want to cons up a patch, that'd be awesome! Shai On Tue, Jun 17, 2014 at 8:01 PM, Shai Erera <ser...@gmail.com> wrote: > OK I think I now understand what you're asking :). It's unrelated though > to SortingMergePolicy. You propose to do the "merge" part of a merge-sort, > since we know the indexes are already sorted, right? > > This is something we've considered in the past, but it is very tricky (see > below) and we went with the SortingAR for simplicity and speed of coding. > If however you have an idea how we can easily implement that, that would be > awesome. > > So let's consider merging the posting lists of f:val from the N readers. > Say that each returns docs 0-3, and the merged posting will have 4*N > entries (say we don't have deletes). To properly merge them, you need to > lookup the sort-value of each document from each reader, and compare > according to it. > > Now you move on to f:val2 (another posting) and it wants to merge 100 > other docs. So you need to lookup the value of each document, compare by > it, and merge them. And the process continues ... > > These lookups are expensive and will be done millions of times (each term, > each DV field, each .. everything). > > More than that, there's a serious issue of correctness, because you never > make a global sorting decision. So if f:val sees only a single document - > 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If > you make a local decision based on these 4 documents, you will end up w/ a > completely messed up segment. > > I think the global DocMap is really required. Forget about that that other > code, e.g. IndexWriter relies on this in order to properly apply incoming > document deletions and field updates while the segments were merging. It's > just a matter of correctness - we need to know the global sorted segment > map. > > Shai > > > On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > >> > >> > Therefore the DocMap is initialized only when the >> > merge actually executes ... what is there more to postpone? >> >> >> Agreed. However, what I am asking is, if there is an alternative to >> DocMap, >> will that be better? Plz read-on >> >> And besides, if the segments are already sorted, you should return a >> null DocMap, >> > like Lucene code does ... >> >> >> What I am trying to say is, my individual segments are sorted. However, >> when a merge combines "N" individual sorted-segments, there needs to be a >> global sort-order for writing the new segment. Passing null DocMap won't >> work here, no? >> >> DocMap is one-way of bringing the global order during a merge. Another way >> is to use something like a MergedIterator<SegmentReader> instead of >> DocMap, >> which doesn't need any memory >> >> I was trying to get a heads-up on these 2 approaches. Please do let me >> know >> if I have understood correctly >> >> -- >> Ravi >> >> >> >> >> On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera <ser...@gmail.com> wrote: >> >> > > >> > > I am afraid the DocMap still maintains doc-id mappings till merge and >> I >> > am >> > > trying to avoid it... >> > > >> > >> > What do you mean 'till merge'? The method OneMerge.getMergeReaders() is >> > called only when the merge is executed, not when the MergePolicy >> decided to >> > merge those segments. Therefore the DocMap is initialized only when the >> > merge actually executes ... what is there more to postpone? >> > >> > And besides, if the segments are already sorted, you should return a >> null >> > DocMap, like Lucene code does ... >> > >> > If I miss your point, I'd appreciate if you can point me to a code >> example, >> > preferably in Lucene source, which demonstrates the problem. >> > >> > Shai >> > >> > >> > On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan < >> > ravikumar.govindara...@gmail.com> wrote: >> > >> > > I am afraid the DocMap still maintains doc-id mappings till merge and >> I >> > am >> > > trying to avoid it... >> > > >> > > I think lucene itself has a MergeIterator in o.a.l.util package. >> > > >> > > A MergePolicy can wrap a simple MergeIterator for iterating docs >> across >> > > different AtomicReaders in correct sort-order for a given field/term >> > > >> > > That should be fine right? >> > > >> > > -- >> > > Ravi >> > > >> > > -- >> > > Ravi >> > > >> > > >> > > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera <ser...@gmail.com> wrote: >> > > >> > > > loadSortTerm is your method right? In the current Sorter.sort >> > > > implementation, I see this code: >> > > > >> > > > boolean sorted = true; >> > > > for (int i = 1; i < maxDoc; ++i) { >> > > > if (comparator.compare(i-1, i) > 0) { >> > > > sorted = false; >> > > > break; >> > > > } >> > > > } >> > > > if (sorted) { >> > > > return null; >> > > > } >> > > > >> > > > Perhaps you can write similar code? >> > > > >> > > > Also note that the sorting interface has changed, I think in 4.8, >> and >> > now >> > > > you don't really need to implement a Sorter, but rather pass a >> > SortField, >> > > > if that works for you. >> > > > >> > > > Shai >> > > > >> > > > >> > > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan < >> > > > ravikumar.govindara...@gmail.com> wrote: >> > > > >> > > > > Shai, >> > > > > >> > > > > This is the code snippet I use inside my class... >> > > > > >> > > > > public class MySorter extends Sorter { >> > > > > >> > > > > @Override >> > > > > >> > > > > public DocMap sort(AtomicReader reader) throws IOException { >> > > > > >> > > > > final Map<Integer, BytesRef> docVsId = loadSortTerm(reader); >> > > > > >> > > > > final Sorter.DocComparator comparator = new >> Sorter.DocComparator() >> > { >> > > > > >> > > > > @Override >> > > > > >> > > > > public int compare(int docID1, int docID2) { >> > > > > >> > > > > BytesRef v1 = docVsId.get(docID1); >> > > > > >> > > > > BytesRef v2 = docVsId.get(docID2); >> > > > > >> > > > > return v1.compareTo(v2); >> > > > > >> > > > > } >> > > > > >> > > > > }; >> > > > > >> > > > > return sort(reader.maxDoc(), comparator); >> > > > > >> > > > > } >> > > > > } >> > > > > >> > > > > My Problem is, the "AtomicReader" passed to Sorter.sort method is >> > > > actually >> > > > > a SlowCompositeReader, composed of a list of AtomicReaders each of >> > > which >> > > > is >> > > > > already sorted. >> > > > > >> > > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy >> where >> > it >> > > > > tries to all load the doc-to-term mappings eagerly... >> > > > > >> > > > > Are there some alternatives for this? >> > > > > >> > > > > -- >> > > > > Ravi >> > > > > >> > > > > >> > > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera <ser...@gmail.com> >> > wrote: >> > > > > >> > > > > > I'm not sure that I follow ... where do you see DocMap being >> loaded >> > > up >> > > > > > front? Specifically, Sorter.sort may return null of the readers >> are >> > > > > already >> > > > > > sorted ... I think we already optimized for the case where the >> > > readers >> > > > > are >> > > > > > sorted. >> > > > > > >> > > > > > Shai >> > > > > > >> > > > > > >> > > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan < >> > > > > > ravikumar.govindara...@gmail.com> wrote: >> > > > > > >> > > > > > > I am planning to use SortingMergePolicy where all the >> > > > > merge-participating >> > > > > > > segments are already sorted... I understand that I need to >> > define a >> > > > > > DocMap >> > > > > > > with old-new doc-id mappings. >> > > > > > > >> > > > > > > Is it possible to optimize the eager loading of DocMap and >> make >> > it >> > > > kind >> > > > > > of >> > > > > > > lazy load on-demand? >> > > > > > > >> > > > > > > Ex: Pass List<AtomicReader> to the caller and ask for next >> > new-old >> > > > doc >> > > > > > > mapping.. >> > > > > > > >> > > > > > > Since my segments are already sorted, I could save on memory a >> > > > > little-bit >> > > > > > > this way, instead of loading the full DocMap upfront >> > > > > > > >> > > > > > > -- >> > > > > > > Ravi >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> > >