Re: SortingMergePolicy for already sorted segments

Shai Erera Tue, 17 Jun 2014 10:21:43 -0700

That said... if we generate the global DocMap up front, there's no reason
to not execute the merge of the segments more efficiently, i.e. without
wrapping them in a SlowCompositeReaderWrapper.


But that's not work for SortingMergePolicy, it's either a special
SortingAtomicReader which wraps a group of readers + a global DocMap, and
then merge-sorts them more efficiently than how it's done now. Or we tap
into SegmentMerger .. which is way more complicated.

Perhaps it would be worth to explore a SortingMultiSortedAtomicReader which
merge-sorts the postings and other data that way ... I look at e.g how
doc-values are merged .. not sure it will improve performance. But if you
want to cons up a patch, that'd be awesome!

Shai


On Tue, Jun 17, 2014 at 8:01 PM, Shai Erera <[email protected]> wrote:

> OK I think I now understand what you're asking :). It's unrelated though
> to SortingMergePolicy. You propose to do the "merge" part of a merge-sort,
> since we know the indexes are already sorted, right?
>
> This is something we've considered in the past, but it is very tricky (see
> below) and we went with the SortingAR for simplicity and speed of coding.
> If however you have an idea how we can easily implement that, that would be
> awesome.
>
> So let's consider merging the posting lists of f:val from the N readers.
> Say that each returns docs 0-3, and the merged posting will have 4*N
> entries (say we don't have deletes). To properly merge them, you need to
> lookup the sort-value of each document from each reader, and compare
> according to it.
>
> Now you move on to f:val2 (another posting) and it wants to merge 100
> other docs. So you need to lookup the value of each document, compare by
> it, and merge them. And the process continues ...
>
> These lookups are expensive and will be done millions of times (each term,
> each DV field, each .. everything).
>
> More than that, there's a serious issue of correctness, because you never
> make a global sorting decision. So if f:val sees only a single document -
> 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If
> you make a local decision based on these 4 documents, you will end up w/ a
> completely messed up segment.
>
> I think the global DocMap is really required. Forget about that that other
> code, e.g. IndexWriter relies on this in order to properly apply incoming
> document deletions and field updates while the segments were merging. It's
> just a matter of correctness - we need to know the global sorted segment
> map.
>
> Shai
>
>
> On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
>> >
>> > Therefore the DocMap is initialized only when the
>> > merge actually executes ... what is there more to postpone?
>>
>>
>> Agreed. However, what I am asking is, if there is an alternative to
>> DocMap,
>> will that be better? Plz read-on
>>
>>  And besides, if the segments are already sorted, you should return a
>> null DocMap,
>> > like Lucene code does ...
>>
>>
>> What I am trying to say is, my individual segments are sorted. However,
>> when a merge combines "N" individual sorted-segments, there needs to be a
>> global sort-order for writing the new segment. Passing null DocMap won't
>> work here, no?
>>
>> DocMap is one-way of bringing the global order during a merge. Another way
>> is to use something like a MergedIterator<SegmentReader> instead of
>> DocMap,
>> which doesn't need any memory
>>
>> I was trying to get a heads-up on these 2 approaches. Please do let me
>> know
>> if I have understood correctly
>>
>> --
>> Ravi
>>
>>
>>
>>
>> On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera <[email protected]> wrote:
>>
>> > >
>> > > I am afraid the DocMap still maintains doc-id mappings till merge and
>> I
>> > am
>> > > trying to avoid it...
>> > >
>> >
>> > What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
>> > called only when the merge is executed, not when the MergePolicy
>> decided to
>> > merge those segments. Therefore the DocMap is initialized only when the
>> > merge actually executes ... what is there more to postpone?
>> >
>> > And besides, if the segments are already sorted, you should return a
>> null
>> > DocMap, like Lucene code does ...
>> >
>> > If I miss your point, I'd appreciate if you can point me to a code
>> example,
>> > preferably in Lucene source, which demonstrates the problem.
>> >
>> > Shai
>> >
>> >
>> > On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan <
>> > [email protected]> wrote:
>> >
>> > > I am afraid the DocMap still maintains doc-id mappings till merge and
>> I
>> > am
>> > > trying to avoid it...
>> > >
>> > > I think lucene itself has a MergeIterator in o.a.l.util package.
>> > >
>> > > A MergePolicy can wrap a simple MergeIterator for iterating docs
>> across
>> > > different AtomicReaders in correct sort-order for a given field/term
>> > >
>> > > That should be fine right?
>> > >
>> > > --
>> > > Ravi
>> > >
>> > > --
>> > > Ravi
>> > >
>> > >
>> > > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera <[email protected]> wrote:
>> > >
>> > > > loadSortTerm is your method right? In the current Sorter.sort
>> > > > implementation, I see this code:
>> > > >
>> > > >     boolean sorted = true;
>> > > >     for (int i = 1; i < maxDoc; ++i) {
>> > > >       if (comparator.compare(i-1, i) > 0) {
>> > > >         sorted = false;
>> > > >         break;
>> > > >       }
>> > > >     }
>> > > >     if (sorted) {
>> > > >       return null;
>> > > >     }
>> > > >
>> > > > Perhaps you can write similar code?
>> > > >
>> > > > Also note that the sorting interface has changed, I think in 4.8,
>> and
>> > now
>> > > > you don't really need to implement a Sorter, but rather pass a
>> > SortField,
>> > > > if that works for you.
>> > > >
>> > > > Shai
>> > > >
>> > > >
>> > > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Shai,
>> > > > >
>> > > > > This is the code snippet I use inside my class...
>> > > > >
>> > > > > public class MySorter extends Sorter {
>> > > > >
>> > > > > @Override
>> > > > >
>> > > > > public DocMap sort(AtomicReader reader) throws IOException {
>> > > > >
>> > > > >   final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
>> > > > >
>> > > > >   final Sorter.DocComparator comparator = new
>> Sorter.DocComparator()
>> > {
>> > > > >
>> > > > >   @Override
>> > > > >
>> > > > >    public int compare(int docID1, int docID2) {
>> > > > >
>> > > > >       BytesRef v1 = docVsId.get(docID1);
>> > > > >
>> > > > >       BytesRef v2 = docVsId.get(docID2);
>> > > > >
>> > > > >        return v1.compareTo(v2);
>> > > > >
>> > > > >    }
>> > > > >
>> > > > >  };
>> > > > >
>> > > > >  return sort(reader.maxDoc(), comparator);
>> > > > >
>> > > > > }
>> > > > > }
>> > > > >
>> > > > > My Problem is, the "AtomicReader" passed to Sorter.sort method is
>> > > > actually
>> > > > > a SlowCompositeReader, composed of a list of AtomicReaders each of
>> > > which
>> > > > is
>> > > > > already sorted.
>> > > > >
>> > > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy
>> where
>> > it
>> > > > > tries to all load the doc-to-term mappings eagerly...
>> > > > >
>> > > > > Are there some alternatives for this?
>> > > > >
>> > > > > --
>> > > > > Ravi
>> > > > >
>> > > > >
>> > > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera <[email protected]>
>> > wrote:
>> > > > >
>> > > > > > I'm not sure that I follow ... where do you see DocMap being
>> loaded
>> > > up
>> > > > > > front? Specifically, Sorter.sort may return null of the readers
>> are
>> > > > > already
>> > > > > > sorted ... I think we already optimized for the case where the
>> > > readers
>> > > > > are
>> > > > > > sorted.
>> > > > > >
>> > > > > > Shai
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan <
>> > > > > > [email protected]> wrote:
>> > > > > >
>> > > > > > > I am planning to use SortingMergePolicy where all the
>> > > > > merge-participating
>> > > > > > > segments are already sorted... I understand that I need to
>> > define a
>> > > > > > DocMap
>> > > > > > > with old-new doc-id mappings.
>> > > > > > >
>> > > > > > > Is it possible to optimize the eager loading of DocMap and
>> make
>> > it
>> > > > kind
>> > > > > > of
>> > > > > > > lazy load on-demand?
>> > > > > > >
>> > > > > > > Ex: Pass List<AtomicReader> to the caller and ask for next
>> > new-old
>> > > > doc
>> > > > > > > mapping..
>> > > > > > >
>> > > > > > > Since my segments are already sorted, I could save on memory a
>> > > > > little-bit
>> > > > > > > this way, instead of loading the full DocMap upfront
>> > > > > > >
>> > > > > > > --
>> > > > > > > Ravi
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: SortingMergePolicy for already sorted segments

Reply via email to