Re: EarlyTerminatingSortingCollector help needed..

Ravikumar Govindarajan Mon, 23 Jun 2014 06:57:34 -0700

>
> This means that even though you have eg. 15 segments, if you requested
> 50 documents, you will get the top 50 documents out of your
> TopHitsCollector.



Yes, we can get the top 50 docs in the end. I am not denying that.

Let me rephrase my question. Apologies if I was not clear.

How do we ensure a global sort order across all segments of the index
during search, when ETSC+SMP works only at the per-segment level?
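
A minimal sketch of the behaviour described in the quoted reply, in plain
Java with no Lucene classes. Everything here is an assumption made for
illustration: the class name EarlyTerminationSketch, the method topN, and
the use of plain ints as stand-ins for sort keys. Each segment is assumed
to be pre-sorted (the role SortingMergePolicy plays), collection stops
after n hits per segment, and a single shared queue keeps the best n hits
seen so far, so the final result is globally ordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class EarlyTerminationSketch {

    /**
     * Returns the n smallest sort keys across all segments, ascending.
     * Assumes every segment is already sorted ascending.
     */
    static List<Integer> topN(List<int[]> sortedSegments, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap shared by all segments: its head is the worst hit kept so far.
        PriorityQueue<Integer> queue = new PriorityQueue<>(Collections.reverseOrder());
        for (int[] segment : sortedSegments) {
            int collected = 0;
            for (int key : segment) {
                if (collected == n) {
                    break; // early termination: later docs in this segment cannot compete
                }
                collected++;
                if (queue.size() < n) {
                    queue.add(key);
                } else if (key < queue.peek()) {
                    queue.poll();   // evict the current worst hit
                    queue.add(key);
                }
            }
        }
        List<Integer> result = new ArrayList<>(queue);
        Collections.sort(result); // global order across all segments
        return result;
    }

    public static void main(String[] args) {
        List<int[]> segments = Arrays.asList(
                new int[] {1, 4, 9, 12},
                new int[] {2, 3, 20},
                new int[] {0, 5, 6, 7, 8});
        System.out.println(topN(segments, 3)); // prints [0, 1, 2]
    }
}
```

In this model at most n entries per segment are visited, but only n entries
in total are ever retained, which is why the result is a global top n rather
than n results per segment.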


> When wondering about stored fields vs doc values, the right trade-off
> is usually to use:
>  - stored fields when looking up several field values for a few documents,
>  - doc values when loading a few field values for many documents.


Thanks for the clarification. I will certainly move towards doc values.

--
Ravi


On Mon, Jun 23, 2014 at 5:36 PM, Adrien Grand <jpou...@gmail.com> wrote:

> On Sun, Jun 22, 2014 at 6:44 PM, Ravikumar Govindarajan
> <ravikumar.govindara...@gmail.com> wrote:
> > For a normal sorting-query, on a top-level searcher, I execute
> >
> > TopDocs docs = searcher.search(query, 50, sortField)
> >
> > Then I can issue reader.document() for final list of exactly 50 docs,
> > which gives me a global order across segments but at the obvious cost
> > of memory...
> >
> > SortingMergePolicy + ETSC will make me do 50*N [N=no.of.segments]
> > collects, which could increase cost of seeks when each segment collects
> > considerable hits...
>
> This is not correct. :) ETSC will collect segments one after another
> but in the end, what you will get are the top hits for all segments.
> This means that even though you have eg. 15 segments, if you requested
> 50 documents, you will get the top 50 documents out of your
> TopHitsCollector.
>
> >>  - you can afford the merging overhead (ie. for heavy indexing
> >> workloads, this might not be the best solution)
> >>  - there is a single sort order that is used for most queries
> >>  - you don't need any feature that requires to collect all documents
> >> (like computing the total hit count or facets).
> >
> >
> > Our use-case fits perfectly on all these 3 points and that's why we
> > wanted to explore this. But our final set of results must also be
> > globally ordered. Maybe it's a mistake to assume that Sorting can be
> > entirely replaced with SMP + ETSC...
>
> I don't think it is a mistake, this can help make the execution of
> search requests significantly faster.
>
> >> I would not advise to use the stored fields API, even in the context
> >> of early termination. Doc values should be more efficient here?
> >
> >
> > I read your excellent blog on stored-fields compression, where you've
> > mentioned that stored-fields now take only one random seek. [
> > http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
> > ]
> >
> > If so, then what could make DocValues still a winner?
>
> Yes. If you use eg. 2 doc values fields to run your query, it is true
> that the number of seeks in the worst case would be 2 for doc values
> and only 1 for stored fields, so stored fields might look more
> appropriate. However, doc values play much better with the operating
> system thanks to column-stride storage since:
>  - it allows for lightweight and efficient compression,
>  - the filesystem cache doesn't get loaded on field values that you
> are not interested in.
>
> When wondering about stored fields vs doc values, the right trade-off
> is usually to use:
>  - stored fields when looking up several field values for a few documents,
>  - doc values when loading a few field values for many documents.
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
