My learnings dealing with this problem

We faced a similar problem before, and did the following things:

1) Don't request totalGroupCount, and the response will be fast, since
computing the group count is an expensive task (if you can live without
the group count). You can still paginate without it: the group count is
at most the total document count, so you simply stop paginating once a
page comes back empty.
2) Use more shards, so you can get the best out of parallel execution.
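Point 1 can be sketched at the Lucene level with the grouping module. This is a minimal sketch, not our production code: it assumes lucene-core and lucene-grouping on the classpath (8.x-era API), and the `dedupKey` and `rank` field names are made up for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.BytesRef;

public class DedupSearchSketch {

    // Tiny in-memory index: two docs share dedupKey "a", one has "b".
    static IndexSearcher buildSearcher() throws Exception {
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
            long rank = 0;
            for (String key : new String[] {"a", "a", "b"}) {
                Document doc = new Document();
                doc.add(new SortedDocValuesField("dedupKey", new BytesRef(key)));
                doc.add(new NumericDocValuesField("rank", rank++));
                w.addDocument(doc);
            }
        }
        return new IndexSearcher(DirectoryReader.open(dir));
    }

    // Paginates until an empty page comes back; returns how many deduped
    // groups were seen in total. Never asks for the total group count.
    static int countGroups(IndexSearcher searcher, int pageSize) throws Exception {
        Sort customSort = new Sort(new SortField("rank", SortField.Type.LONG));
        GroupingSearch grouping = new GroupingSearch("dedupKey");
        grouping.setGroupSort(customSort);       // your custom sort goes here
        grouping.setSortWithinGroup(customSort);
        grouping.setGroupDocsLimit(1);           // one document per group = dedup
        grouping.setAllGroups(false);            // skip the expensive group count
        int total = 0;
        for (int offset = 0; ; offset += pageSize) {
            TopGroups<BytesRef> page =
                grouping.search(searcher, new MatchAllDocsQuery(), offset, pageSize);
            if (page.groups.length == 0) break;  // empty page: stop paginating
            total += page.groups.length;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Two distinct keys ("a", "b") survive deduplication.
        System.out.println(countGroups(buildSearcher(), 10));
    }
}
```

The only thing given up versus setAllGroups(true) is knowing the total group count up front; the pagination loop discovers the end by itself.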

I have seen use cases deduplicating on a doc-values field across 60M
total documents, with 4 shards.

The query-time SLA is around 5-6 seconds, which is not unbearable for users.

Let me know if you find a better solution.

On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net> wrote:

> As Erick said, can you tell us a bit more about the use case?
> There might be another way to achieve the same result.
>
> What are these documents?
> Why do you need 1000 docs per user?
>
>
> From: java-user@lucene.apache.org At: 10/09/20 14:25:02To:
> java-user@lucene.apache.org
> Subject: Re: Deduplication of search result with custom sort
>
> 6_500_000 is the total count of groups in the entire collection. I only
> return the top 1000 to users.
> I use Lucene where I have documents that can have the same docvalue, and I
> want to deduplicate these documents by this docvalue during search.
> Also, I sort my documents by multiple fields, and because of this I can't
> use DiversifiedTopDocsCollector, which works with relevance score only.
>
> On Fri, Oct 9, 2020 at 16:02, Erick Erickson <erickerick...@gmail.com>:
>
> > This is going to be fairly painful. You need to keep a list 6.5M
> > items long, sorted.
> >
> > Before diving in there, I’d really back up and ask what the use-case
> > is. Returning 6.5M docs to a user is useless, so are you doing
> > some kind of analytics maybe? In which case, and again
> > assuming you’re using Solr, Streaming Aggregation might
> > be a better option.
> >
> > This really sounds like an XY problem. You’re trying to solve problem X
> > and asking how to accomplish it with Y. What I’m questioning
> > is whether Y (grouping) is a good approach or not. Perhaps if
> > you explained X there’d be a better suggestion.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emet...@gmail.com> wrote:
> > >
> > > I have 12_000_000 documents, 6_500_000 groups
> > >
> > > With sort: It takes around 1 sec without grouping, 2 sec with grouping
> > and
> > > 12 sec with setAllGroups(true)
> > > Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> > > grouping and 10 sec with setAllGroups(true)
> > >
> > > Thank you, Erick, I will look into it
> > >
> > > On Fri, Oct 9, 2020 at 14:32, Erick Erickson <erickerick...@gmail.com>:
> > >
> > >> At the Solr level, CollapsingQParserPlugin see:
> > >>
> >
> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> > >>
> > >> You could perhaps steal some ideas from that if you
> > >> need this at the Lucene level.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > >> dceccarel...@bloomberg.net> wrote:
> > >>>
> > >>> Is the field that you are using to dedupe stored as a docvalue?
> > >>>
> > >>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04To:
> > >> java-user@lucene.apache.org
> > >>> Subject: Deduplication of search result with custom sort
> > >>>
> > >>> Hi,
> > >>> I need to deduplicate search results by a specific field and I have
> > >>> no idea how to implement this properly.
> > >>> I have tried grouping with setGroupDocsLimit(1) and it gives me the
> > >>> expected results, but the performance is not very good.
> > >>> I think that I need something like DiversifiedTopDocsCollector, but
> > >>> suitable for collecting TopFieldDocs.
> > >>> Is there any way to achieve deduplication with existing Lucene
> > >>> components, or do I need to implement my own
> > >>> DiversifiedTopFieldsCollector?
> > >>>
> > >>>
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>
> > >>
> >
> >
> >
> >
>
>
>
