As Erick said, can you tell us a bit more about the use case? There might be another way to achieve the same result.
What are these documents? Why do you need 1000 docs per user?

From: java-user@lucene.apache.org At: 10/09/20 14:25:02 To: java-user@lucene.apache.org
Subject: Re: Deduplication of search result with custom sort

6_500_000 is the total count of groups in the entire collection. I only return the top 1000 to users.

I use Lucene, and I have documents that can share the same docvalue; I want to deduplicate these documents by that docvalue during search. Also, I sort my documents by multiple fields, and because of this I can't use DiversifiedTopDocsCollector, which works with the relevance score only.

Fri, 9 Oct 2020 at 16:02, Erick Erickson <erickerick...@gmail.com>:

> This is going to be fairly painful. You need to keep a list 6.5M
> items long, sorted.
>
> Before diving in there, I’d really back up and ask what the use case
> is. Returning 6.5M docs to a user is useless, so are you doing
> some kind of analytics, maybe? In which case, and again
> assuming you’re using Solr, Streaming Aggregation might
> be a better option.
>
> This really sounds like an XY problem. You’re trying to solve problem X
> and asking how to accomplish it with Y. What I’m questioning
> is whether Y (grouping) is a good approach or not. Perhaps if
> you explained X there’d be a better suggestion.
>
> Best,
> Erick
>
>> On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emet...@gmail.com> wrote:
>>
>> I have 12_000_000 documents, 6_500_000 groups.
>>
>> With sort: it takes around 1 sec without grouping, 2 sec with grouping, and
>> 12 sec with setAllGroups(true).
>> Without sort: it takes around 0.2 sec without grouping, 0.6 sec with
>> grouping, and 10 sec with setAllGroups(true).
>>
>> Thank you, Erick, I will look into it.
>>
>> Fri, 9 Oct 2020 at 14:32, Erick Erickson <erickerick...@gmail.com>:
>>
>>> At the Solr level, CollapsingQParserPlugin; see:
>>> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>>>
>>> You could perhaps steal some ideas from that if you
>>> need this at the Lucene level.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net> wrote:
>>>>
>>>> Is the field that you are using to dedupe stored as a docvalue?
>>>>
>>>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To: java-user@lucene.apache.org
>>>> Subject: Deduplication of search result with custom sort
>>>>
>>>> Hi,
>>>> I need to deduplicate search results by a specific field and I have no idea
>>>> how to implement this properly.
>>>> I have tried grouping with setGroupDocsLimit(1) and it gives me the expected
>>>> results, but the performance is not very good.
>>>> I think that I need something like DiversifiedTopDocsCollector, but
>>>> suitable for collecting TopFieldDocs.
>>>> Is there any possibility to achieve deduplication with existing Lucene
>>>> components, or do I need to implement my own DiversifiedTopFieldsCollector?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
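[Editor's note: the behavior the thread is after — grouping with setGroupDocsLimit(1) under a multi-field sort — boils down to "keep only the best-sorting document per dedup key, then return the top N". Below is a minimal plain-Java sketch of that idea, with no Lucene dependency: the `Doc` record, the field names (`price`, `date`), and the dedup key are hypothetical stand-ins for a docvalue field and a multi-field Sort. Inside Lucene this would live in a custom collector (the proposed DiversifiedTopFieldsCollector); this sketch only illustrates the selection logic.]

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DedupTopN {
    // Hypothetical document: `key` stands in for the docvalue used to dedupe.
    record Doc(int id, String key, long price, long date) {}

    // Stand-in for a multi-field Sort: price ascending, then date descending.
    static final Comparator<Doc> SORT =
        Comparator.comparingLong(Doc::price)
                  .thenComparing(Comparator.comparingLong(Doc::date).reversed());

    static List<Doc> topNDeduped(List<Doc> docs, int n) {
        // Dedup step: keep only the best-sorting document for each key,
        // which is what grouping with setGroupDocsLimit(1) effectively does.
        Map<String, Doc> bestPerKey = new HashMap<>();
        for (Doc d : docs) {
            bestPerKey.merge(d.key(), d, (a, b) -> SORT.compare(a, b) <= 0 ? a : b);
        }
        // Then sort the surviving group heads and keep the top N.
        return bestPerKey.values().stream()
                         .sorted(SORT)
                         .limit(n)
                         .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(1, "A", 10, 100),
            new Doc(2, "A", 5, 200),   // wins key A: lower price
            new Doc(3, "B", 7, 50),
            new Doc(4, "B", 7, 300),   // wins key B: equal price, newer date
            new Doc(5, "C", 20, 10));
        topNDeduped(docs, 3).forEach(d -> System.out.println(d.id()));
    }
}
```

For comparison, the grouping approach the original post benchmarked is Lucene's GroupingSearch (lucene-grouping module), configured with the group field, a group sort, and setGroupDocsLimit(1) — correct, as the thread notes, but slow at 6.5M groups because it materializes per-group state rather than deduping inside a single top-N pass as sketched above.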