This is going to be fairly painful. You need to keep a list 6.5M items long, sorted.
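To make that cost concrete, here is a minimal plain-Java sketch of what deduplicating under a sort implies: one retained entry per distinct group, all of which then have to be ordered. This is not Lucene's grouping code, and `Hit`, `groupKey`, `sortValue` and `topNDistinct` are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DedupSketch {

    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final int docId;
        final String groupKey;   // value of the field used for deduplication
        final long sortValue;    // value of the custom sort field (ascending)

        Hit(int docId, String groupKey, long sortValue) {
            this.docId = docId;
            this.groupKey = groupKey;
            this.sortValue = sortValue;
        }
    }

    /** Keeps the best-sorted hit of every group, then returns the first n of those. */
    static List<Hit> topNDistinct(List<Hit> hits, int n) {
        // One map entry per distinct group: 6.5M groups means 6.5M entries.
        Map<String, Hit> bestPerGroup = new HashMap<>();
        for (Hit h : hits) {
            // Replace the stored hit only if the new one sorts strictly earlier.
            bestPerGroup.merge(h.groupKey, h,
                    (current, candidate) ->
                            candidate.sortValue < current.sortValue ? candidate : current);
        }

        // Every group head has to be ordered before the top n can be cut off.
        List<Hit> heads = new ArrayList<>(bestPerGroup.values());
        heads.sort(Comparator.<Hit>comparingLong(h -> h.sortValue)
                .thenComparingInt(h -> h.docId));
        return new ArrayList<>(heads.subList(0, Math.min(n, heads.size())));
    }
}
```

In this sketch the map grows with the number of distinct groups rather than with the number of results requested, so asking for only ten hits does not make it cheaper.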
Before diving in there, I’d really back up and ask what the use-case is. Returning 6.5M docs to a user is useless, so are you doing some kind of analytics, maybe? In which case, and again assuming you’re using Solr, Streaming Aggregation might be a better option.

This really sounds like an XY problem. You’re trying to solve problem X and asking how to accomplish it with Y. What I’m questioning is whether Y (grouping) is a good approach or not. Perhaps if you explained X there’d be a better suggestion.

Best,
Erick

> On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emet...@gmail.com> wrote:
>
> I have 12_000_000 documents, 6_500_000 groups
>
> With sort: It takes around 1 sec without grouping, 2 sec with grouping and
> 12 sec with setAllGroups(true)
> Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> grouping and 10 sec with setAllGroups(true)
>
> Thank you, Erick, I will look into it
>
> On Fri, Oct 9, 2020 at 14:32, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> At the Solr level, CollapsingQParserPlugin see:
>> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>>
>> You could perhaps steal some ideas from that if you
>> need this at the Lucene level.
>>
>> Best,
>> Erick
>>
>>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net> wrote:
>>>
>>> Is the field that you are using to dedupe stored as a docvalue?
>>>
>>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To: java-user@lucene.apache.org
>>> Subject: Deduplication of search result with custom with custom sort
>>>
>>> Hi,
>>> I need to deduplicate search results by specific field and I have no idea
>>> how to implement this properly.
>>> I have tried grouping with setGroupDocsLimit(1) and it gives me expected
>>> results, but has not very good performance.
>>> I think that I need something like DiversifiedTopDocsCollector, but
>>> suitable for collecting TopFieldDocs.
>>> Is there any possibility to achieve deduplication with existing lucene
>>> components, or do I need to implement my own
>>> DiversifiedTopFieldsCollector?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org