I studied the Las Vegas patch and came away with one simple idea. FirstPassGroupingCollector collects CollectedSearchGroup objects inside itself, and CollectedSearchGroup holds the doc id and the sortValues. This is exactly what I need. Thanks for the help!
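Roughly what I have in mind is a first-pass-only collection, skipping the second pass and any group counting. This is just a sketch against the Lucene 8.x grouping API (method signatures shift a bit between versions); "videoHash" and the sort fields are placeholders for my schema:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.grouping.FirstPassGroupingCollector;
import org.apache.lucene.search.grouping.SearchGroup;
import org.apache.lucene.search.grouping.TermGroupSelector;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.Collection;

// One-pass "grouping" used only for dedup: collect the top 1000 groups
// ordered by the custom sort, one best doc per group.
static Collection<SearchGroup<BytesRef>> topUniqueGroups(IndexSearcher searcher, Query query)
    throws IOException {
  // Placeholder sort; the real query sorts by several fields.
  Sort sort = new Sort(
      new SortField("popularity", SortField.Type.LONG, true),
      new SortField("title", SortField.Type.STRING));

  // "videoHash" is assumed to be indexed as a SortedDocValuesField.
  FirstPassGroupingCollector<BytesRef> firstPass =
      new FirstPassGroupingCollector<>(new TermGroupSelector("videoHash"), sort, 1000);
  searcher.search(query, firstPass);

  // One SearchGroup per distinct hash, already in the requested sort order;
  // sortValues holds the sort fields of the best doc of each group.
  // The per-group doc id (CollectedSearchGroup.topDoc) is not exposed here,
  // so I will probably have to extend or adapt the collector to get at it.
  return firstPass.getTopGroups(0, true);
}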
Mon, 12 Oct 2020 at 17:38, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net>:

> https://issues.apache.org/jira/browse/SOLR-11831 I collaborated on the Las
> Vegas patch; I don't think that patch will be merged - it modifies too many
> things in the core - we ended up reimplementing it as a standalone plugin.
> Also keep in mind that the patch makes a difference only if you are
> using SolrCloud, while it seems that you are using Lucene.
>
> Do you really need to return 1000 results to the user? Is this for paging
> purposes?
>
> Do you know how frequent the groups are? If they are not too frequent and
> you are not strict on 1000, you might retrieve more, let's say 2000, without
> grouping and then do the deduping after.
>
> Cheers,
> Diego
>
> From: java-user@lucene.apache.org At: 10/12/20 13:02:46 To: java-user@lucene.apache.org
> Subject: Re: Deduplication of search result with custom with custom sort
>
> Thank you very much for helping!
>
> There isn't much I can add about my use case. I have user-generated video
> titles and hash codes by which I can tell that these are the same videos.
> Users search videos by title and I should return the top 1000 unique
> videos to them.
>
> I will try to use grouping without counting groups. Otherwise I'll look
> here https://issues.apache.org/jira/browse/SOLR-11831 or here
> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>
> Thanks again!
>
> Fri, 9 Oct 2020 at 18:57, Jigar Shah <jigaronl...@gmail.com>:
>
> > My learnings from dealing with this problem:
> >
> > We faced a similar problem before, and did the following things:
> >
> > 1) Don't request totalGroupCount, and the response was fast, as computing
> > the group count is an expensive task - if you can live without groupCount.
> > You can approximate pagination up to the total count (the group count will
> > be less), and stop paginating when you get empty results.
> > 2) Have more shards, so you can get the best out of parallel execution.
> >
> > I have seen use cases of 60M total documents deduplicated on a doc values
> > field, with 4 shards.
> >
> > Query time SLA is around 5-6 seconds. Not unbearable for users.
> >
> > Let me know if you find a better solution.
> >
> > On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON)
> > <dceccarel...@bloomberg.net> wrote:
> >
> > > As Erick said, can you tell us a bit more about the use case?
> > > There might be another way to achieve the same result.
> > >
> > > What are these documents?
> > > Why do you need 1000 docs per user?
> > >
> > > From: java-user@lucene.apache.org At: 10/09/20 14:25:02 To: java-user@lucene.apache.org
> > > Subject: Re: Deduplication of search result with custom with custom sort
> > >
> > > 6_500_000 is the total count of groups in the entire collection. I only
> > > return the top 1000 to users.
> > > I use Lucene where I have documents that can share the same docvalue, and
> > > I want to deduplicate these documents by this docvalue during search.
> > > Also, I sort my documents by multiple fields, and because of this I can't
> > > use DiversifiedTopDocsCollector, which works with the relevance score only.
> > >
> > > Fri, 9 Oct 2020 at 16:02, Erick Erickson <erickerick...@gmail.com>:
> > >
> > > > This is going to be fairly painful. You need to keep a list 6.5M
> > > > items long, sorted.
> > > >
> > > > Before diving in there, I'd really back up and ask what the use case
> > > > is. Returning 6.5M docs to a user is useless, so maybe you're doing
> > > > some kind of analytics? In which case, and again assuming you're
> > > > using Solr, Streaming Aggregation might be a better option.
> > > >
> > > > This really sounds like an XY problem. You're trying to solve problem X
> > > > and asking how to accomplish it with Y. What I'm questioning
> > > > is whether Y (grouping) is a good approach or not. Perhaps if
> > > > you explained X there'd be a better suggestion.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emet...@gmail.com> wrote:
> > > > >
> > > > > I have 12_000_000 documents, 6_500_000 groups.
> > > > >
> > > > > With sort: it takes around 1 sec without grouping, 2 sec with grouping
> > > > > and 12 sec with setAllGroups(true).
> > > > > Without sort: it takes around 0.2 sec without grouping, 0.6 sec with
> > > > > grouping and 10 sec with setAllGroups(true).
> > > > >
> > > > > Thank you, Erick, I will look into it.
> > > > >
> > > > > Fri, 9 Oct 2020 at 14:32, Erick Erickson <erickerick...@gmail.com>:
> > > > >
> > > > > > At the Solr level, CollapsingQParserPlugin, see:
> > > > > > https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> > > > > >
> > > > > > You could perhaps steal some ideas from that if you
> > > > > > need this at the Lucene level.
> > > > > >
> > > > > > Best,
> > > > > > Erick
> > > > > >
> > > > > > > On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON)
> > > > > > > <dceccarel...@bloomberg.net> wrote:
> > > > > > >
> > > > > > > Is the field that you are using to dedupe stored as a docvalue?
> > > > > > >
> > > > > > > From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To: java-user@lucene.apache.org
> > > > > > > Subject: Deduplication of search result with custom with custom sort
> > > > > > >
> > > > > > > Hi,
> > > > > > > I need to deduplicate search results by a specific field and I have
> > > > > > > no idea how to implement this properly.
> > > > > > > I have tried grouping with setGroupDocsLimit(1) and it gives me the
> > > > > > > expected results, but does not have very good performance.
> > > > > > > I think that I need something like DiversifiedTopDocsCollector, but
> > > > > > > suitable for collecting TopFieldDocs.
> > > > > > > Is there any possibility to achieve deduplication with existing
> > > > > > > Lucene components, or do I need to implement my own
> > > > > > > DiversifiedTopFieldsCollector?
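P.S. For the archives, Diego's "retrieve more without grouping and dedupe after" fallback would look roughly like this in plain Lucene. This is only a sketch; the field names are placeholders and the hash is assumed to be stored as a SortedDocValuesField:

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Overfetch with the custom sort, then keep only the first hit per hash
// value until `wanted` unique hits are collected (no grouping involved).
static List<ScoreDoc> searchDeduped(IndexSearcher searcher, Query query, Sort sort,
                                    String dedupField, int wanted, int overfetch)
    throws IOException {
  TopDocs hits = searcher.search(query, overfetch, sort);   // e.g. overfetch = 2 * wanted
  List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();

  Set<BytesRef> seen = new HashSet<>();
  List<ScoreDoc> unique = new ArrayList<>(wanted);
  for (ScoreDoc sd : hits.scoreDocs) {
    // Locate the segment of this global doc id and read its hash value.
    LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(sd.doc, leaves));
    // Hits are not in doc-id order, so take a fresh doc values iterator per hit;
    // caching per-leaf ordinals would be the obvious optimisation if this is hot.
    SortedDocValues dv = DocValues.getSorted(leaf.reader(), dedupField);
    if (!dv.advanceExact(sd.doc - leaf.docBase)) {
      unique.add(sd);                       // no hash value: keep the doc as-is
    } else if (seen.add(BytesRef.deepCopyOf(dv.lookupOrd(dv.ordValue())))) {
      unique.add(sd);                       // first doc seen with this hash
    }
    if (unique.size() == wanted) {
      break;
    }
  }
  return unique;
}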