Collapse will return dupes unless you use the _route_ parameter to co-locate
documents with the same group on the same shard.
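
To make the co-location idea concrete, here is a rough sketch. It assumes the
collection uses the compositeId router, where an ID of the form "key!docid"
is routed by the part before the "!". The helper names and the crc32 hash are
made up for illustration only; Solr's own router uses MurmurHash3 and hash
ranges, so the shard numbers below are not what Solr would assign.

```python
import zlib


def composite_id(group_value, doc_id):
    """Build an ID of the form 'group!doc' (the compositeId prefix convention)."""
    return f"{group_value}!{doc_id}"


def toy_shard_for(doc_id, num_shards):
    """Toy stand-in for the router: hash only the prefix before '!'.

    Assumption for illustration: crc32 modulo the shard count. Solr itself
    uses MurmurHash3 and per-shard hash ranges.
    """
    prefix = doc_id.split("!", 1)[0]
    return zlib.crc32(prefix.encode("utf-8")) % num_shards


ids = [
    composite_id("01DB  METRAVIB SAS", "doc1"),
    composite_id("01DB  METRAVIB SAS", "doc2"),
    composite_id("1 BIG SELF STORE LTD", "doc3"),
]
for doc_id in ids:
    print(doc_id, "-> shard", toy_shard_for(doc_id, 2))
```

Because only the prefix is hashed, every document for a given vendor lands on
the same shard, so a per-shard collapse on that field cannot produce a dupe
across shards. It only helps when there is a single grouping field to route
on.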

In your scenario, co-locating docs sounds like it won't work because you
may have different grouping criteria.

The doc counts would be inflated unless you sent all the documents from the
shards to be merged and then de-duped them, which is how streaming
operates. But streaming can do these types of operations in parallel, and
the merge strategy cannot.
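
As a rough illustration of the merge-then-de-dupe step, here is a sketch using
the A/B/D and B/C/D example from the quoted message below. It is plain Python
rather than Solr code, the function name is hypothetical, and it assumes each
shard's list is already sorted by the group key.

```python
import heapq
from itertools import groupby


def merge_and_dedupe(shard_results, rows):
    """Merge per-shard sorted lists and keep one entry per group key.

    Returns (first `rows` unique keys, total number of unique keys).
    """
    merged = heapq.merge(*shard_results)          # k-way merge of sorted inputs
    unique = [key for key, _ in groupby(merged)]  # adjacent equal keys collapse
    return unique[:rows], len(unique)


shard1 = ["A", "B", "D"]
shard2 = ["B", "C", "D"]

docs, unique_count = merge_and_dedupe([shard1, shard2], rows=10)
inflated_count = len(shard1) + len(shard2)  # what summing per-shard counts gives

print(docs)            # ['A', 'B', 'C', 'D']
print(unique_count)    # 4
print(inflated_count)  # 6
```

The unique count is only exact because the merge sees every collapsed document
from every shard. If each shard sends back just its top rows, the count stays
a guess.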

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Aug 4, 2016 at 6:04 PM, tedsolr <tsm...@sciquest.com> wrote:

> Perhaps my assumptions about merge are wrong. When I run a search with the
> collapsing filter (q=*:*&fq={!collapse field=VENDOR_NAME}...) I get "dupes"
> if the same VENDOR_NAME is on shard1 and shard2. Here's the response:
>
> "response": {
>     "numFound": 24158,
>     "start": 0,
>     "docs": [
>       {
>         "VENDOR_NAME": "01DB  METRAVIB SAS",
>         "[shard]": "http://localhost:8983/solr/ShardTest1_shard1_0_replica1/|http://localhost:8984/solr/ShardTest1_shard1_0_replica2/"
>       },
>       {
>         "VENDOR_NAME": "01DB  METRAVIB SAS",
>         "[shard]": "http://localhost:8983/solr/ShardTest1_shard1_1_replica1/|http://localhost:8984/solr/ShardTest1_shard1_1_replica2/"
>       },
>       {
>         "VENDOR_NAME": "1 BIG SELF STORE LTD",
>         "[shard]": "http://localhost:8983/solr/ShardTest1_shard1_0_replica1/|http://localhost:8984/solr/ShardTest1_shard1_0_replica2/"
>       }
>     ]
>   }
>
> You can see the same vendor is returned from shard1_1 and shard1_0. So I'm
> expecting the same results from my plugin (once I get it to work). I thought
> the merge strategy could be used to filter out the "duplicate" vendor. So
> would that require rebuilding the document list and then replacing the solr
> response like shardResponse.setSolrResponse()?
>
> And if that is the correct approach, I could return many more results than
> the user expected. If I'm thinking correctly, then the worst case is no "dupes"
> between the shards and the returned result count is rows X shards. To make
> sure the correct results are returned based on the sort I'll also have to
> re-sort the merged results. So for a search like q=*:*&fl=vendor&sort=vendor
> asc...
>
> results example:
> shard 1 docs: { A, B, D }
> shard 2 docs: { B, C, D }
>
> So walking through the solr responses for each shard I end up with a return
> set of { A, B, C, D }
>
>
> Joel Bernstein wrote
> > Can you describe more about what you're trying to do in the merge? Why does
> > it seem it's too late to drop documents in the merge?
> >
> > If you can provide a very simple example with some sample records and a
> > sample output, that would be helpful.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Aug 4, 2016 at 4:25 PM, tedsolr <tsmith@> wrote:
> >
> >> I've been struggling just to get my search plugin working for sharded
> >> collections, but I haven't ascertained if my end goal is even achievable. I
> >> have a plugin that groups documents that are considered duplicates (based on
> >> multiple fields - like the CollapsingQParserPlugin). When responses come
> >> back from different shards another culling will be necessary to remove dupes
> >> between the shards. In the merge() method it seems it will be too late to
> >> simply "drop" documents. Is this something that the client will just have to
> >> deal with? Maybe in the process() method of a search component? I was
> >> expecting to be able to preserve the requested return count, but that seems
> >> really unlikely now.
> >>
> >> Thanks for any suggestions,
> >> Ted v5.2.1
> >>
> >>
> >>
> >> --
> >> View this message in context: http://lucene.472066.n3.nabble.com/Can-a-MergeStrategy-filter-returned-docs-tp4290446.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-a-MergeStrategy-filter-returned-docs-tp4290446p4290458.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
