Hello,

I'm hitting a performance issue when using field collapsing in a
distributed Solr setup, and I'm wondering whether others have seen it and
whether anyone has an idea for a workaround.

I'm using field collapsing to deduplicate documents that have the same near
duplicate hash value, and deduplicating at query time (as opposed to
filtering at index time) is a requirement.  I have a sharded setup with 10
cores (not SolrCloud), each holding ~1000 documents.  Of the 10k docs,
most have a unique near duplicate hash value, so there are about 10k unique
values for the field that I'm grouping on.  The grouping parameters that
I'm using are:

group=true
group.field=<near dupe hash field>
group.main=true

I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
difference is the absence or presence of these three grouping parameters
and I'm consistently seeing a marked difference in performance (as a
representative data point, 200ms latency without grouping and 1600ms with
grouping).  Interestingly, if I put all 10k docs on the same core and query
that core independently with and without grouping, I don't see much of a
latency difference, so the performance degradation seems to exist only in
the sharded setup.
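
To make the comparison concrete, here is a minimal sketch of how the two
requests differ.  The host names, core names, and the field name
`near_dupe_hash` are placeholders for illustration, not my actual
configuration; only the `shards`, `group`, `group.field`, and `group.main`
parameters come from the setup described above.

```python
from urllib.parse import urlencode


def build_query_url(base_url, shards, q="*:*", group_field=None, rows=10):
    """Build a distributed Solr select URL, optionally with field collapsing.

    base_url, shards, and group_field are placeholder values supplied by the
    caller; nothing here contacts a server.
    """
    params = [
        ("q", q),
        ("rows", rows),
        ("wt", "json"),
        # Comma-separated shard list, as in &shards=s1,s2,...,s10
        ("shards", ",".join(shards)),
    ]
    if group_field is not None:
        # The three grouping parameters are the only difference between
        # the fast and the slow request.
        params += [
            ("group", "true"),
            ("group.field", group_field),
            ("group.main", "true"),
        ]
    return base_url + "?" + urlencode(params)


# Hypothetical shard addresses: solr1..solr10, one core each.
shards = [f"solr{i}:8983/solr/core{i}" for i in range(1, 11)]
base = "http://solr1:8983/solr/core1/select"

plain_url = build_query_url(base, shards)
grouped_url = build_query_url(base, shards, group_field="near_dupe_hash")
```

Timing `plain_url` against `grouped_url` with any HTTP client is the
comparison that gives me roughly 200ms versus 1600ms.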

Is there a known performance issue when field collapsing in a sharded setup
(perhaps one that only manifests when the grouping field has many unique values), or
have other people observed this?  Any ideas for a workaround?  Note that
docs in my sharded setup can only have the same signature if they're in the
same shard, so perhaps that can be used to boost perf, though I don't see
an exposed way to do so.
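
One client-side approach I can imagine, purely as a sketch and not
something Solr exposes as far as I know: query each core on its own with
grouping enabled (no `shards` parameter), request the `score` field, and
merge the already-collapsed lists myself.  This relies on the assumption
above that duplicates never span shards, and it ignores the fact that
scores computed on separate cores may not be strictly comparable.  The
function name and document shape below are hypothetical.

```python
import heapq
from itertools import chain


def merge_shard_results(per_shard_docs, rows=10):
    """Merge per-shard, already-collapsed doc lists into a global top `rows`.

    per_shard_docs: one list per shard, each doc a dict with a "score" key.
    Assumes no group value occurs on more than one shard, so no further
    deduplication is needed after merging.  For paging, each shard would
    need to return at least start + rows docs.
    """
    all_docs = chain.from_iterable(per_shard_docs)
    return heapq.nlargest(rows, all_docs, key=lambda d: d["score"])


# Toy data standing in for two shards' grouped responses.
shard_a = [{"id": "a1", "score": 2.0}, {"id": "a2", "score": 0.5}]
shard_b = [{"id": "b1", "score": 1.5}]
top_two = merge_shard_results([shard_a, shard_b], rows=2)
```

Whether this would actually beat the built-in distributed grouping is
something I haven't measured.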

A follow-on question is whether we're likely to see the same issue if /
when we move to SolrCloud.

Thanks,
Dave
