(1) It doesn't matter whether it "affects only the segments being merged". You can't get accurate information out of the index if different segments have different expectations for the same field.
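You can see that inconsistency directly by hitting each replica of the shard with &distrib=false and comparing facet.method=fc against facet.method=enum. Something like this (server, port, and core name are placeholders, of course):

http://solr_server:port/solr/collection1_shard1_replica1/select?q=*:*&rows=0&distrib=false&facet=true&facet.field=market&facet.method=fc

and then the same URL with facet.method=enum. The enum method walks the indexed terms, while fc reads docValues (or the field cache), so a core whose segments disagree about docValues for "market" should show the two methods returning different counts.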
(2) I strongly doubt it. The problem is that the "tainted" segments' meta-data is still read when merging. If a segment consisted of _only_ deleted documents you'd probably lose it, but it'll be re-merged long before it consists exclusively of deleted documents.

Really, you have to re-index to be sure. I suspect you can find some way to do that faster than exploring undefined behavior and hoping. If you can re-index _anywhere_ into a collection with the same number of shards you can get this done. It'll take some tricky dancing, but:

0> Copy one index directory from each shard someplace safe.
1> Re-index somewhere; single-replica will do.
2> Delete all replicas except one for your current collection.
3> Issue a fetchindex command for each remaining replica in the old collection, pulling the index "from the right place" in the new collection (example calls below). It's important that only a single replica per shard be active at this point. These two collections do _not_ need to be part of the same SolrCloud; the fetchindex command just takes a URL of the core to fetch from.
4> Add the replicas back and let them replicate.

Your installation would be unavailable for searching during steps 2-4 of course.
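To make steps 2-4 concrete, the calls look roughly like the following. Host names, ports, and the collection/core/replica names here are all made up; pull the real core names and core_node names from CLUSTERSTATUS or the admin UI first.

Step 2, once for each extra replica:
http://oldhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=oldcollection&shard=shard1&replica=core_node2

Step 3, on the surviving replica of each shard, pointing at the matching shard of the new collection (I believe masterUrl wants the /replication endpoint of the source core, but double-check):
http://oldhost:8983/solr/oldcollection_shard1_replica1/replication?command=fetchindex&masterUrl=http://newhost:8983/solr/newcollection_shard1_replica1/replication

Step 4, to put the replicas back where they were:
http://oldhost:8983/solr/admin/collections?action=ADDREPLICA&collection=oldcollection&shard=shard1&node=oldhost2:8983_solr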
Best,
Erick

On Thu, Oct 12, 2017 at 9:01 AM, Chris Ulicny <culicny@iq.media> wrote:
> We tested the query on all replicas for the given shard, and they all have the same issue, so deleting and adding another replica won't fix the problem since the leader is exhibiting the behavior as well. I believe the second replica was moved (new one added, old one deleted) between nodes and so was just a copy of the leader's index after the problematic merge happened.
>
> bq: Anything that didn't merge old segments, just threw them away when empty (which was my idea) would possibly require as much disk space as the index currently occupied, so doesn't help your disk-constrained situation.
>
> Something like this was originally what I thought might fix the issue. If we reindex the data for the affected shard, it would possibly delete all docs from the old segments and just drop them instead of merging them. As mentioned, you'd expect the problems to persist through subsequent merges. So I've got two questions:
>
> 1) If the problem persists through merges, does it only affect the segments being merged, so that when Solr goes looking for the values it comes up empty? Rather than all segments being affected by a single merge they weren't a part of.
>
> 2) Is it expected that any large tainted segments will eventually merge with clean segments, resulting in more tainted segments, as enough docs are deleted from the large segments?
>
> Also, we aren't as disk constrained as previously. Reindexing a subset of docs is possible, but a full, clean reindex of the collection isn't.
>
> Thanks,
> Chris
>
> On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Never mind. Anything that didn't merge old segments, just threw them away when empty (which was my idea), would possibly require as much disk space as the index currently occupies, so it doesn't help your disk-constrained situation.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > If it's _only_ on a particular replica, here's what you could do: just DELETEREPLICA it, then ADDREPLICA to bring it back. You can set the "node" parameter on ADDREPLICA to get it back on the same node. Then the normal replication process would pull the entire index down from the leader.
>> >
>> > My bet, though, is that this wouldn't really fix things. While it fixes the particular case you've noticed, I'd guess others would pop up. You can see which replicas return what by firing individual queries at the particular replica in question with &distrib=false, something like:
>> >
>> > solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah blah blah
>> >
>> > bq: It is exceedingly unfortunate that reindexing the data on that shard only probably won't end up fixing the problem
>> >
>> > Well, we've been working on the DWIM (Do What I Mean) feature for years, but progress has stalled.
>> >
>> > How would that work? You have two segments with vastly different characteristics for a field. You could change the type, the multiValued-ness, the analysis chain; there's no end to the things that could go wrong. Fixing them actually _is_ impossible given how Lucene is structured.
>> >
>> > Hmmmm, you've now given me a brainstorm I'll suggest in the JIRA system after I talk to the dev list....
>> >
>> > Consider indexed=true stored=false. After stemming, "running" can be indexed as "run". At merge time you have no way of knowing that "running" was the original term, so you simply couldn't fix it on merge, not to mention that the performance penalty would be... er... severe.
>> >
>> > Best,
>> > Erick
>> >
>> > On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny <culicny@iq.media> wrote:
>> >> I thought that decision would come back to bite us somehow. At the time, we didn't have enough space available to do a fresh reindex alongside the old collection, so the only course of action available was to index over the old one, and the vast majority of its use worked as expected.
>> >>
>> >> We're planning on upgrading to version 7 at some point in the near future and will have enough space to do a full, clean reindex at that time.
>> >>
>> >> bq: This can propagate through all following segment merges IIUC.
>> >>
>> >> It is exceedingly unfortunate that reindexing the data on that shard only probably won't end up fixing the problem.
>> >>
>> >> Out of curiosity, are there any good write-ups or documentation on how two (or more) Lucene segments are merged, or is it just worth looking at the source code to figure that out?
>> >>
>> >> Thanks,
>> >> Chris
>> >>
>> >> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson <erickerick...@gmail.com> wrote:
>> >>
>> >>> bq: ...but the collection wasn't emptied first....
>> >>>
>> >>> This is what I'd suspect is the problem. Here's the issue: segments aren't merged identically on all replicas. So at some point you had this field indexed without docValues, changed that, and re-indexed. But the segment merging could "read" the first segment it's going to merge and think it knows about docValues for that field, when in fact that segment had the old (non-docValues) definition.
>> >>>
>> >>> This would not necessarily be the same on all replicas, even on the _same_ shard.
>> >>>
>> >>> This can propagate through all following segment merges, IIUC.
>> >>>
>> >>> So my bet is that if you index into a new collection, everything will be fine.
>> >>> You can also just delete everything first, but I usually prefer a new collection so I'm absolutely and positively sure that the above can't happen.
>> >>>
>> >>> Best,
>> >>> Erick
>> >>>
>> >>> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny <culicny@iq.media> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > We've run into a strange issue with our deployment of SolrCloud 6.3.0. Essentially, a standard facet query on a string field usually comes back empty when it shouldn't. However, every now and again the query actually returns the correct values. This is only affecting a single shard in our setup.
>> >>> >
>> >>> > The behavior pattern generally looks like this: the query works properly when it hasn't been run recently, then returns nothing once the query seems to have been cached (< 50ms QTime). Wait a while and you get the correct result, followed by blanks. It doesn't matter which replica of the shard is queried; the results are the same.
>> >>> >
>> >>> > The general query in question looks like:
>> >>> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>
>> >>> >
>> >>> > The field is defined in the schema as <field name="market" type="string" docValues="true"/>
>> >>> >
>> >>> > There are numerous other fields defined similarly, and they do not exhibit the same behavior when used as the facet.field value. They consistently return the right results on the shard in question.
>> >>> >
>> >>> > If we add facet.method=enum to the query, we get the correct results every time (though slower), so our assumption is that something is only sporadically working when the fc method is chosen by default.
>> >>> >
>> >>> > A few other notes about the collection: it is not freshly indexed, but it has not had any particularly bad failures beyond follower replicas going down due to PKIAuthentication timeouts (has been fixed). It has also had a full reindex after a schema change added docValues to some fields (including the one above), but the collection wasn't emptied first. We are using the composite router to co-locate documents.
>> >>> >
>> >>> > Currently, our plan is just to reindex all of the documents on the affected shard to see if that fixes the problem. Any ideas on what might be happening or ways to troubleshoot this are appreciated.
>> >>> >
>> >>> > Thanks,
>> >>> > Chris