I'm not sure that method is viable for reindexing and fetching the whole
collection at once in our case, but unless something in that process is
inherently tied to the collection level, we could do it a few shards at a
time, since it is a multi-tenant setup.

I'll see if we can set up a small test of this in QA and try it out. This
facet issue is the only one we've noticed, and it can be worked around, so
we may end up just waiting until we reindex for version 7.x to fix it
permanently.

Thanks
Chris

On Thu, Oct 12, 2017 at 1:41 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> (1) It doesn't matter whether it "affects only the segments being merged".
> You can't get accurate information if different segments have
> different expectations.
>
> (2) I strongly doubt it. The problem is that the "tainted" segments'
> meta-data is still read when merging. If the segment consisted of
> _only_ deleted documents you'd probably lose it, but it'll be
> re-merged long before it consists of exclusively deleted documents.
>
> Really, you have to re-index to be sure; I suspect you can find a way to
> do this that is faster than exploring undefined behavior and hoping.
>
> If you can re-index _anywhere_ to a collection with the same number of
> shards, you can get this done. It'll take some tricky dancing, but....
>
> 0> copy one index directory from each shard someplace safe.....
> 1> reindex somewhere, single-replica will do.
> 2> Delete all replicas except one for your current collection.
> 3> issue an admin API fetchindex command for each replica in the old
> collection, pulling the index "from the right place" in the new
> collection (see the sketch below). It's important that only a single
> replica for each shard be active at this point. The two collections do
> _not_ need to be part of the same SolrCloud; the fetchindex command just
> takes a URL of the core to fetch from.
> 4> add the replicas back and let them replicate.
>
> Your installation would be unavailable for searching during steps 2-4 of
> course.
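>
> A minimal sketch of what step 3 could look like, assuming placeholder
> host, port and core names (adjust to your own layout); it hits each old
> core's replication handler with command=fetchindex, pointing masterUrl
> at the matching core in the freshly built collection:
>
>     import urllib.parse
>     import urllib.request
>
>     # Illustrative mapping: old core -> new core to pull the index from.
>     SHARD_MAP = {
>         "http://oldhost:8983/solr/oldcoll_shard1_replica1":
>             "http://newhost:8983/solr/newcoll_shard1_replica1",
>         "http://oldhost:8983/solr/oldcoll_shard2_replica1":
>             "http://newhost:8983/solr/newcoll_shard2_replica1",
>     }
>
>     for old_core, new_core in SHARD_MAP.items():
>         # masterUrl points at the source core's replication handler;
>         # it does not have to live in the same SolrCloud cluster.
>         params = urllib.parse.urlencode({
>             "command": "fetchindex",
>             "masterUrl": new_core + "/replication",
>         })
>         url = old_core + "/replication?" + params
>         with urllib.request.urlopen(url) as resp:
>             print(old_core, resp.read().decode("utf-8"))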
>
> Best,
> Erick
>
> On Thu, Oct 12, 2017 at 9:01 AM, Chris Ulicny <culicny@iq.media> wrote:
> > We tested the query on all replicas for the given shard, and they all
> > have the same issue. So deleting and adding another replica won't fix
> > the problem since the leader is exhibiting the behavior as well. I
> > believe the second replica was moved (new one added, old one deleted)
> > between nodes and so was just a copy of the leader's index after the
> > problematic merge happened.
> >
> > bq: Anything that didn't merge old segments, just threw them
> > away when empty (which was my idea) would possibly require as much
> > disk space as the index currently occupied, so doesn't help your
> > disk-constrained situation.
> >
> > Something like this was originally what I thought might fix the issue.
> > If we reindex the data for the affected shard, it would possibly delete
> > all docs from the old segments and just drop them instead of merging
> > them. As mentioned, you'd expect the problems to persist through
> > subsequent merges. So I've got two questions:
> >
> > 1) If the problem persists through merges, does it affect only the
> > segments being merged, so that Solr comes up empty when it goes looking
> > for the values there? As opposed to all segments being affected by a
> > single merge they weren't a part of.
> >
> > 2) Is it expected that any large tainted segments will eventually merge
> > with clean segments, resulting in more tainted segments, as enough docs
> > are deleted from the large segments?
> >
> > Also, we aren't as disk constrained as we were previously. Reindexing a
> > subset of docs is possible, but a full, clean reindex of the collection
> > isn't.
> >
> > Thanks,
> > Chris
> >
> >
> > On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Never mind. Anything that didn't merge old segments, just threw them
> >> away when empty (which was my idea) would possibly require as much
> >> disk space as the index currently occupied, so doesn't help your
> >> disk-constrained situation.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >> > If it's _only_ on a particular replica, here's what you could do:
> >> > Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
> >> > define the "node" parameter on ADDREPLICA to get it back on the same
> >> > node. Then the normal replication process would pull the entire index
> >> > down from the leader.
> >> >
> >> > My bet, though, is that this wouldn't really fix things. While it
> >> > fixes the particular case you've noticed, I'd guess others would pop
> >> > up. You can see which replicas return what by firing individual
> >> > queries at the particular replica in question with &distrib=false,
> >> > something like
> >> >
> >> > solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
> >> > blah blah
> >> >
> >> >
> >> > bq: It is exceedingly unfortunate that reindexing the data on that
> >> > shard only probably won't end up fixing the problem
> >> >
> >> > Well, we've been working on the DWIM (Do What I Mean) feature for
> >> > years, but progress has stalled.
> >> >
> >> > How would that work? You have two segments with vastly different
> >> > characteristics for a field. You could change the type, the
> >> > multiValued-ness, the analysis chain; there's no end to the things
> >> > that could go wrong. Fixing them actually _is_ impossible given how
> >> > Lucene is structured.
> >> >
> >> > Hmmmm, you've now given me a brainstorm I'll suggest on the JIRA
> >> > system after I talk to the dev list....
> >> >
> >> > Consider indexed=true stored=false. After stemming, "running" can be
> >> > indexed as "run". At merge time you have no way of knowing that
> >> > "running" was the original term, so you simply couldn't fix it on
> >> > merge, not to mention that the performance penalty would be...er...
> >> > severe.
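> >> >
> >> > A toy illustration (not Solr's actual analysis code) of why that
> >> > mapping can't be undone at merge time: stemming is many-to-one, so
> >> > the indexed term alone can't tell you what was originally sent.
> >> >
> >> >     # Hypothetical stemming results, purely for illustration.
> >> >     stemmed = {"running": "run", "runs": "run"}
> >> >
> >> >     inverse = {}
> >> >     for original, stem in stemmed.items():
> >> >         inverse.setdefault(stem, []).append(original)
> >> >
> >> >     print(inverse["run"])  # ['running', 'runs'] -- ambiguous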
> >> >
> >> > Best,
> >> > Erick
> >> >
> >> > On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny <culicny@iq.media> wrote:
> >> >> I thought that decision would come back to bite us somehow. At the
> >> >> time, we didn't have enough space available to do a fresh reindex
> >> >> alongside the old collection, so the only course of action available
> >> >> was to index over the old one, and the vast majority of its use
> >> >> worked as expected.
> >> >>
> >> >> We're planning on upgrading to version 7 at some point in the near
> >> >> future and will have enough space to do a full, clean reindex at
> >> >> that time.
> >> >>
> >> >> bq: This can propagate through all following segment merges IIUC.
> >> >>
> >> >> It is exceedingly unfortunate that reindexing the data on that shard
> >> >> only probably won't end up fixing the problem.
> >> >>
> >> >> Out of curiosity, are there any good write-ups or documentation on
> >> >> how two (or more) Lucene segments are merged, or is it just worth
> >> >> looking at the source code to figure that out?
> >> >>
> >> >> Thanks,
> >> >> Chris
> >> >>
> >> >> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson <erickerick...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> bq: ...but the collection wasn't emptied first....
> >> >>>
> >> >>> This is what I'd suspect is the problem. Here's the issue: Segments
> >> >>> aren't merged identically on all replicas. So at some point you had
> >> >>> this field indexed without docValues, changed that and re-indexed.
> >> >>> But the segment merging could "read" the first segment it's going to
> >> >>> merge and think it knows about docValues for that field, when in
> >> >>> fact that segment had the old (non-DV) definition.
> >> >>>
> >> >>> This would not necessarily be the same on all replicas even on the
> >> >>> _same_ shard.
> >> >>>
> >> >>> This can propagate through all following segment merges IIUC.
> >> >>>
> >> >>> So my bet is that if you index into a new collection, everything
> >> >>> will be fine. You can also just delete everything first, but I
> >> >>> usually prefer a new collection so I'm absolutely and positively
> >> >>> sure that the above can't happen.
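> >> >>>
> >> >>> One common way to stage that, sketched with placeholder names (the
> >> >>> collection, config set, shard count, and alias below are all
> >> >>> assumptions): create a fresh collection from the corrected config,
> >> >>> reindex into it, and then point an alias at it so query URLs don't
> >> >>> have to change.
> >> >>>
> >> >>>     import urllib.parse
> >> >>>     import urllib.request
> >> >>>
> >> >>>     SOLR = "http://solr_server:8983/solr"  # placeholder base URL
> >> >>>
> >> >>>     # New, empty collection built from the (corrected) config set.
> >> >>>     create = SOLR + "/admin/collections?" + urllib.parse.urlencode({
> >> >>>         "action": "CREATE",
> >> >>>         "name": "collection1_v2",
> >> >>>         "numShards": "2",
> >> >>>         "replicationFactor": "1",
> >> >>>         "collection.configName": "collection1_conf",
> >> >>>     })
> >> >>>     print(urllib.request.urlopen(create).read().decode("utf-8"))
> >> >>>
> >> >>>     # ... reindex everything into collection1_v2 ...
> >> >>>
> >> >>>     # Then point an alias at the clean collection.
> >> >>>     alias = SOLR + "/admin/collections?" + urllib.parse.urlencode({
> >> >>>         "action": "CREATEALIAS",
> >> >>>         "name": "collection1_alias",
> >> >>>         "collections": "collection1_v2",
> >> >>>     })
> >> >>>     print(urllib.request.urlopen(alias).read().decode("utf-8"))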
> >> >>>
> >> >>> Best,
> >> >>> Erick
> >> >>>
> >> >>> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny <culicny@iq.media>
> >> >>> wrote:
> >> >>> > Hi,
> >> >>> >
> >> >>> > We've run into a strange issue with our deployment of SolrCloud
> >> >>> > 6.3.0. Essentially, a standard facet query on a string field
> >> >>> > usually comes back empty when it shouldn't. However, every now and
> >> >>> > again the query actually returns the correct values. This is only
> >> >>> > affecting a single shard in our setup.
> >> >>> >
> >> >>> > The behavior pattern generally looks like this: the query works
> >> >>> > properly when it hasn't been run recently, and then returns
> >> >>> > nothing after the query seems to have been cached (< 50ms QTime).
> >> >>> > Wait a while and you get the correct result followed by blanks.
> >> >>> > It doesn't matter which replica of the shard is queried; the
> >> >>> > results are the same.
> >> >>> >
> >> >>> > The general query in question looks like
> >> >>> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>
> >> >>> >
> >> >>> > The field is defined in the schema as
> >> >>> > <field name="market" type="string" docValues="true"/>
> >> >>> >
> >> >>> > There are numerous other fields defined similarly, and they do not
> >> >>> > exhibit the same behavior when used as the facet.field value. They
> >> >>> > consistently return the right results on the shard in question.
> >> >>> >
> >> >>> > If we add facet.method=enum to the query, we get the correct
> >> >>> > results every time (though slower), so our assumption is that
> >> >>> > something only sporadically works when the fc method is chosen by
> >> >>> > default.
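> >> >>> >
> >> >>> > A sketch of one way to compare the two facet methods against a
> >> >>> > single replica, with a placeholder core URL and no filters;
> >> >>> > distrib=false keeps the query on the replica you hit:
> >> >>> >
> >> >>> >     import urllib.parse
> >> >>> >     import urllib.request
> >> >>> >
> >> >>> >     # Placeholder core URL; run this against each replica of the shard.
> >> >>> >     CORE = "http://solr_server:8983/solr/collection1_shard1_replica1"
> >> >>> >
> >> >>> >     for method in ("fc", "enum"):
> >> >>> >         params = urllib.parse.urlencode({
> >> >>> >             "q": "*:*",
> >> >>> >             "rows": "0",
> >> >>> >             "facet": "true",
> >> >>> >             "facet.field": "market",
> >> >>> >             "facet.method": method,
> >> >>> >             "distrib": "false",  # don't fan out to other replicas
> >> >>> >             "wt": "json",
> >> >>> >         })
> >> >>> >         with urllib.request.urlopen(CORE + "/select?" + params) as resp:
> >> >>> >             print(method, resp.read().decode("utf-8"))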
> >> >>> >
> >> >>> > A few other notes about the collection. This collection is not
> >> >>> > freshly indexed, but it has not had any particularly bad failures
> >> >>> > beyond follower replicas going down due to PKIAuthentication
> >> >>> > timeouts (since fixed). It has also had a full reindex after a
> >> >>> > schema change added docValues to some fields (including the one
> >> >>> > above), but the collection wasn't emptied first. We are using the
> >> >>> > composite router to co-locate documents.
> >> >>> >
> >> >>> > Currently, our plan is just to reindex all of the documents on the
> >> >>> > affected shard to see if that fixes the problem. Any ideas on what
> >> >>> > might be happening or ways to troubleshoot this are appreciated.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Chris
> >> >>>
> >>
>
