(1) It doesn't matter whether it "affects only the segments being merged". You can't get accurate information out of the index if different segments have different expectations for the same field.
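You can see that inconsistency directly by hitting each replica of the shard with &distrib=false and comparing facet.method=fc against facet.method=enum. Something like this (server, port, and core name are placeholders, of course):

http://solr_server:port/solr/collection1_shard1_replica1/select?q=*:*&rows=0&distrib=false&facet=true&facet.field=market&facet.method=fc

and then the same URL with facet.method=enum. The enum method walks the indexed terms, while fc reads docValues (or the field cache), so a core whose segments disagree about docValues for "market" should show the two methods returning different counts.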
(2) I strongly doubt it. The problem is that the "tainted" segments' meta-data is still read when merging. If a segment consisted of _only_ deleted documents you'd probably lose it, but it'll be re-merged long before it consists exclusively of deleted documents.

Really, you have to re-index to be sure. I suspect you can find some way to do that faster than exploring undefined behavior and hoping. If you can re-index _anywhere_ into a collection with the same number of shards you can get this done. It'll take some tricky dancing, but:

0> Copy one index directory from each shard someplace safe.
1> Re-index somewhere; single-replica will do.
2> Delete all replicas except one for your current collection.
3> Issue a fetchindex command for each remaining replica in the old collection, pulling the index "from the right place" in the new collection (example calls below). It's important that only a single replica per shard be active at this point. These two collections do _not_ need to be part of the same SolrCloud; the fetchindex command just takes a URL of the core to fetch from.
4> Add the replicas back and let them replicate.

Your installation would be unavailable for searching during steps 2-4 of course.
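To make steps 2-4 concrete, the calls look roughly like the following. Host names, ports, and the collection/core/replica names here are all made up; pull the real core names and core_node names from CLUSTERSTATUS or the admin UI first.

Step 2, once for each extra replica:
http://oldhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=oldcollection&shard=shard1&replica=core_node2

Step 3, on the surviving replica of each shard, pointing at the matching shard of the new collection (I believe masterUrl wants the /replication endpoint of the source core, but double-check):
http://oldhost:8983/solr/oldcollection_shard1_replica1/replication?command=fetchindex&masterUrl=http://newhost:8983/solr/newcollection_shard1_replica1/replication

Step 4, to put the replicas back where they were:
http://oldhost:8983/solr/admin/collections?action=ADDREPLICA&collection=oldcollection&shard=shard1&node=oldhost2:8983_solr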
Best,
Erick

On Thu, Oct 12, 2017 at 9:01 AM, Chris Ulicny <culicny@iq.media> wrote:
> We tested the query on all replicas for the given shard, and they all have the same issue, so deleting and adding another replica won't fix the problem since the leader is exhibiting the behavior as well. I believe the second replica was moved (new one added, old one deleted) between nodes and so was just a copy of the leader's index after the problematic merge happened.
>
> bq: Anything that didn't merge old segments, just threw them away when empty (which was my idea) would possibly require as much disk space as the index currently occupied, so doesn't help your disk-constrained situation.
>
> Something like this was originally what I thought might fix the issue. If we reindex the data for the affected shard, it would possibly delete all docs from the old segments and just drop them instead of merging them. As mentioned, you'd expect the problems to persist through subsequent merges. So I've got two questions:
>
> 1) If the problem persists through merges, does it only affect the segments being merged, so that when Solr goes looking for the values it comes up empty? Rather than all segments being affected by a single merge they weren't a part of.
>
> 2) Is it expected that any large tainted segments will eventually merge with clean segments, resulting in more tainted segments, as enough docs are deleted from the large segments?
>
> Also, we aren't as disk constrained as previously. Reindexing a subset of docs is possible, but a full, clean reindex of the collection isn't.
>
> Thanks,
> Chris
>
> On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Never mind. Anything that didn't merge old segments, just threw them away when empty (which was my idea), would possibly require as much disk space as the index currently occupies, so it doesn't help your disk-constrained situation.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > If it's _only_ on a particular replica, here's what you could do: just DELETEREPLICA it, then ADDREPLICA to bring it back. You can set the "node" parameter on ADDREPLICA to get it back on the same node. Then the normal replication process would pull the entire index down from the leader.
>> >
>> > My bet, though, is that this wouldn't really fix things. While it fixes the particular case you've noticed, I'd guess others would pop up. You can see which replicas return what by firing individual queries at the particular replica in question with &distrib=false, something like:
>> >
>> > solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah blah blah
>> >
>> > bq: It is exceedingly unfortunate that reindexing the data on that shard only probably won't end up fixing the problem
>> >
>> > Well, we've been working on the DWIM (Do What I Mean) feature for years, but progress has stalled.
>> >
>> > How would that work? You have two segments with vastly different characteristics for a field. You could change the type, the multiValued-ness, the analysis chain; there's no end to the things that could go wrong. Fixing them actually _is_ impossible given how Lucene is structured.
>> >
>> > Hmmmm, you've now given me a brainstorm I'll suggest in the JIRA system after I talk to the dev list....
>> >
>> > Consider indexed=true stored=false. After stemming, "running" can be indexed as "run". At merge time you have no way of knowing that "running" was the original term, so you simply couldn't fix it on merge, not to mention that the performance penalty would be... er... severe.
>> >
>> > Best,
>> > Erick
>> >
>> > On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny <culicny@iq.media> wrote:
>> >> I thought that decision would come back to bite us somehow. At the time, we didn't have enough space available to do a fresh reindex alongside the old collection, so the only course of action available was to index over the old one, and the vast majority of its use worked as expected.
>> >>
>> >> We're planning on upgrading to version 7 at some point in the near future and will have enough space to do a full, clean reindex at that time.
>> >>
>> >> bq: This can propagate through all following segment merges IIUC.
>> >>
>> >> It is exceedingly unfortunate that reindexing the data on that shard only probably won't end up fixing the problem.
>> >>
>> >> Out of curiosity, are there any good write-ups or documentation on how two (or more) Lucene segments are merged, or is it just worth looking at the source code to figure that out?
>> >>
>> >> Thanks,
>> >> Chris
>> >>
>> >> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson <erickerick...@gmail.com> wrote:
>> >>
>> >>> bq: ...but the collection wasn't emptied first....
>> >>>
>> >>> This is what I'd suspect is the problem. Here's the issue: segments aren't merged identically on all replicas. So at some point you had this field indexed without docValues, changed that, and re-indexed. But the segment merging could "read" the first segment it's going to merge and think it knows about docValues for that field, when in fact that segment had the old (non-docValues) definition.
>> >>>
>> >>> This would not necessarily be the same on all replicas, even on the _same_ shard.
>> >>>
>> >>> This can propagate through all following segment merges, IIUC.
>> >>>
>> >>> So my bet is that if you index into a new collection, everything will be fine.
>> >>> You can also just delete everything first, but I usually prefer a new collection so I'm absolutely and positively sure that the above can't happen.
>> >>>
>> >>> Best,
>> >>> Erick
>> >>>
>> >>> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny <culicny@iq.media> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > We've run into a strange issue with our deployment of SolrCloud 6.3.0. Essentially, a standard facet query on a string field usually comes back empty when it shouldn't. However, every now and again the query actually returns the correct values. This is only affecting a single shard in our setup.
>> >>> >
>> >>> > The behavior pattern generally looks like this: the query works properly when it hasn't been run recently, then returns nothing once the query seems to have been cached (< 50ms QTime). Wait a while and you get the correct result, followed by blanks. It doesn't matter which replica of the shard is queried; the results are the same.
>> >>> >
>> >>> > The general query in question looks like:
>> >>> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>
>> >>> >
>> >>> > The field is defined in the schema as <field name="market" type="string" docValues="true"/>
>> >>> >
>> >>> > There are numerous other fields defined similarly, and they do not exhibit the same behavior when used as the facet.field value. They consistently return the right results on the shard in question.
>> >>> >
>> >>> > If we add facet.method=enum to the query, we get the correct results every time (though slower), so our assumption is that something is only sporadically working when the fc method is chosen by default.
>> >>> >
>> >>> > A few other notes about the collection: it is not freshly indexed, but it has not had any particularly bad failures beyond follower replicas going down due to PKIAuthentication timeouts (has been fixed). It has also had a full reindex after a schema change added docValues to some fields (including the one above), but the collection wasn't emptied first. We are using the composite router to co-locate documents.
>> >>> >
>> >>> > Currently, our plan is just to reindex all of the documents on the affected shard to see if that fixes the problem. Any ideas on what might be happening or ways to troubleshoot this are appreciated.
>> >>> >
>> >>> > Thanks,
>> >>> > Chris