I had a quick look and couldn't find anything that would prevent what you
called “franken-segments” in the Lucene test?

On Tue, Dec 18, 2018 at 5:59 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
> A couple of additions:
>
> AddDVMPLuceneTest2 does not use Solr constructs at all, so it's the test
> we think is most interesting at this point; it won't lead anyone down
> the path of "what's all this Solr stuff and is it right?" kinds of
> questions (believe me, we've spent some time on that path!). Please
> feel free to look at all the rest of it of course, but the place we're
> stuck is why this test fails.
>
> AddDvStress is intended as an integration-level test; it requires some
> special setup (specifically, providing a particular configset). We put
> it together to reliably make the problem visible. We thought the new
> code was the issue at first and needed something to narrow down the
> possibilities...
>
> The reason we're obsessing about this is that it calls into question
> how segments are merged when "things change". We don't understand why
> this is happening at the Lucene level, so we don't know how to ensure
> that things like the schema API in Solr aren't affected.
>
> Andrzej isn't the only one running out of ideas ;).
>
> On Tue, Dec 18, 2018 at 4:46 AM Andrzej Białecki <a...@getopt.org> wrote:
> >
> > Hi,
> >
> > I'm working on a use case where an existing Solr setup needs to migrate to 
> > a schema that uses docValues for faceting, instead of uninversion. This 
> > case fits into a broader subject of SOLR-12259 (Robustly upgrade indexes). 
> > However, in this case there are two major requirements for this migration 
> > process:
> >
> > * data cannot be reindexed from scratch - I need to work with the already 
> > indexed documents (which do contain the values needed for faceting, but 
> > these values are simply indexed and not stored as doc values)
> >
> > * indexing can’t be stopped while the schema is being changed (the 
> > conversion process needs to work on-the-fly while the collection is online, 
> > both for searching and for updates). Collection reloads / reopens are ok 
> > but it’s not ok to take the collection offline for several minutes (or 
> > hours).
> >
> > Together with Erick Erickson we implemented a solution that uses 
> > MergePolicy (actually MergePolicyFactory in Solr) to enforce re-writing of 
> > segments that no longer match the schema, i.e. don’t contain docValues in a 
> > field where the new schema requires it. This merge policy determines what 
> > segments need this conversion and then forces the “merging” (actually 
> > re-writing) of these segments by first wrapping them into UninvertingReader 
> > to supply docValues where they are required by the new schema but actually 
> > are missing in the segment’s data. This “AddDocValuesMergePolicy” (ADVMP 
> > for short) is supposed to deal with the following types of segments:
> >
> > * old segments created before the schema change - these don’t contain any 
> > docValues in the target fields and so they are wrapped in UninvertingReader 
> > for merging (and for searching) according to the new schema.
> >
> > * new segments created after the schema change - if the FieldInfo-s for 
> > these fields claim that the segment already contains docValues where it 
> > should, then the segment is passed as-is to merging; otherwise it’s wrapped 
> > again. We also added an optimisation to “mark” already converted segments 
> > using a marker in the SegmentInfo diagnostics map, so that we can avoid 
> > re-checking and re-converting already converted data. (A simplified sketch 
> > of this wrapping decision follows right after this list.)
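> >
> > To make the mechanism concrete, here is roughly what that wrapping decision 
> > looks like (a minimal sketch, not the literal branch code; it assumes the 
> > Lucene 7.x APIs - the public UninvertingReader constructor from lucene-misc, 
> > SlowCodecReaderWrapper and SegmentInfo diagnostics - and the marker key name 
> > below is illustrative only):
> >
> >   import java.io.IOException;
> >   import java.util.Map;
> >   import org.apache.lucene.index.*;
> >   import org.apache.lucene.uninverting.UninvertingReader;
> >
> >   class WrappingSketch {
> >     static final String CONVERTED_MARKER = "__advmp_converted__"; // illustrative name
> >     // target fields -> uninversion types, derived from the new schema
> >     final Map<String, UninvertingReader.Type> uninversionMap;
> >
> >     WrappingSketch(Map<String, UninvertingReader.Type> map) {
> >       this.uninversionMap = map;
> >     }
> >
> >     CodecReader maybeWrap(CodecReader reader) throws IOException {
> >       if (reader instanceof SegmentReader) {
> >         SegmentReader sr = (SegmentReader) reader;
> >         // already-converted segments carry a marker in SegmentInfo diagnostics
> >         if (sr.getSegmentInfo().info.getDiagnostics().containsKey(CONVERTED_MARKER)) {
> >           return reader;
> >         }
> >         // if FieldInfos already claim docValues for every target field, pass as-is
> >         boolean missingDV = false;
> >         for (String field : uninversionMap.keySet()) {
> >           FieldInfo fi = sr.getFieldInfos().fieldInfo(field);
> >           if (fi != null && fi.getDocValuesType() == DocValuesType.NONE) {
> >             missingDV = true;
> >             break;
> >           }
> >         }
> >         if (!missingDV) {
> >           return reader;
> >         }
> >       }
> >       // wrap so that merging sees docValues synthesised from the inverted index
> >       return SlowCodecReaderWrapper.wrap(new UninvertingReader(reader, uninversionMap));
> >     }
> >   }
> >
> > The natural hook for this is MergePolicy.OneMerge.wrapForMerge(CodecReader), 
> > which lets a merge policy substitute the readers that participate in a merge; 
> > the details in our code differ a bit, but this is the shape of it.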
> >
> > So, long story short: this process works very well when there’s no 
> > concurrent indexing activity - all old segments are properly wrapped and 
> > re-written, and merging with new segments works as intended. However, with 
> > concurrent indexing it works well only for a short while. At some point the 
> > conversion process seems to lose a large percentage of the docValues, even 
> > though at all points the source segments appear to be properly wrapped - 
> > the ADVMP merge policy adds a lot of debugging information to track the 
> > source and type of segments across many levels of merging, and whether 
> > they were wrapped or not.
> >
> > My working theory is that somehow this schema change produces 
> > “franken-segments” (while they still haven’t been flushed) where only some 
> > of the most recent docs have the docValues and earlier ones don’t. As I 
> > understand it, this should not happen in Solr, because a schema change 
> > results in a core reload. The tracking information from ADVMP seems to 
> > indicate that all generations of segments, both those flushed and those 
> > merged earlier, have been properly wrapped.
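> >
> > For reference, this per-segment tracking is easy to inspect from the 
> > outside too, e.g. by dumping the diagnostics of every segment in the 
> > latest commit (a trivial sketch, assuming "dir" is the index Directory):
> >
> >   SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
> >   for (SegmentCommitInfo sci : infos) {
> >     // diagnostics carry the segment's "source" (flush vs. merge) plus
> >     // whatever markers the merge policy added
> >     System.out.println(sci.info.name + " -> " + sci.info.getDiagnostics());
> >   }
> >
> > which is handy for checking that the conversion marker survives across 
> > merge generations.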
> >
> > My alternate theory is that there’s some bug in the doc values merging 
> > process when UninvertingReader is involved, because this problem also 
> > occurs when we modify ADVMP to always force the wrapping of all segments 
> > in UninvertingReader-s. The percentage of lost doc values is sometimes 
> > quite large, up to 50% - perhaps it’s a bug somewhere in how the code 
> > accounts for the presence of doc values in FieldCacheImpl?
> >
> > Together with Erick we implemented a bunch of tests that illustrate this 
> > issue - both the tests and the code can be found on branch 
> > "jira/solr-12259":
> >
> > * code.tests.AddDVMPLuceneTest2 - this is a pure Lucene test that shows how 
> > doc values are lost after several rounds of merging while concurrent 
> > indexing is going on. This failure is reproducible 100%.
> >
> > * code.tests.AddDvStress - this is a Solr test that repeatedly creates a 
> > collection without doc values, starts the indexing, changes the config to 
> > use ADVMP, makes the schema change to turn doc values on, and verifies the 
> > number of facets on the target field. This test also fails after a while 
> > with the same symptoms as the Lucene one, so I think that solving the 
> > Lucene test failure should solve this failure too. (The core check both 
> > tests boil down to is sketched below.)
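> >
> > In essence, the verification counts how many documents actually expose a 
> > docValue on the target field and compares that to the number of documents 
> > indexed. Roughly like this (not the literal assertion code from the 
> > branch; it assumes a SORTED docValues field):
> >
> >   import java.io.IOException;
> >   import org.apache.lucene.index.*;
> >   import org.apache.lucene.search.DocIdSetIterator;
> >
> >   class DvCheck {
> >     static long countDocsWithDocValues(IndexReader reader, String field) throws IOException {
> >       long count = 0;
> >       for (LeafReaderContext ctx : reader.leaves()) {
> >         SortedDocValues dv = ctx.reader().getSortedDocValues(field);
> >         if (dv == null) continue; // segment has no docValues for this field at all
> >         // iterate the per-segment docValues; every doc it advances to has a value
> >         while (dv.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
> >           count++;
> >         }
> >       }
> >       return count;
> >     }
> >   }
> >
> > When the bug strikes, this count ends up far below the number of indexed 
> > documents, even though every document was indexed with a value in that field.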
> >
> > Any suggestions or insights are very much appreciated - I'm running out of 
> > ideas to try...
> >
> > —
> >
> > Andrzej Białecki
> >
>


-- 
Adrien
