Re: Are the new index consistency checks too strict?

Michael Sokolov Thu, 02 Sep 2021 05:03:06 -0700

Hmm .. I guess I missed the implication of your comment about
requiring both points and docvalues for some cases, which I guess
could be violated if we relaxed this NONE != not NONE enforcement for
docvalues (or points)...
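
To make that concern concrete, here is a toy model in plain Java. It is
not Lucene code: FieldUse, strictlyConsistent and the other names are
made up for illustration, and maxDoc is simply taken to be the size of
the list. It sketches how a "points docCount == maxDoc" shortcut could
give the wrong answer about doc values once NONE is allowed to mix with
non-NONE on the same field.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these names are Lucene APIs. */
final class SchemaRelaxationSketch {

    /** Which structures one document populates for a single field. */
    static final class FieldUse {
        final boolean points;
        final boolean docValues;

        FieldUse(boolean points, boolean docValues) {
            this.points = points;
            this.docValues = docValues;
        }
    }

    /** Strict rule: all documents that use the field populate the same structures. */
    static boolean strictlyConsistent(List<FieldUse> docs) {
        FieldUse first = null;
        for (FieldUse d : docs) {
            if (!d.points && !d.docValues) continue; // field absent: always allowed
            if (first == null) { first = d; continue; }
            if (d.points != first.points || d.docValues != first.docValues) return false;
        }
        return true;
    }

    /** Shortcut: if points cover every document, assume doc values do too. */
    static boolean shortcutAllHaveDocValues(List<FieldUse> docs) {
        long pointsDocCount = docs.stream().filter(d -> d.points).count();
        return pointsDocCount == docs.size(); // models docCount == maxDoc
    }

    /** Ground truth, computed by visiting every document. */
    static boolean actuallyAllHaveDocValues(List<FieldUse> docs) {
        return docs.stream().allMatch(d -> d.docValues);
    }

    public static void main(String[] args) {
        // Second doc has points but doc values "NONE": only legal under the relaxed rule.
        List<FieldUse> relaxed = Arrays.asList(
            new FieldUse(true, true),
            new FieldUse(true, false));
        System.out.println("strictly consistent: " + strictlyConsistent(relaxed));    // false
        System.out.println("shortcut says all:   " + shortcutAllHaveDocValues(relaxed)); // true
        System.out.println("actually all:        " + actuallyAllHaveDocValues(relaxed)); // false
    }
}
```

On input that passes the strict rule the shortcut and the ground truth
agree; they only diverge on the mixed input above.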


On Thu, Sep 2, 2021 at 7:46 AM Michael Sokolov <[email protected]> wrote:
>
> Oh, and also, I like the idea of making index sorting parent/child aware!
>
> On Thu, Sep 2, 2021 at 7:45 AM Michael Sokolov <[email protected]> wrote:
> >
> > Yes, I am also supportive of the idea of having a schema that is
> > enforced, and I like what it enables us to do. I just wonder if we
> > could relax the enforcement around IndexOptions.NONE (and
> > DocValuesType.NONE). Would it make sense to enable NONE to be "equal
> > to" any other IndexOptions, so that, e.g., if you index a field with
> > IndexOptions.DOCS_AND_TERMS then every document must have either
> > DOCS_AND_TERMS or NONE?  In the case where a field is *only* indexed
> > as terms, and has no docvalues, this is already allowed. But if you
> > index a field as both docvalue and terms, then it is not (currently),
> > which seems weird. I guess the same is true of a field that has no
> > docvalues on some docs, and has them on others, but is also indexed as
> > terms everywhere. I think that ought to be allowed (since you can have
> > a sparse docvalues field that is not indexed with terms).
> >
> > On Wed, Sep 1, 2021 at 12:24 PM Adrien Grand <[email protected]> wrote:
> > >
> > > This additional validation that we introduced in Lucene 9 feels like a 
> > > natural extension of the validation that we already had before, such as 
> > > the fact that you cannot have some docs that use SORTED doc values and 
> > > other docs that use NUMERIC doc values on the same field. Actually I 
> > > would have liked to go further by enforcing that all data structures 
> > > record the exact same information but this is challenging due to the fact 
> > > that IndexingChain only has access to the encoded data, e.g. with 
> > > IntPoint it only sees a byte[] rather than the original integer, so we'd 
> > > have to make assumptions about how the data is encoded, which doesn't 
> > > feel right.
> > >
> > > I do like this additional validation very much because I suspect that 
> > > in most cases where users get this error, it is because they made a mistake 
> > > in their indexing code. And this also helps make Lucene work better 
> > > out-of-the-box. For instance, thanks to this additional validation we 
> > > enabled dynamic pruning when sorting on numeric fields by default - this 
> > > is opt-in on 8.x since this optimization needs to look at both points and 
> > > doc values, so it's broken if not all documents have the same schema. And 
> > > there are other things we could do in the near future like rewriting 
> > > DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report 
> > > that docCount == maxDoc.
> > >
> > > In my opinion the correct solution for the problem you are facing would 
> > > be to have a way to make index sorting aware of the parent/child 
> > > relationship so that index sorting would read the sort key of the parent 
> > > document whenever it is on a child document, e.g. as done on LUCENE-5312. 
> > > This way you wouldn't have to duplicate this sort key from your parent 
> > > documents to your child documents, so you wouldn't have any schema issues.
> > >
> > > On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <[email protected]> wrote:
> > >>
> > >> While upgrading I ran afoul of some inconsistencies in our schema
> > >> usage, and to fix them I've ended up having to add data to our index
> > >> that I'd rather not. Let me give a little context: We have a
> > >> parent/child document structure. Some fields are shared across parent
> > >> and child docs, others are not. Our index has a sort key, and in order
> > >> for all the parent/child docs to sort together correctly, we add the
> > >> same (docvalues) fields that are part of the sortkey to both parent
> > >> and child docs. Some of these fields are *also* indexed as postings
> > >> (StringField) of the same name, but we only index the postings field
> > >> on the parent document, since child documents are never searched for
> > >> on their own - always in conjunction with a parent.
> > >>
> > >> The schema-checking code we added in Lucene 9 does not allow this: it
> > >> enforces that all documents having a field should have the same "index
> > >> options", and failing to index the postings gets interpreted as having
> > >> index options = NONE (because of the presence of the doc values field
> > >> of the same name, I think?)
> > >>
> > >> Our current solution is to also index the postings for the child
> > >> document (but just with an empty string value). This seems gross, and
> > >> creates postings in the index that we will never use.
> > >>
> > >> Another possibility would be to rename the fields so that the postings
> > >> and docvalues fields have different names. But in this case our
> > >> application-level schema diverges from our Lucene schema, adding a
> > >> layer of complexity we'd rather not introduce.
> > >>
> > >> Finally, could we relax this constraint, always allowing index
> > >> options=NONE regardless of how other docs are indexed? Would it cause
> > >> problems?
> > >>
> > >> -Mike
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [email protected]
> > >> For additional commands, e-mail: [email protected]
> > >>
> > >
> > >
> > > --
> > > Adrien
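
The parent/child-aware sorting idea quoted above can be sketched in plain
Java as follows. This is a toy illustration only, not the LUCENE-5312
patch and not a Lucene API: ParentAwareSortKeySketch and effectiveKeys
are hypothetical names, and the sketch assumes a block layout in which
child documents come immediately before their parent. Only parents carry
a sort key; each child borrows the key of the next parent after it.

```java
import java.util.Arrays;

/** Toy model only; none of these names are Lucene APIs. */
final class ParentAwareSortKeySketch {

    /**
     * parentKeys[i] is the sort key when doc i is a parent, or null when doc i is a child.
     * Assumes children are stored immediately before their parent.
     */
    static long[] effectiveKeys(Long[] parentKeys) {
        long[] out = new long[parentKeys.length];
        Long current = null;
        // Walk backwards so each child picks up the key of the next parent after it.
        for (int i = parentKeys.length - 1; i >= 0; i--) {
            if (parentKeys[i] != null) {
                current = parentKeys[i];
            }
            if (current == null) {
                throw new IllegalArgumentException("child doc " + i + " has no following parent");
            }
            out[i] = current;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two blocks: (child, child, parent=7) and (child, parent=3).
        Long[] keys = { null, null, 7L, null, 3L };
        System.out.println(Arrays.toString(effectiveKeys(keys))); // [7, 7, 7, 3, 3]
    }
}
```

With keys resolved this way, child documents would not need their own
copy of the sort-key doc values, so the field's schema would stay the
same on every document that carries it.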
