Re: Are the new index consistency checks too strict?

Michael Sokolov Thu, 02 Sep 2021 04:46:40 -0700

Oh, and also, I like the idea of making index sorting parent/child aware!

On Thu, Sep 2, 2021 at 7:45 AM Michael Sokolov <[email protected]> wrote:
>
> Yes, I am also supportive of the idea of having a schema that is
> enforced, and I like what it enables us to do. I just wonder if we
> could relax the enforcement around IndexOptions.NONE (and
> DocValuesType.NONE). Would it make sense to enable NONE to be "equal
> to" any other IndexOptions, so that, e.g., if you index a field with
> IndexOptions.DOCS_AND_TERMS then every document must have either
> DOCS_AND_TERMS or NONE?  In the case where a field is *only* indexed
> as terms, and has no docvalues, this is already allowed. But if you
> index a field as both docvalue and terms, then it is not (currently),
> which seems weird. I guess the same is true of a field that has no
> docvalues on some docs, and has them on others, but is also indexed as
> terms everywhere. I think that ought to be allowed (since you can have
> a sparse docvalues field that is not indexed with terms).
>
> On Wed, Sep 1, 2021 at 12:24 PM Adrien Grand <[email protected]> wrote:
> >
> > This additional validation that we introduced in Lucene 9 feels like a 
> > natural extension of the validation that we already had before, such as the 
> > fact that you cannot have some docs that use SORTED doc values and other 
> > docs that use NUMERIC doc values on the same field. Actually I would have 
> > liked to go further by enforcing that all data structures record the exact 
> > same information but this is challenging due to the fact that IndexingChain 
> > only has access to the encoded data, e.g. with IntPoint it only sees a 
> > byte[] rather than the original integer, so we'd have to make assumptions 
> > about how the data is encoded, which doesn't feel right.
> >
> > I do like this additional validation very much because I suspect that in 
> > most cases where users get this error, it is because they made a mistake 
> > in their indexing code. And this also helps make Lucene work better 
> > out-of-the-box. For instance, thanks to this additional validation we 
> > enabled dynamic pruning when sorting on numeric fields by default - this is 
> > opt-in on 8.x since this optimization needs to look at both points and doc 
> > values, so it's broken if not all documents have the same schema. And there 
> > are other things we could do in the near future like rewriting 
> > DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report 
> > that docCount == maxDoc.
> >
> > In my opinion the correct solution for the problem you are facing would be 
> > to have a way to make index sorting aware of the parent/child relationship 
> > so that index sorting would read the sort key of the parent document 
> > whenever it is on a child document, e.g. as done on LUCENE-5312. This way 
> > you wouldn't have to duplicate this sort key from your parent documents to 
> > your child documents, so you wouldn't have any schema issues.
> >
> > On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <[email protected]> wrote:
> >>
> >> While upgrading I ran afoul of some inconsistencies in our schema
> >> usage, and to fix them I've ended up having to add data to our index
> >> that I'd rather not. Let me give a little context: We have a
> >> parent/child document structure. Some fields are shared across parent
> >> and child docs, others are not. Our index has a sort key, and in order
> >> for all the parent/child docs to sort together correctly, we add the
> >> same (docvalues) fields that are part of the sortkey to both parent
> >> and child docs. Some of these fields are *also* indexed as postings
> >> (StringField) of the same name, but we only index the postings field
> >> on the parent document, since child documents are never searched for
> >> on their own - always in conjunction with a parent.
> >>
> >> The schema-checking code we added in Lucene 9 does not allow this: it
> >> enforces that all documents having a field should have the same "index
> >> options", and failing to index the postings gets interpreted as having
> >> index options = NONE (because of the presence of the doc values field
> >> of the same name, I think?)
> >>
> >> Our current solution is to also index the postings for the child
> >> document (but just with an empty string value). This seems gross, and
> >> creates postings in the index that we will never use.
> >>
> >> Another possibility would be to rename the fields so that the postings
> >> and docvalues fields have different names. But in this case our
> >> application-level schema diverges from our Lucene schema, adding a
> >> layer of complexity we'd rather not introduce.
> >>
> >> Finally, could we relax this constraint, always allowing index
> >> options=NONE regardless of how other docs are indexed? Would it cause
> >> problems?
> >>
> >> -Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
> >
> > --
> > Adrien
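
To make the two rules discussed in this thread concrete, here is a minimal
sketch in plain Java. It does not use Lucene and is not how Lucene implements
its checks: the Opts enum and both method names are hypothetical stand-ins
invented for illustration. It only contrasts "all documents must agree on a
field's index options" with the proposed relaxation where NONE is treated as
compatible with any other value.

```java
public class SchemaCheckSketch {

    /** Hypothetical stand-in for a field's index options (not Lucene's enum). */
    public enum Opts { NONE, DOCS, DOCS_AND_FREQS }

    /** Strict rule: a document must use exactly the options already established. */
    public static boolean strictCompatible(Opts established, Opts incoming) {
        return established == incoming;
    }

    /**
     * Relaxed rule as proposed in the thread (an assumption about its intent):
     * NONE on either side is accepted, any other mismatch is still rejected.
     */
    public static boolean relaxedCompatible(Opts established, Opts incoming) {
        return established == Opts.NONE
            || incoming == Opts.NONE
            || established == incoming;
    }
}
```

Under this model, a child document that omits the postings (Opts.NONE) for a
field the parent indexes with Opts.DOCS is rejected by strictCompatible but
accepted by relaxedCompatible, while DOCS versus DOCS_AND_FREQS is rejected
by both.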

