This additional validation that we introduced in Lucene 9 feels like a
natural extension of the validation that we already had before, such as the
fact that you cannot have some docs that use SORTED doc values and other
docs that use NUMERIC doc values on the same field. Actually I would have
liked to go further by enforcing that all data structures record the exact
same information, but this is challenging because IndexingChain only has
access to the encoded data. For example, with IntPoint it only sees a
byte[] rather than the original integer, so we'd have to make assumptions
about how the data is encoded, which doesn't feel right.
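
To illustrate what "only sees a byte[]" means, here is a rough Python model
of the kind of encoding involved. It assumes the integer is written as 4
big-endian bytes with the sign bit flipped, so that comparing the bytes as
unsigned values matches numeric order. The function names are made up for
this sketch and are not Lucene APIs.

```python
import struct


def encode_sortable_int(value: int) -> bytes:
    """Toy model: flip the sign bit, then write 4 big-endian bytes."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a signed 32-bit int")
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def decode_sortable_int(data: bytes) -> int:
    """Inverse of encode_sortable_int; only valid under the same assumption."""
    (raw,) = struct.unpack(">I", data)
    raw ^= 0x80000000
    return raw - 2**32 if raw >= 2**31 else raw
```

Any cross-structure check done at this level would have to call something
like decode_sortable_int, i.e. assume a specific encoding for bytes that
could have been produced by any field type.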

I do like this additional validation very much because I suspect that in
most cases where users get this error, it is because they made a mistake in
their indexing code. And this also helps make Lucene work better
out-of-the-box. For instance, thanks to this additional validation we
enabled dynamic pruning when sorting on numeric fields by default - this is
opt-in on 8.x since this optimization needs to look at both points and doc
values, so it's broken if not all documents have the same schema. And there
are other things we could do in the near future like rewriting
DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report
that docCount == maxDoc.
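
As a rough illustration of the rule, here is a toy Python model of per-field
schema checking: the first document that uses a field name fixes that
field's schema, and every later document that uses the same name must match
it exactly. The class and attribute names are hypothetical, and Lucene's
real checks cover more attributes than the three modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSchema:
    index_options: str = "NONE"    # e.g. "NONE" or "DOCS"
    doc_values_type: str = "NONE"  # e.g. "NONE", "SORTED", "NUMERIC"
    point_dims: int = 0            # 0 means the field has no points


class SchemaRegistry:
    """Remembers the schema first seen for each field name."""

    def __init__(self):
        self._schemas = {}

    def add_document(self, fields):
        """fields maps field name -> FieldSchema used by this document."""
        for name, schema in fields.items():
            known = self._schemas.setdefault(name, schema)
            if known != schema:
                raise ValueError(
                    f"inconsistent schema for field {name!r}: "
                    f"expected {known}, got {schema}"
                )


registry = SchemaRegistry()
# Parent document: postings and doc values under the same field name.
registry.add_document({"key": FieldSchema("DOCS", "SORTED")})
# A child document with only doc values under "key" would be rejected:
# registry.add_document({"key": FieldSchema("NONE", "SORTED")})
```

In this model, a document that has doc values but no postings for a field
counts as index options = NONE for that field, which is why it conflicts
with documents that do index postings under the same name.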

In my opinion the correct solution for the problem you are facing would be
to have a way to make index sorting aware of the parent/child relationship
so that index sorting would read the sort key of the parent document
whenever it is on a child document, e.g. as done on LUCENE-5312
<https://issues.apache.org/jira/browse/LUCENE-5312>. This way you wouldn't
have to duplicate this sort key from your parent documents to your child
documents, so you wouldn't have any schema issues.
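
A minimal sketch of the idea in Python, assuming each block is stored with
its child documents immediately followed by their parent, and that only
parents carry a sort key. This is a toy model with made-up function names;
it is not the code from LUCENE-5312.

```python
def effective_sort_keys(docs):
    """docs: list of (is_parent, sort_key) tuples in index order.

    Returns one key per document: a parent uses its own key, a child uses
    the key of the next parent that follows it.
    """
    keys = [None] * len(docs)
    parent_key = None
    have_parent = False
    # Walk backwards so each child sees the parent that closes its block.
    for i in range(len(docs) - 1, -1, -1):
        is_parent, key = docs[i]
        if is_parent:
            parent_key, have_parent = key, True
        elif not have_parent:
            raise ValueError("child document without a following parent")
        keys[i] = parent_key
    return keys


def sort_blocks(docs):
    """Sort documents by effective key; the stable sort keeps blocks intact."""
    keys = effective_sort_keys(docs)
    order = sorted(range(len(docs)), key=lambda i: keys[i])
    return [docs[i] for i in order]
```

Because children borrow the key of their parent at sort time, they never
need to carry a copy of the sort key field themselves.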

On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <[email protected]> wrote:

> While upgrading I ran afoul of some inconsistencies in our schema
> usage, and to fix them I've ended up having to add data to our index
> that I'd rather not. Let me give a little context: We have a
> parent/child document structure. Some fields are shared across parent
> and child docs, others are not. Our index has a sort key, and in order
> for all the parent/child docs to sort together correctly, we add the
> same (docvalues) fields that are part of the sort key to both parent
> and child docs. Some of these fields are *also* indexed as postings
> (StringField) of the same name, but we only index the postings field
> on the parent document, since child documents are never searched for
> on their own - always in conjunction with a parent.
>
> The schema-checking code we added in Lucene 9 does not allow this: it
> enforces that all documents having a field should have the same "index
> options", and failing to index the postings gets interpreted as having
> index options = NONE (because of the presence of the doc values field
> of the same name, I think?)
>
> Our current solution is to also index the postings for the child
> document (but just with an empty string value). This seems gross, and
> creates postings in the index that we will never use.
>
> Another possibility would be to rename the fields so that the postings
> and docvalues fields have different names. But in this case our
> application-level schema diverges from our Lucene schema, adding a
> layer of complexity we'd rather not introduce.
>
> Finally, could we relax this constraint, always allowing index
> options=NONE regardless of how other docs are indexed? Would it cause
> problems?
>
> -Mike

-- 
Adrien
