This additional validation that we introduced in Lucene 9 feels like a natural extension of the validation that we already had before, such as the fact that you cannot have some docs that use SORTED doc values and other docs that use NUMERIC doc values on the same field. Actually I would have liked to go further by enforcing that all data structures record the exact same information but this is challenging due to the fact that IndexingChain only has access to the encoded data, e.g. with IntPoint it only sees a byte[] rather than the original integer, so we'd have to make assumptions about how the data is encoded, which doesn't feel right.
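To make the encoding point concrete, here is a minimal sketch, assuming the sign-bit-flip, big-endian layout that IntPoint uses for a single int dimension. The class and method names are hypothetical and this is not Lucene code. It only illustrates that the byte[] carries no type information: decoding it correctly requires already knowing how it was encoded.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of Lucene.
final class SortableBytesSketch {

    // Flip the sign bit and write big-endian, so that comparing the bytes
    // as unsigned values gives the same order as comparing the ints.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Reverse of encode(); only valid if the bytes really came from encode().
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xFF) << 24)
                | ((bytes[1] & 0xFF) << 16)
                | ((bytes[2] & 0xFF) << 8)
                | (bytes[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(42);
        System.out.println(Arrays.toString(encoded)); // [-128, 0, 0, 42]
        System.out.println(decode(encoded));          // 42
        // Negative values sort before positive ones in unsigned byte order.
        System.out.println(Arrays.compareUnsigned(encode(-1), encode(1)) < 0); // true
    }
}
```

A 4-byte float point would hand the indexing chain a byte[] of the same length, so comparing it against a numeric doc value would mean guessing which decoder applies.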
I do like this additional validation very much, because I suspect that in most cases where users get this error, it is because they made a mistake in their indexing code. It also helps make Lucene work better out of the box. For instance, thanks to this additional validation we enabled dynamic pruning when sorting on numeric fields by default. This is opt-in on 8.x, since the optimization needs to look at both points and doc values, so it is broken if not all documents have the same schema. And there are other things we could do in the near future, like rewriting DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report that docCount == maxDoc.

In my opinion the correct solution for the problem you are facing would be to have a way to make index sorting aware of the parent/child relationship, so that index sorting would read the sort key of the parent document whenever it is on a child document, e.g. as done on LUCENE-5312 <https://issues.apache.org/jira/browse/LUCENE-5312>. This way you wouldn't have to duplicate this sort key from your parent documents to your child documents, so you wouldn't have any schema issues.

On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <[email protected]> wrote:

> While upgrading I ran afoul of some inconsistencies in our schema usage, and to fix them I've ended up having to add data to our index that I'd rather not. Let me give a little context: We have a parent/child document structure. Some fields are shared across parent and child docs, others are not. Our index has a sort key, and in order for all the parent/child docs to sort together correctly, we add the same (docvalues) fields that are part of the sortkey to both parent and child docs. Some of these fields are *also* indexed as postings (StringField) of the same name, but we only index the postings field on the parent document, since child documents are never searched for on their own - always in conjunction with a parent.
>
> The schema-checking code we added in Lucene 9 does not allow this: it enforces that all documents having a field should have the same "index options", and failing to index the postings gets interpreted as having index options = NONE (because of the presence of the doc values field of the same name, I think?)
>
> Our current solution is to also index the postings for the child document (but just with an empty string value). This seems gross, and creates postings in the index that we will never use.
>
> Another possibility would be to rename the fields so that the postings and docvalues fields have different names. But in this case our application-level schema diverges from our Lucene schema, adding a layer of complexity we'd rather not introduce.
>
> Finally, could we relax this constraint, always allowing index options=NONE regardless of how other docs are indexed? Would it cause problems?
>
> -Mike

--
Adrien
