Hmm ... I guess I missed the implication of your comment about requiring both points and docvalues for some cases. That requirement could be violated if we relaxed this "NONE != not-NONE" enforcement for docvalues (or points)...
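
To make the concern concrete, here is a toy Java sketch. It is not Lucene code: `RelaxedSchemaSketch`, `FieldUse`, `Dv`, and all method names are hypothetical stand-ins, and the "strict" and "relaxed" checks only model the rules as they are described in this thread. It shows that a relaxed rule would accept a parent doc with points and docvalues next to a child doc with docvalues only, at which point the two structures no longer cover the same documents.

```java
import java.util.Arrays;
import java.util.List;

public class RelaxedSchemaSketch {
    /** Hypothetical stand-in for a doc values type; not Lucene's DocValuesType. */
    public enum Dv { NONE, NUMERIC, SORTED }

    /** Which structures one document populates for a single field name. */
    public static final class FieldUse {
        public final boolean points;
        public final Dv docValues;

        public FieldUse(boolean points, Dv docValues) {
            this.points = points;
            this.docValues = docValues;
        }
    }

    /** Strict rule: every document that has the field must match the first one exactly. */
    public static boolean strictAccepts(List<FieldUse> docs) {
        for (FieldUse d : docs) {
            FieldUse first = docs.get(0);
            if (d.points != first.points || d.docValues != first.docValues) {
                return false;
            }
        }
        return true;
    }

    /** Relaxed rule: a missing structure (no points, Dv.NONE) is compatible with anything. */
    public static boolean relaxedAccepts(List<FieldUse> docs) {
        Dv seen = Dv.NONE;
        for (FieldUse d : docs) {
            if (d.docValues == Dv.NONE) {
                continue; // absence never conflicts under the relaxed rule
            }
            if (seen != Dv.NONE && seen != d.docValues) {
                return false; // two different non-NONE types still conflict
            }
            seen = d.docValues;
        }
        return true;
    }

    /** True if exactly the same documents have points and doc values. */
    public static boolean pointsAndDocValuesCoverSameDocs(List<FieldUse> docs) {
        for (FieldUse d : docs) {
            if (d.points != (d.docValues != Dv.NONE)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<FieldUse> docs = Arrays.asList(
            new FieldUse(true, Dv.NUMERIC),    // parent: points + doc values
            new FieldUse(false, Dv.NUMERIC));  // child: doc values only

        System.out.println("strict accepts:  " + strictAccepts(docs));   // false
        System.out.println("relaxed accepts: " + relaxedAccepts(docs));  // true
        System.out.println("same coverage:   " + pointsAndDocValuesCoverSameDocs(docs)); // false
    }
}
```

In this model, anything that reads both structures and assumes they describe the same set of documents is only safe when the strict check passes for a field that uses both.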
On Thu, Sep 2, 2021 at 7:46 AM Michael Sokolov <[email protected]> wrote:
>
> Oh, and also, I like the idea of making index sorting parent/child aware!
>
> On Thu, Sep 2, 2021 at 7:45 AM Michael Sokolov <[email protected]> wrote:
> >
> > Yes, I am also supportive of the idea of having a schema that is
> > enforced, and I like what it enables us to do. I just wonder if we
> > could relax the enforcement around IndexOptions.NONE (and
> > DocValuesType.NONE). Would it make sense to enable NONE to be "equal
> > to" any other IndexOptions, so that, e.g., if you index a field with
> > IndexOptions.DOCS_AND_TERMS then every document must have either
> > DOCS_AND_TERMS or NONE? In the case where a field is *only* indexed
> > as terms, and has no docvalues, this is already allowed. But if you
> > index a field as both docvalue and terms, then it is not (currently),
> > which seems weird. I guess the same is true of a field that has no
> > docvalues on some docs, and has them on others, but is also indexed as
> > terms everywhere. I think that ought to be allowed (since you can have
> > a sparse docvalues field that is not indexed with terms).
> >
> > On Wed, Sep 1, 2021 at 12:24 PM Adrien Grand <[email protected]> wrote:
> > >
> > > This additional validation that we introduced in Lucene 9 feels like a
> > > natural extension of the validation that we already had before, such as
> > > the fact that you cannot have some docs that use SORTED doc values and
> > > other docs that use NUMERIC doc values on the same field. Actually I
> > > would have liked to go further by enforcing that all data structures
> > > record the exact same information but this is challenging due to the fact
> > > that IndexingChain only has access to the encoded data, e.g. with
> > > IntPoint it only sees a byte[] rather than the original integer, so we'd
> > > have to make assumptions about how the data is encoded, which doesn't
> > > feel right.
> > >
> > > I do like this additional validation very much because I suspect that
> > > most cases where users would get this error are because they made a
> > > mistake in their indexing code. And this also helps make Lucene work better
> > > out-of-the-box. For instance, thanks to this additional validation we
> > > enabled dynamic pruning when sorting on numeric fields by default - this
> > > is opt-in on 8.x since this optimization needs to look at both points and
> > > doc values, so it's broken if not all documents have the same schema. And
> > > there are other things we could do in the near future like rewriting
> > > DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report
> > > that docCount == maxDoc.
> > >
> > > In my opinion the correct solution for the problem you are facing would
> > > be to have a way to make index sorting aware of the parent/child
> > > relationship so that index sorting would read the sort key of the parent
> > > document whenever it is on a child document, e.g. as done on LUCENE-5312.
> > > This way you wouldn't have to duplicate this sort key from your parent
> > > documents to your child documents, so you wouldn't have any schema issues.
> > >
> > > On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <[email protected]> wrote:
> > >>
> > >> While upgrading I ran afoul of some inconsistencies in our schema
> > >> usage, and to fix them I've ended up having to add data to our index
> > >> that I'd rather not. Let me give a little context: We have a
> > >> parent/child document structure. Some fields are shared across parent
> > >> and child docs, others are not. Our index has a sort key, and in order
> > >> for all the parent/child docs to sort together correctly, we add the
> > >> same (docvalues) fields that are part of the sortkey to both parent
> > >> and child docs.
> > >> Some of these fields are *also* indexed as postings
> > >> (StringField) of the same name, but we only index the postings field
> > >> on the parent document, since child documents are never searched for
> > >> on their own - always in conjunction with a parent.
> > >>
> > >> The schema-checking code we added in Lucene 9 does not allow this: it
> > >> enforces that all documents having a field should have the same "index
> > >> options", and failing to index the postings gets interpreted as having
> > >> index options = NONE (because of the presence of the doc values field
> > >> of the same name, I think?)
> > >>
> > >> Our current solution is to also index the postings for the child
> > >> document (but just with an empty string value). This seems gross, and
> > >> creates postings in the index that we will never use.
> > >>
> > >> Another possibility would be to rename the fields so that the postings
> > >> and docvalues fields have different names. But in this case our
> > >> application-level schema diverges from our Lucene schema, adding a
> > >> layer of complexity we'd rather not introduce.
> > >>
> > >> Finally, could we relax this constraint, always allowing index
> > >> options=NONE regardless of how other docs are indexed? Would it cause
> > >> problems?
> > >>
> > >> -Mike
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [email protected]
> > >> For additional commands, e-mail: [email protected]
> > >>
> > >
> > >
> > > --
> > > Adrien
