Could we use this as a stepping stone towards a schema? I'm thinking of a very lightweight schema that only enforces what we can easily enforce today, but that puts a minimal abstraction in place on which we can hang future consistency checks.
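
To make the idea concrete, here is a rough sketch of how small such an abstraction could be. It is not tied to any existing Lucene class: `FieldSchema`, `Structure`, `check` and `hasPointsAndDocValues` are all hypothetical names, and the only rule it enforces is the one from the proposal quoted below (all documents that set a field must use the same data structures for it, while documents that leave the field out are always fine).

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a minimal per-field schema; not an existing Lucene class. */
class FieldSchema {
  /** Data structures a field can be indexed into (illustrative subset only). */
  enum Structure { POINTS, DOC_VALUES }

  // First non-empty combination of structures seen for each field name.
  private final Map<String, Set<Structure>> structuresByField = new HashMap<>();

  /**
   * Records which structures one document uses for a field and rejects a
   * combination that differs from what earlier documents used. An empty set
   * (the document has no value for the field) is always accepted.
   */
  synchronized void check(String field, Set<Structure> used) {
    if (used.isEmpty()) {
      return; // missing value: nothing to enforce
    }
    Set<Structure> known = structuresByField.putIfAbsent(field, EnumSet.copyOf(used));
    if (known != null && !known.equals(used)) {
      throw new IllegalArgumentException(
          "Field \"" + field + "\" was indexed with " + known + " but this document uses " + used);
    }
  }

  /** A place where query or sort factories could ask whether shortcuts are safe. */
  synchronized boolean hasPointsAndDocValues(String field) {
    Set<Structure> known = structuresByField.get(field);
    return known != null && known.containsAll(EnumSet.allOf(Structure.class));
  }
}
```

Value-level checks are deliberately left out of this sketch; they would be the "future consistency checks" that could hang off the same object later.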
Re: value consistency; could we do a best-effort enforcement in DefaultIndexingChain, where we have the values unencoded?

On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <[email protected]> wrote:
>
> One way of doing this might be to add an additional field type that adds both
> point and docvalues, and then have factory methods for queries and sorts on
> the field type. So for example a LongPointAndValue field would automatically
> index its value into both BKD and NumericDocValues, and then
> LongPointAndValue#newRangeQuery() would build the relevant
> IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a
> sort field that can use the shortcuts.
>
> On 20 Apr 2020, at 14:10, Adrien Grand <[email protected]> wrote:
>
> Hello,
>
> Lucene currently doesn't require consistency across data-structures. For
> instance it is possible to have different values in points and doc values
> under the same field name. Until now, we worked around it either by making
> features use a single data-structure, e.g. facets only use doc values, or by
> pushing the responsibility of having consistent data across data-structures
> to the user, e.g. IndexOrDocValuesQuery requires that the point query and the
> doc-value query match the same documents and it's the responsibility of the
> user to ensure this.
>
> I'm unhappy that it makes Lucene very hard to use. Creating an efficient
> range query should be a one-liner, but due to this limitation, users have to
> first learn about LongPoint#newRangeQuery,
> NumericDocValuesField#newSlowRangeQuery and then combine them with
> IndexOrDocValuesQuery or maybe even
> IndexSortSortedNumericDocValuesRangeQuery.
> If Lucene had a requirement that
> if a field has both points and numeric doc values then both data-structures
> contain the same content, then we could automatically use the
> IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing
> that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
>
> This question is being raised again as we are working on dynamically pruning
> uncompetitive hits when sorting by field by leveraging the points index.[1]
> This can produce very significant speedups but again requires that the same
> data be indexed in points and doc values.
>
> [1] https://github.com/apache/lucene-solr/pull/1351
>
> We had discussions about adding a notion of schema to Lucene in the past, see
> e.g. [2]. This seems desirable to me but also a high hanging fruit and
> possibly controversial, so my short term proposal would instead be to:
> - Require documents to be consistent in the data-structures that they use:
> you can't have one document using only points for a field and another
> document using only doc values for the same field. Of course it would still
> be possible to index documents that have neither points nor doc values
> indexed even if previous documents had either enabled in order to handle
> documents with missing values properly.
> - Don't hesitate to rely on consistency across fields when implementing new
> functionality, i.e. LongPoint#newRangeQuery would check whether the FieldInfo
> has numeric doc values, and if so would automatically enable the
> IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery
> optimizations.
>
> [2] https://issues.apache.org/jira/browse/LUCENE-6005
>
> Checking that documents have the same values sounds desirable to me but also
> challenging due to how we sometimes encode data on top of the Lucene APIs,
> e.g.
> longs become byte[] in the points index, geo points become a single long
> in doc values, and we have a few use-cases where we encode multiple values into
> a single BinaryDocValuesField in Elasticsearch to work around the absence of
> multi-value binary doc values support. I think it'd be acceptable to not
> validate values but still expect consistency in our search APIs?
>
> What do you think?
>
> --
> Adrien
