Hello,

Lucene currently doesn't require consistency across data-structures. For
instance it is possible to have different values in points and doc values
under the same field name. Until now, we worked around it either by making
features use a single data-structure, e.g. facets only use doc values, or
by pushing the responsibility of having consistent data across
data-structures to the user, e.g. IndexOrDocValuesQuery requires that the
point query and the doc-value query match the same documents and it's the
responsibility of the user to ensure this.

I'm unhappy with this because it makes Lucene very hard to use. Creating an
efficient range query should be a one-liner, but due to this limitation, users
have to first learn about LongPoint#newRangeQuery,
NumericDocValuesField#newSlowRangeQuery and then combine them with
IndexOrDocValuesQuery or maybe even
IndexSortSortedNumericDocValuesRangeQuery. If Lucene required that a field
which has both points and numeric doc values contain the same content in both
data-structures, then we could automatically use the
IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing
that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
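
For illustration, here is roughly what users have to write today to get an
efficient range query. This is a minimal sketch, assuming lucene-core is on the
classpath and that the field was indexed as both a LongPoint and a
SortedNumericDocValuesField with identical values. The class and method names
(RangeQueries, newRangeQuery) are made up for the example.

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

// Hypothetical helper, not part of Lucene.
class RangeQueries {
  /**
   * Combines a points query and a doc-values query over the same field.
   * Assumption: every document indexed the same values under `field` in both
   * data-structures. Lucene does not verify this, so results are undefined
   * if the two disagree.
   */
  static Query newRangeQuery(String field, long lower, long upper) {
    // Efficient when the range query leads the iteration.
    Query pointQuery = LongPoint.newRangeQuery(field, lower, upper);
    // Efficient when another, more selective clause leads the iteration.
    Query dvQuery = SortedNumericDocValuesField.newSlowRangeQuery(field, lower, upper);
    return new IndexOrDocValuesQuery(pointQuery, dvQuery);
  }
}
```

If consistency were guaranteed, this wrapping could happen inside
LongPoint#newRangeQuery itself.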

This question is being raised again as we are working on dynamically
pruning uncompetitive hits when sorting by field by leveraging the points
index.[1] This can produce very significant speedups but again requires
that the same data be indexed in points and doc values.

[1] https://github.com/apache/lucene-solr/pull/1351

We had discussions about adding a notion of schema to Lucene in the past,
see e.g. [2]. This seems desirable to me but also high-hanging fruit and
possibly controversial, so my short term proposal would instead be to:
 - Require documents to be consistent in the data-structures that they use:
you can't have one document using only points on a field and another
document using only doc values on the same field. Of course it would
still be possible to index documents that have neither points nor doc
values indexed for that field, even if previous documents had either enabled,
in order to handle documents with missing values properly.
 - Don't hesitate to rely on consistency across data-structures when
implementing new functionality, i.e. LongPoint#newRangeQuery would check whether the
FieldInfo has numeric doc values, and if so would automatically enable
the IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery
optimizations.
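
The first rule above could be sketched as follows. This is a simplified model
in plain Java, not Lucene code: FieldConsistencyChecker, Structure and check
are hypothetical names, and a real implementation would presumably live
alongside FieldInfo.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposed check; not Lucene's API.
class FieldConsistencyChecker {
  /** Data-structures a field may be indexed into (simplified). */
  enum Structure { POINTS, DOC_VALUES }

  // First non-empty set of structures seen for each field name.
  private final Map<String, Set<Structure>> schema = new HashMap<>();

  /**
   * Records the structures one document uses for a field and throws if they
   * differ from what earlier documents used for the same field.
   */
  void check(String field, Set<Structure> used) {
    if (used.isEmpty()) {
      return; // missing value: always allowed
    }
    Set<Structure> expected = schema.putIfAbsent(field, EnumSet.copyOf(used));
    if (expected != null && !expected.equals(used)) {
      throw new IllegalArgumentException(
          "Field \"" + field + "\" uses " + used
              + " but earlier documents used " + expected);
    }
  }
}
```

Note that this only checks which data-structures are used, not that the values
stored in them agree.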

[2] https://issues.apache.org/jira/browse/LUCENE-6005

Checking that documents have the same values sounds desirable to me but
also challenging due to how we sometimes encode data on top of the Lucene
APIs, e.g. longs become byte[] in the points index, geo points become a
single long in doc values, and we have a few use-cases where we encode
multiple values into a single BinaryDocValuesField in Elasticsearch to work
around the absence of multi-value binary doc values support. I think it'd
be acceptable to not validate values but still expect consistency in our
search APIs?
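
To make the encoding problem concrete, here is a sketch of how a long can be
turned into a byte[] whose unsigned byte order matches numeric order, which is
the idea behind the points encoding of longs. The class and method names are
made up for the example and this is not Lucene's own code.

```java
// Hypothetical illustration, not Lucene's implementation.
class SortableLongEncoding {
  /** Flips the sign bit and writes 8 big-endian bytes. */
  static byte[] encode(long value) {
    long flipped = value ^ Long.MIN_VALUE; // negatives now sort before positives
    byte[] out = new byte[Long.BYTES];
    for (int i = 0; i < Long.BYTES; i++) {
      out[i] = (byte) (flipped >>> (56 - 8 * i));
    }
    return out;
  }

  /** Inverse of encode: reads 8 big-endian bytes and flips the sign bit back. */
  static long decode(byte[] bytes) {
    long flipped = 0;
    for (int i = 0; i < Long.BYTES; i++) {
      flipped = (flipped << 8) | (bytes[i] & 0xFFL);
    }
    return flipped ^ Long.MIN_VALUE;
  }
}
```

Comparing such bytes against a raw long stored in doc values requires knowing
which encoding the field type applied, which the low-level indexing APIs do not
know about.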

What do you think?

-- 
Adrien
