Could we use this as a stepping stone towards a schema? Just a very
lightweight schema that only enforces what we can easily enforce
today, but puts some minimal abstraction in place on which we can hang
future consistency checks.
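
To make the idea concrete, here is a minimal sketch of what such a
lightweight schema could track. It assumes the only rule enforced today is
"a field always uses the same combination of data-structures, and documents
without a value are fine". All names (DataStructure, FieldSchema, check) are
hypothetical and not existing Lucene API.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; none of this is existing Lucene API.
enum DataStructure { POINTS, DOC_VALUES }

final class FieldSchema {
    // First non-empty combination of data-structures seen for each field name.
    private final Map<String, Set<DataStructure>> seen = new HashMap<>();

    /**
     * Records how one document uses a field and rejects a combination that
     * differs from the one recorded earlier. An empty set means the document
     * has no value for the field, which is always allowed.
     */
    void check(String field, Set<DataStructure> used) {
        if (used.isEmpty()) {
            return; // missing value: nothing to enforce
        }
        Set<DataStructure> previous = seen.putIfAbsent(field, EnumSet.copyOf(used));
        if (previous != null && !previous.equals(used)) {
            throw new IllegalArgumentException(
                "Field \"" + field + "\" was indexed with " + previous
                    + " but this document uses " + used);
        }
    }
}
```

Later checks (for example on values) could hang off the same per-field
entry, which is the "minimal abstraction" part of the suggestion.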

Re: value consistency: could we do best-effort enforcement in
DefaultIndexingChain, where we have the values unencoded?
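
As a rough illustration of "best-effort", the sketch below compares the raw
long values that one document supplies for points and for doc values under
the same field name. It assumes both sides are plain longs; fields with
custom encodings (geo points, packed binary values) would simply skip the
check. ValueConsistency is a hypothetical helper, not part of
DefaultIndexingChain or any Lucene API.

```java
import java.util.Arrays;

// Hypothetical helper; not part of DefaultIndexingChain or any Lucene API.
final class ValueConsistency {
    private ValueConsistency() {}

    /**
     * Best-effort check for one field of one document: the raw long values
     * handed to the points side and to the doc-values side must be equal as
     * multisets (same values, same number of occurrences, any order).
     */
    static void check(String field, long[] pointValues, long[] docValues) {
        // Sort copies so the caller's arrays are left untouched.
        long[] p = pointValues.clone();
        long[] d = docValues.clone();
        Arrays.sort(p);
        Arrays.sort(d);
        if (!Arrays.equals(p, d)) {
            throw new IllegalArgumentException(
                "Field \"" + field + "\" has points " + Arrays.toString(p)
                    + " but doc values " + Arrays.toString(d));
        }
    }
}
```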

On Mon, Apr 20, 2020 at 9:48 AM Alan Woodward <[email protected]> wrote:
>
> One way of doing this might be to add an additional field type that adds both 
> point and docvalues, and then have factory methods for queries and sorts on 
> the field type.  So for example a LongPointAndValue field would automatically 
> index its value into both BKD and NumericDocValues, and then 
> LongPointAndValue#newRangeQuery() would build the relevant 
> IndexOrDocValuesQuery, and LongPointAndValue#newSortField() would return a 
> sort field that can use the shortcuts.
>
> On 20 Apr 2020, at 14:10, Adrien Grand <[email protected]> wrote:
>
> Hello,
>
> Lucene currently doesn't require consistency across data-structures. For 
> instance it is possible to have different values in points and doc values 
> under the same field name. Until now, we worked around it either by making 
> features use a single data-structure, e.g. facets only use doc values, or by 
> pushing the responsibility of having consistent data across data-structures 
> to the user, e.g. IndexOrDocValuesQuery requires that the point query and the 
> doc-value query match the same documents and it's the responsibility of the 
> user to ensure this.
>
> I'm unhappy that it makes Lucene very hard to use. Creating an efficient 
> range query should be a one-liner, but due to this limitation, users have to 
> first learn about LongPoint#newRangeQuery, 
> NumericDocValuesField#newSlowRangeQuery and then combine them with 
> IndexOrDocValuesQuery or maybe even 
> IndexSortSortedNumericDocValuesRangeQuery. If Lucene had a requirement that 
> if a field both has points and numeric doc values then both data-structures 
> contain the same content, then we could automatically use the 
> IndexOrDocValuesQuery optimization in LongPoint#newRangeQuery when noticing 
> that the field also has doc values of type NUMERIC or SORTED_NUMERIC.
>
> This question is being raised again as we are working on dynamically pruning 
> uncompetitive hits when sorting by field by leveraging the points index.[1] 
> This can produce very significant speedups but again requires that the same 
> data be indexed in points and doc values.
>
> [1] https://github.com/apache/lucene-solr/pull/1351
>
> We had discussions about adding a notion of schema to Lucene in the past, see 
> e.g. [2]. This seems desirable to me but also a high hanging fruit and 
> possibly controversial, so my short term proposal would instead be to:
>  - Require documents to be consistent in the data-structures that they use: 
> you can't have one document using only points for a field and another 
> document using only doc values for the same field. Of course it would still 
> be possible to index documents that have neither points nor doc values 
> indexed even if previous documents had either enabled in order to handle 
> documents with missing values properly.
>  - Don't hesitate to rely on consistency across fields when implementing new 
> functionality, i.e. LongPoint#newRangeQuery would check whether the FieldInfo 
> has numeric doc values, and if so would automatically enable the 
> IndexOrDocValuesQuery and IndexSortSortedNumericDocValuesRangeQuery 
> optimizations.
>
> [2] https://issues.apache.org/jira/browse/LUCENE-6005
>
> Checking that documents have the same values sounds desirable to me but also 
> challenging due to how we sometimes encode data on top of the Lucene APIs, 
> e.g. longs become byte[] in the points index, geo points become a single long 
> in doc values, and we have a few use-cases where we encode multiple values into 
> a single BinaryDocValuesField in Elasticsearch to work around the absence of 
> multi-value binary doc values support. I think it'd be acceptable to not 
> validate values but still expect consistency in our search APIs?
>
> What do you think?
>
> --
> Adrien
>
>
