You're correct that these doc value fields are primarily meant for sorting,
as well as some other use cases like faceting. What you've discovered is
also correct: these fields don't maintain the original ordering, and
SORTED_SET additionally dedupes values (
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/index/DocValuesType.html
).

There's no technical reason new doc value types couldn't be added that
maintain original ordering and don't dedupe, but whether there are enough
use cases to justify them is a question that would need to be considered.
+1 to Shai's suggestion to build on BinaryDocValues. By extending
BinaryDocValuesField, you can encode the doc values however you like. An
example of this can be seen here:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/IntRangeDocValuesField.java
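
As a rough illustration of the "encode it however you like" idea, here is a
minimal sketch of an order-preserving encoding for an int array. It uses only
the JDK; the class name OrderedIntArrayCodec and the fixed-width big-endian
layout are my own assumptions for illustration, not anything Lucene provides.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: packs an int[] into a byte[] without sorting or deduplicating. */
final class OrderedIntArrayCodec {
  private OrderedIntArrayCodec() {}

  /** Encodes each value as 4 big-endian bytes, in the order given. */
  static byte[] encode(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
    for (int v : values) {
      buf.putInt(v);
    }
    return buf.array();
  }

  /** Decodes the whole array back in its original order. */
  static int[] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int[] values = new int[bytes.length / Integer.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getInt();
    }
    return values;
  }

  /** Reads the element at the given subscript without decoding the rest. */
  static int get(byte[] bytes, int index) {
    return ByteBuffer.wrap(bytes).getInt(index * Integer.BYTES);
  }
}
```

At index time the encoded bytes would be wrapped in a BytesRef and passed to
BinaryDocValuesField. At search time the BytesRef returned by
BinaryDocValues#binaryValue() carries its own offset and length, so a real
decoder would need to honor those rather than assume the array starts at 0.
Variable-length values such as strings would need a different layout, for
example a length prefix per element.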

Hope this helps.

Cheers,
-Greg

On Tue, Jun 28, 2022 at 5:52 AM Shai Erera <[email protected]> wrote:

> Depending on what you use the field for, you can use BinaryDocValuesField
> which encodes a byte[] and lets you store the data however you want. But
> how are you using these fields later at search time?
>
> On Tue, Jun 28, 2022 at 3:46 PM linfeng lu <[email protected]> wrote:
>
>> Hi~
>>
>> We are trying to build an OLAP database based on lucene, and we heavily
>> use lucene's *DocValues* (as our column store).
>>
>> *We try to use DocValues to store the array type field. *For example, if
>> we want to store the *field1* and *field2* in this json document into
>> *DocValues* respectively, SORTED_NUMERIC and SORTED_SET seem to be our
>> only option.
>>
>> *{*
>> *    "field1": [ 3, 1, 1, 2 ], *
>> *    "field2": [ "c", "a", "a", "b" ] *
>> *}*
>>
>>
>> When we store *field1* in SORTED_NUMERIC and *field2* in SORTED_SET, we
>> will get this result:
>>
>> field1:
>>
>>    - origin: [3, 1, 1, 2]
>>    - in SORTED_NUMERIC: [1, 1, 2, 3]
>>
>> field2:
>>
>>    - origin: ["c", "a", "a", "b"]
>>    - in SORTED_SET: ords [0, 1, 2] terms ["a", "b", "c"]
>>
>>
>> The original ordering relationship of the elements in the array is lost.
>>
>> We're guessing that lucene's DocValues are designed primarily for sorting
>> and aggregation, so the original order of elements may not matter.
>>
>> But in our use case, it is important to keep the original order of
>> the elements in the array (we allow users to access the elements in the
>> array using the subscript operator).
>>
>> We wonder if lucene has plans to add new types of DocValues that can
>> store arrays and keep the original order of elements in the array?
>>
>> Thanks!
>>
>
