You're correct that these doc value fields are primarily meant for sorting, as well as some other use-cases like faceting. What you've discovered is also correct: these fields don't maintain the original ordering, and SORTED_SET dedupes values (https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/index/DocValuesType.html).
There's no technical reason new doc value types couldn't be added that maintain original ordering and don't dedupe, but whether there are enough use-cases to justify that is a question that would need to be considered.

+1 to Shai's suggestion to build on BinaryDocValues. By extending BinaryDocValuesField, you can encode the doc values however you like. An example of this can be seen here:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/IntRangeDocValuesField.java

Hope this helps.

Cheers,
-Greg

On Tue, Jun 28, 2022 at 5:52 AM Shai Erera <[email protected]> wrote:

> Depending on what you use the field for, you can use BinaryDocValuesField
> which encodes a byte[] and lets you store the data however you want. But
> how are you using these fields later at search time?
>
> On Tue, Jun 28, 2022 at 3:46 PM linfeng lu <[email protected]> wrote:
>
>> Hi~
>>
>> We are trying to build an OLAP database based on lucene, and we heavily
>> use lucene's *DocValues* (as our column store).
>>
>> *We try to use DocValues to store the array type field.* For example, if
>> we want to store *field1* and *field2* of this json document in
>> *DocValues* respectively, SORTED_NUMERIC and SORTED_SET seem to be our
>> only option.
>>
>> {
>>   "field1": [ 3, 1, 1, 2 ],
>>   "field2": [ "c", "a", "a", "b" ]
>> }
>>
>> When we store *field1* in SORTED_NUMERIC and *field2* in SORTED_SET, we
>> will get this result:
>>
>> field1:
>>
>> - origin: [3, 1, 1, 2]
>> - in SORTED_NUMERIC: [1, 1, 2, 3]
>>
>> field2:
>>
>> - origin: ["c", "a", "a", "b"]
>> - in SORTED_SET: ords [0, 1, 2] terms ["a", "b", "c"]
>>
>> The original ordering relationship of the elements in the array is lost.
>>
>> We're guessing that lucene's DocValues are designed primarily for sorting
>> and aggregation, so the original order of elements may not matter.
>>
>> But in our usage scene, it is important to keep the original order of
>> the elements in the array (we allow users to access the elements in the
>> array using the subscript operator).
>>
>> We wonder if lucene has plans to add new types of DocValues that can
>> store arrays and keep the original order of elements in the array?
>>
>> Thanks!
>>
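For illustration, here is a minimal sketch of the "encode it yourself" idea behind the BinaryDocValues suggestion in this thread. It only shows one possible byte layout (an element count followed by the elements in their original order) and uses nothing but the JDK. The class name OrderedArrayCodec and its methods are hypothetical and are not part of Lucene. The assumption is that the resulting byte[] would be wrapped in a BytesRef and handed to a BinaryDocValuesField at index time, then decoded from the BinaryDocValues bytes at search time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Lucene class): packs an array into a byte[]
// so that element order and duplicates survive a round trip.
final class OrderedArrayCodec {
    private OrderedArrayCodec() {}

    // Layout: [count][v0][v1]... with every entry a 4-byte big-endian int.
    static byte[] encodeInts(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (values.length + 1));
        buf.putInt(values.length);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    static int[] decodeInts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] values = new int[buf.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getInt();
        }
        return values;
    }

    // Layout: [count] then, per string, [utf8 length][utf8 bytes].
    static byte[] encodeStrings(List<String> values) {
        List<byte[]> utf8 = new ArrayList<>(values.size());
        int size = Integer.BYTES;
        for (String s : values) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            utf8.add(b);
            size += Integer.BYTES + b.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(values.size());
        for (byte[] b : utf8) {
            buf.putInt(b.length);
            buf.put(b);
        }
        return buf.array();
    }

    static List<String> decodeStrings(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = buf.getInt();
        List<String> values = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            byte[] b = new byte[buf.getInt()];
            buf.get(b);
            values.add(new String(b, StandardCharsets.UTF_8));
        }
        return values;
    }
}
```

With this layout, field1 = [3, 1, 1, 2] and field2 = ["c", "a", "a", "b"] decode back exactly as written, so a subscript lookup is just an index into the decoded array. The trade-off is that such a field is opaque to Lucene, so it cannot be used for sorting or faceting the way SORTED_NUMERIC and SORTED_SET can.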
