Re: Confusing DocValues documentation

Tech Id Fri, 22 Dec 2017 11:46:19 -0800

Thanks Emir,

It seems that stored="false" docValues="true" is the default in Solr's
github and the recommended way to go.



grep "docValues=\"true\""
./server/solr/configsets/_default/conf/managed-schema


    <dynamicField name="*_str" type="strings" stored="false"
docValues="true" indexed="false" />

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
docValues="true" />

    <fieldType name="strings" class="solr.StrField" sortMissingLast="true"
multiValued="true" docValues="true" />

      Point fields don't support FieldCache, so they must have
docValues="true" if needed for sorting, faceting, functions, etc.

    <fieldType name="pint" class="solr.IntPointField" docValues="true"/>

    <fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>

    <fieldType name="plong" class="solr.LongPointField" docValues="true"/>

    <fieldType name="pdouble" class="solr.DoublePointField"
docValues="true"/>

    <fieldType name="pints" class="solr.IntPointField" docValues="true"
multiValued="true"/>

    <fieldType name="pfloats" class="solr.FloatPointField" docValues="true"
multiValued="true"/>

    <fieldType name="plongs" class="solr.LongPointField" docValues="true"
multiValued="true"/>

    <fieldType name="pdoubles" class="solr.DoublePointField"
docValues="true" multiValued="true"/>

    <fieldType name="pdate" class="solr.DatePointField" docValues="true"/>

    <fieldType name="pdates" class="solr.DatePointField" docValues="true"
multiValued="true"/>

    <fieldType name="location" class="solr.LatLonPointSpatialField"
docValues="true"/>



So all the basic field-types (single and multi-valued) would have
docValues="true" and stored="false" is the default I assume.
But I do not get why the "id" field and the "dynamic fields" have
stored="true" in Solr 7:



grep "stored=\"true\""
./server/solr/configsets/_default/conf/managed-schema | grep -v "\*_txt_"


    <field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" />

    <dynamicField name="*_i"  type="pint"    indexed="true"  stored="true"/>

    <dynamicField name="*_is" type="pints"    indexed="true"
stored="true"/>

    <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"
/>

    <dynamicField name="*_ss" type="strings"  indexed="true"
stored="true"/>

    <dynamicField name="*_l"  type="plong"   indexed="true"  stored="true"/>

    <dynamicField name="*_ls" type="plongs"   indexed="true"
stored="true"/>

    <dynamicField name="*_txt" type="text_general" indexed="true"
stored="true"/>

    <dynamicField name="*_b"  type="boolean" indexed="true" stored="true"/>

    <dynamicField name="*_bs" type="booleans" indexed="true" stored="true"/>

    <dynamicField name="*_f"  type="pfloat"  indexed="true"  stored="true"/>

    <dynamicField name="*_fs" type="pfloats"  indexed="true"
stored="true"/>

    <dynamicField name="*_d"  type="pdouble" indexed="true"  stored="true"/>

    <dynamicField name="*_ds" type="pdoubles" indexed="true"
stored="true"/>

    <dynamicField name="*_dt"  type="pdate"    indexed="true"
stored="true"/>

    <dynamicField name="*_dts" type="pdate"    indexed="true"  stored="true"
multiValued="true"/>

    <dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

    <dynamicField name="*_srpt"  type="location_rpt" indexed="true"
stored="true"/>

    <dynamicField name="*_dpf" type="delimited_payloads_float"
indexed="true"  stored="true"/>

    <dynamicField name="*_dpi" type="delimited_payloads_int" indexed="true"
stored="true"/>

    <dynamicField name="*_dps" type="delimited_payloads_string"
indexed="true"  stored="true"/>

    <dynamicField name="attr_*" type="text_general" indexed="true"
stored="true" multiValued="true"/>

    <dynamicField name="*_ws" type="text_ws"  indexed="true"
stored="true"/>

    <dynamicField name="*_phon_en" type="phonetic_en"  indexed="true"
stored="true"/>

    <dynamicField name="*_s_lower" type="lowercase"  indexed="true"
stored="true"/>

    <dynamicField name="*_descendent_path" type="descendent_path"
indexed="true"  stored="true"/>

    <dynamicField name="*_ancestor_path" type="ancestor_path"
indexed="true"  stored="true"/>

    <dynamicField name="*_point" type="point"  indexed="true"
stored="true"/>



That is perhaps a bug?



Booleans seem to care neither about stored nor docValues:


grep -i boolean ./server/solr/configsets/_default/conf/managed-schema


    <fieldType name="boolean" class="solr.BoolField"
sortMissingLast="true"/>

    <fieldType name="booleans" class="solr.BoolField"
sortMissingLast="true" multiValued="true"/>



-T



On Fri, Dec 22, 2017 at 11:20 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Your questions are already more or less answered:
> > 1) If the docValues are that good, can we git rid of the stored values
> > altogether?
> You can if you want - just configure your field with stored=“false” and
> docValues=“true”. Note that you can do that only if:
> * field is not analyzed (you cannot enable docValues for analyzed field)
> * you do not care about order of your values
>
> > 2) And why the docValues are not enabled by default for multi-valued
> fields?
> Because it is overhead when it comes to indexing and it is not used in all
> cases - only if field is used for faceting, sorting or in functions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 22 Dec 2017, at 19:51, Tech Id <tech.login....@gmail.com> wrote:
> >
> > Very interesting discussion SG and Erick.
> > I wish these details were part of the official Solr documentation as
> well.
> > And yes, "columnar format" did not give any useful information to me
> either.
> >
> >
> > "A good explanation increases contributions to the project as more people
> > become empowered to improvise."
> >   - Self, LOL
> >
> >
> > I was expecting the sorting, faceting, pivoting to a bit more optimized
> for
> > docValues, something like a pre-calculated bit of information.
> > However, now it seems that the major benefit of docValues is to optimize
> > the lookup time of stored fields.
> > Here is the sorting function I wrote as pseudo-code from the discussion:
> >
> >
> > int docIDs[] = filterDocsOnQuery (query);
> > T docValues[] = loadDocValues (sortField);
> > TreeMap<T, int> sortFieldValues[] = new TreeMap<>();
> > for (int docId : docIDs) {
> >    T val = docValues[docId];
> >    sortFieldValues.put(val, docId);
> > }
> > // return docIDs sorted by value
> > return sortFieldValues.values;
> >
> >
> > It is indeed difficult to pre-compute the sorts and facets because we do
> > not know what docIDs will be returned by the filtering.
> >
> > Two last questions I have are:
> > 1) If the docValues are that good, can we git rid of the stored values
> > altogether?
> > 2) And why the docValues are not enabled by default for multi-valued
> fields?
> >
> >
> > -T
> >
> >
> >
> >
> > On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> >> OK, last bit of the tutorial.
> >>
> >> bq: But that does not seem to be helping with sorting or faceting of any
> >> kind.
> >> This seems to be like a good way to speed up a stored field's retrieval.
> >>
> >> These are the same thing. I have two docs. I have to know how they
> >> sort. Therefore I need the value in the sort field for each. This the
> >> same thing as getting the stored value, no?
> >>
> >> As for facets it's the same problem. To count facet buckets I have to
> >> find the values for the  field for each document in the results list
> >> and tally them. This is also getting the stored value, right? You're
> >> asking "for the docs in my result set, how many of them have val1, how
> >> many have val2, how many have val54 etc.
> >>
> >> And as an aside the docValues can also be used to return the stored
> value.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Dec 21, 2017 at 8:23 PM, S G <sg.online.em...@gmail.com> wrote:
> >>> Thank you Eric.
> >>>
> >>> I guess the biggest piece I was missing was the sort on a field other
> >> than
> >>> the search field.
> >>> Once you have filtered a list of documents and then you want to sort,
> the
> >>> inverted index cannot be used for lookup.
> >>> You just have doc-IDs which are values in inverted index, not the keys.
> >>> Hence they cannot be "looked" up - only option is to loop through all
> the
> >>> entries of that key's inverted index.
> >>>
> >>> DocValues come to rescue by reducing that looping operation to a lookup
> >>> again.
> >>> Because in docValues, the key (i.e. array-index) is the document-index
> >> and
> >>> gives an O(1) lookup for any doc-ID.
> >>>
> >>>
> >>> But that does not seem to be helping with sorting or faceting of any
> >> kind.
> >>> This seems to be like a good way to speed up a stored field's
> retrieval.
> >>>
> >>> DocValues in the current example are:
> >>> FieldA
> >>> doc1 = 1
> >>> doc2 = 2
> >>> doc3 =
> >>>
> >>> FieldB
> >>> doc1 = 2
> >>> doc2 = 4
> >>> doc3 = 5
> >>>
> >>> FieldC
> >>> doc1 = 5
> >>> doc2 =
> >>> doc3 = 5
> >>>
> >>> So if I have to run a query:
> >>>    fieldA=*&sort=fieldB asc
> >>> I will get all the documents due to filter and then I will lookup the
> >>> values of field-B from the docValues lookup.
> >>> That will give me 2,4,5
> >>> This is sorted in this case, but assume that this was not sorted.
> >>> (The docValues array is indexed by Lucene's doc-ID not the field-value
> >>> after all, right?)
> >>>
> >>> Then does Lucene/Solr still sort them like regular array of values?
> >>> That does not seem very efficient.
> >>> And it does not seem to helping with faceting, pivoting too.
> >>> What did I miss?
> >>>
> >>> Thanks
> >>> SG
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <
> erickerick...@gmail.com
> >>>
> >>> wrote:
> >>>
> >>>> Here's where you're going off the rails: "I can just look at the
> >>>> map-for-field-A"
> >>>>
> >>>> As I said before, you're totally right, all the information you need
> >>>> is there. But
> >>>> you're thinking of this as though speed weren't a premium when you
> say.
> >>>> "I can just look". Consider that there are single replicas out there
> >> with
> >>>> 300M
> >>>> (or more) docs in them. "Just looking" in a list 300M items long 300M
> >> times
> >>>> (q=*:*&sort=whatever) is simply not going to be performant compared to
> >>>> 300M indexing operations which is what DV does.
> >>>>
> >>>> Faceting is much worse.
> >>>>
> >>>> Plus space is also at a premium. Java takes 40+ bytes to store the
> first
> >>>> character. So any Java structure you use is going to be enormous. 300M
> >> ints
> >>>> is bad enough. And if you spoof this by using ordinals as Lucene does,
> >>>> you're
> >>>> well on your way to reinventing docValues.
> >>>>
> >>>> Maybe this will help. Imagine you have a phone book in your hands. It
> >>>> consists of documents like this:
> >>>>
> >>>> id: something
> >>>> phone: phone number
> >>>> name: person's name
> >>>>
> >>>> For simplicity, they're both string types 'cause they sort.
> >>>>
> >>>> Let's search by phone number but sort by name, i.e.
> >>>>
> >>>> q=phone:1234*&sort=name asc
> >>>>
> >>>> I'm searching and find two docs that match. How do I know how they
> >>>> sort wrt each other?
> >>>>
> >>>> I'm searching in the phone field but I need the value for each doc
> >>>> associated with the name field. In your example I'm searching in
> >>>> map-for-fieldA but sorting in map-for-field-B
> >>>>
> >>>> To get the name value for these two docs I have to enumerate
> >>>> map-for-field-B until I find each doc and then I can get the proper
> >>>> value and know how they sort. Sure, I could do some ordering and do a
> >>>> binary search but that's still vastly slower than having a structure
> >>>> that's a simple index operation to get the value in its field.
> >>>>
> >>>> The DV structure is actually more like what's below. These structures
> >>>> are simply an array indexed by the _internal_ Lucene document id,
> >>>> which is a simple zero-based integer that contains the value
> >>>> associated with that doc for that field (I'm simplifying a bit, but
> >>>> that's conceptually the deal).
> >>>> FieldA
> >>>> doc1 = 1
> >>>> doc2 = 2
> >>>> doc3 =
> >>>>
> >>>> FieldB
> >>>> doc1 = 2
> >>>> doc2 = 4
> >>>> doc3 = 5
> >>>>
> >>>> FieldC
> >>>> doc1 = 5
> >>>> doc2 =
> >>>> doc3 = 5
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com>
> wrote:
> >>>>> Thanks a lot Erick and Emir.
> >>>>>
> >>>>> I am still a bit confused and an example will help me a lot.
> >>>>> Here is a little bit modified version of the same to illustrate my
> >> point
> >>>>> more clearly.
> >>>>>
> >>>>> Let us consider 3 documents - doc1, doc2 and doc3
> >>>>> Each contains upto 3 fields - A, B and C.
> >>>>> And the values for these fields are random.
> >>>>> For example:
> >>>>>    doc1 = {A:1, B:2, C:5}
> >>>>>    doc2 = {A:2, B:4}
> >>>>>    doc3 = {B:5, C:5}
> >>>>>
> >>>>>
> >>>>> Inverted Index for the same should be a map of:
> >>>>> Key: <value-for-each-field>
> >>>>> Value: <document-containing-that-value>
> >>>>> i.e.
> >>>>> {
> >>>>>   map-for-field-A: {1: doc1, 2: doc2}
> >>>>>   map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
> >>>>>   map-for-field-C: {5: [doc1, doc3]}
> >>>>> }
> >>>>>
> >>>>> For sorting on field A, I can just look at the map-for-field-A and
> >> sort
> >>>> the
> >>>>> keys (and
> >>>>> perhaps keep it sorted too for saving the sort each time). For facets
> >> on
> >>>>> field A, I can
> >>>>> again, just look at the map-for-field-A and get counts for each
> value.
> >>>> So I
> >>>>> will
> >>>>> get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
> >>>>> Similarly facets(Field-C) = {5:2}
> >>>>>
> >>>>> Why is this not performant? All it did was to bring one
> data-structure
> >>>> into
> >>>>> memory and if
> >>>>> the current implementation was changed to use OS-cache for the same,
> >> the
> >>>>> pressure on
> >>>>> the JVM would be reduced as well.
> >>>>>
> >>>>> So the point I am trying to make here is that how does the
> >>>> data-structure of
> >>>>> docValues differ from the inverted index I showed above? And how does
> >>>> that
> >>>>> structure helps it become more performant? I do not want to factor in
> >> the
> >>>>> OS-cache perspective here for the time being because that could have
> >> been
> >>>>> fixed in the regular inverted index also. I just want to focus on the
> >>>>> data-structure
> >>>>> for now that how it is different from the inverted index. Please do
> >> not
> >>>> say
> >>>>> "columnar format" as
> >>>>> those 2 words really convey nothing to me.
> >>>>>
> >>>>> If you can draw me the exact "columnar format" for the above example,
> >>>> then
> >>>>> it would be much appreciated.
> >>>>>
> >>>>> Thanks
> >>>>> SG
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <
> >>>> erickerick...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> bq: I do not see why sorting or faceting on any field A, B or C
> would
> >>>>>> be a problem. All the values for a field are there in one
> >>>>>> data-structure and it should be easy to sort or group-by on that.
> >>>>>>
> >>>>>> This is totally true just totally incomplete: ;)
> >>>>>>
> >>>>>> for a given field:
> >>>>>>
> >>>>>> Inverted structure (leaving out position information and the like):
> >>>>>>
> >>>>>> term1: doc1,   doc37, doc 95
> >>>>>> term2: doc10, doc37, doc 950
> >>>>>>
> >>>>>> docValues structure (assuming multiValued):
> >>>>>>
> >>>>>> doc1: term1
> >>>>>> doc10: term2
> >>>>>> doc37: term1 term2
> >>>>>> doc95: term1
> >>>>>> doc950: term2
> >>>>>>
> >>>>>> They are used to answer two different questions.
> >>>>>>
> >>>>>> The inverted structure efficiently answers "for term1, what docs
> does
> >>>>>> it appear in?"
> >>>>>>
> >>>>>> The docValues structure efficiently answers "for doc1, what terms
> are
> >>>>>> in the field?"
> >>>>>>
> >>>>>> So imagine you have a search on term1. It's a simple iteration of
> the
> >>>>>> inverted structure to get my result set, namely docs 1, 37, and 95.
> >>>>>>
> >>>>>> But now I want to facet. I have to get the _values_ for my field
> from
> >>>>>> the entire result set in order to fill my count buckets. Using the
> >>>>>> uninverted structure, I'd have to scan the entire table term-by-term
> >>>>>> and look to see if the term appeared in any of docs 1, 37, 95 and
> add
> >>>>>> to my total for the term. Think "table scan".
> >>>>>>
> >>>>>> Instead I use the docValues structure which is much faster, I
> already
> >>>>>> know all I'm interested in is these three docs, so I just read the
> >>>>>> terms in the field for each doc and add to my counts. Again, to
> >> answer
> >>>>>> this question from the wrong (in this case inverted structure) I'd
> >>>>>> have to do a table scan. Also, this would be _extremely_ expensive
> to
> >>>>>> do from stored fields.
> >>>>>>
> >>>>>> And it's the inverse for searching the docValues structure. In order
> >>>>>> to find which doc has term1, I'd have to examine all the terms for
> >> the
> >>>>>> field for each document in my index. Horribly painful.
> >>>>>>
> >>>>>> So yes, the information is all there in one structure or the other
> >> and
> >>>>>> you _could_ get all of it from either one. You'd also have a system
> >>>>>> that was able to serve 0.00001 QPS on a largish index.
> >>>>>>
> >>>>>> And remember that this is very simplified. If you have a complex
> >> query
> >>>>>> you need to get a result set before even considering the
> >>>>>> facet/sort/whatever question so gathering the term information as I
> >>>>>> searched wouldn't particularly work.
> >>>>>>
> >>>>>> Best,
> >>>>>> Erick
> >>>>>>
> >>>>>> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com>
> >> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> It seems that docValues are not really explained well anywhere.
> >>>>>>> Here are 2 links that try to explain it:
> >>>>>>> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-
> >> solr-4-2/
> >>>>>>> 2)
> >>>>>>> https://www.elastic.co/guide/en/elasticsearch/guide/
> >>>>>> current/docvalues.html
> >>>>>>>
> >>>>>>> And official Solr documentation that does not explain the internal
> >>>>>> details
> >>>>>>> at all:
> >>>>>>> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> >>>>>>>
> >>>>>>> The first links says that:
> >>>>>>>  The row-oriented (stored fields) are
> >>>>>>>  {
> >>>>>>>    'doc1': {'A':1, 'B':2, 'C':3},
> >>>>>>>    'doc2': {'A':2, 'B':3, 'C':4},
> >>>>>>>    'doc3': {'A':4, 'B':3, 'C':2}
> >>>>>>>  }
> >>>>>>>
> >>>>>>>  while column-oriented (docValues) are:
> >>>>>>>  {
> >>>>>>>    'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >>>>>>>    'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >>>>>>>    'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >>>>>>>  }
> >>>>>>>
> >>>>>>> And the second link gives an example as:
> >>>>>>> Doc values maps documents to the terms contained by the document:
> >>>>>>>
> >>>>>>>  Doc      Terms
> >>>>>>>  ------------------------------------------------------------
> >> -----
> >>>>>>>  Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
> >>>>>>>  Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
> >>>>>>>  Doc_3 | dog, dogs, fox, jumped, over, quick, the
> >>>>>>>  ------------------------------------------------------------
> >> -----
> >>>>>>>
> >>>>>>>
> >>>>>>> To me, this example is same as the row-oriented (stored fields)
> >>>> format in
> >>>>>>> the first link.
> >>>>>>> Which one is right?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Also, the column-oriented (docValues) mentioned above are:
> >>>>>>> {
> >>>>>>>  'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >>>>>>>  'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >>>>>>>  'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >>>>>>> }
> >>>>>>> Isn't this what the inverted index also looks like?
> >>>>>>> Inverted index is an index of the term (A,B,C) to the document and
> >> the
> >>>>>>> position it is found in the document.
> >>>>>>>
> >>>>>>>
> >>>>>>> Or is it better to say that the inverted index is of the form:
> >>>>>>> {
> >>>>>>>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
> >>>>>>>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
> >>>>>>>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> >>>>>>> }
> >>>>>>> But even if that is true, I do not see why sorting or faceting on
> >> any
> >>>>>> field
> >>>>>>> A, B or C would be a problem.
> >>>>>>> All the values for a field are there in one data-structure and it
> >>>> should
> >>>>>> be
> >>>>>>> easy to sort or group-by on that.
> >>>>>>>
> >>>>>>> Can someone explain the above a bit more clearly please? A
> >> build-upon
> >>>> the
> >>>>>>> same example as above would be really good.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> SG
> >>>>>>
> >>>>
> >>
>
>

Re: Confusing DocValues documentation

Reply via email to