Thanks Emir,

It seems that stored="false" docValues="true" is the default in Solr's GitHub repository and the recommended way to go.

grep "docValues=\"true\"" ./server/solr/configsets/_default/conf/managed-schema

<dynamicField name="*_str" type="strings" stored="false" docValues="true" indexed="false" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
Point fields don't support FieldCache, so they must have docValues="true" if needed for sorting, faceting, functions, etc.
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
<fieldType name="plong" class="solr.LongPointField" docValues="true"/>
<fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
<fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
<fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

So all the basic field-types (single and multi-valued) would have docValues="true" and stored="false" is the default I assume.

But I do not get why the "id" field and the "dynamic fields" have stored="true" in Solr 7:

grep "stored=\"true\"" ./server/solr/configsets/_default/conf/managed-schema | grep -v "\*_txt_"

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
<dynamicField name="*_l" type="plong" indexed="true" stored="true"/>
<dynamicField name="*_ls" type="plongs" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_bs" type="booleans" indexed="true" stored="true"/>
<dynamicField name="*_f" type="pfloat" indexed="true" stored="true"/>
<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
<dynamicField name="*_d" type="pdouble" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="pdoubles" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/>
<dynamicField name="*_dts" type="pdate" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>
<dynamicField name="*_srpt" type="location_rpt" indexed="true" stored="true"/>
<dynamicField name="*_dpf" type="delimited_payloads_float" indexed="true" stored="true"/>
<dynamicField name="*_dpi" type="delimited_payloads_int" indexed="true" stored="true"/>
<dynamicField name="*_dps" type="delimited_payloads_string" indexed="true" stored="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_ws" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_phon_en" type="phonetic_en" indexed="true" stored="true"/>
<dynamicField name="*_s_lower" type="lowercase" indexed="true" stored="true"/>
<dynamicField name="*_descendent_path" type="descendent_path" indexed="true" stored="true"/>
<dynamicField name="*_ancestor_path" type="ancestor_path" indexed="true" stored="true"/>
<dynamicField name="*_point" type="point" indexed="true" stored="true"/>

That is perhaps a bug?

Booleans seem to care neither about stored nor docValues:

grep -i boolean ./server/solr/configsets/_default/conf/managed-schema

<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>

-T

On Fri, Dec 22, 2017 at 11:20 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Your questions are already more or less answered:
>
> > 1) If the docValues are that good, can we get rid of the stored values
> > altogether?
> You can if you want - just configure your field with stored="false" and
> docValues="true". Note that you can do that only if:
> * field is not analyzed (you cannot enable docValues for analyzed field)
> * you do not care about order of your values
>
> > 2) And why the docValues are not enabled by default for multi-valued
> > fields?
> Because it is overhead when it comes to indexing and it is not used in all
> cases - only if field is used for faceting, sorting or in functions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 22 Dec 2017, at 19:51, Tech Id <tech.login....@gmail.com> wrote:
> >
> > Very interesting discussion SG and Erick.
> > I wish these details were part of the official Solr documentation as well.
> > And yes, "columnar format" did not give any useful information to me either.
> >
> > "A good explanation increases contributions to the project as more people
> > become empowered to improvise."
> > - Self, LOL
> >
> > I was expecting the sorting, faceting, pivoting to be a bit more optimized
> > for docValues, something like a pre-calculated bit of information.
> > However, now it seems that the major benefit of docValues is to optimize
> > the lookup time of stored fields.
> > Here is the sorting function I wrote as pseudo-code from the discussion:
> >
> > int[] docIDs = filterDocsOnQuery(query);
> > T[] docValues = loadDocValues(sortField);
> > // TreeMap keeps its keys sorted; a list per key keeps docs that share a value
> > TreeMap<T, List<Integer>> sortFieldValues = new TreeMap<>();
> > for (int docId : docIDs) {
> >     T val = docValues[docId];  // array lookup by doc ID
> >     sortFieldValues.computeIfAbsent(val, k -> new ArrayList<>()).add(docId);
> > }
> > // return docIDs sorted by value
> > return sortFieldValues.values();
> >
> > It is indeed difficult to pre-compute the sorts and facets because we do
> > not know what docIDs will be returned by the filtering.
> >
> > Two last questions I have are:
> > 1) If the docValues are that good, can we get rid of the stored values
> > altogether?
> > 2) And why the docValues are not enabled by default for multi-valued
> > fields?
> >
> > -T
> >
> > On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> OK, last bit of the tutorial.
> >>
> >> bq: But that does not seem to be helping with sorting or faceting of any
> >> kind.
> >> This seems to be like a good way to speed up a stored field's retrieval.
> >>
> >> These are the same thing. I have two docs. I have to know how they
> >> sort. Therefore I need the value in the sort field for each. This is the
> >> same thing as getting the stored value, no?
> >>
> >> As for facets it's the same problem. To count facet buckets I have to
> >> find the values for the field for each document in the results list
> >> and tally them. This is also getting the stored value, right? You're
> >> asking "for the docs in my result set, how many of them have val1, how
> >> many have val2, how many have val54 etc."
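
The facet tally Erick describes just above ("find the values for the field for each document in the results list and tally them") could be sketched roughly like this. This is only an illustration and not how Lucene actually implements it: the class name, the facetCounts method, the MISSING marker and the plain int[] standing in for a docValues column are all assumptions made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

public class DocValuesFacetSketch {
    // Hypothetical marker meaning "this doc has no value in the field".
    public static final int MISSING = -1;

    // Tally facet buckets with one array lookup per matching doc.
    // docValues[docId] plays the role of the per-field docValues column.
    public static Map<Integer, Integer> facetCounts(int[] matchingDocIds, int[] docValues) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int docId : matchingDocIds) {
            int value = docValues[docId];
            if (value != MISSING) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // FieldC from the thread (doc1 = 5, doc2 = none, doc3 = 5),
        // using zero-based doc IDs 0..2 instead of doc1..doc3.
        int[] fieldC = {5, MISSING, 5};
        int[] resultSet = {0, 1, 2};
        System.out.println(facetCounts(resultSet, fieldC)); // prints {5=2}
    }
}
```

The cost depends on the size of the result set, not on the number of distinct terms in the index, which seems to be the point of the "table scan" comparison further down the thread.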
> >>
> >> And as an aside the docValues can also be used to return the stored
> >> value.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Dec 21, 2017 at 8:23 PM, S G <sg.online.em...@gmail.com> wrote:
> >>> Thank you Eric.
> >>>
> >>> I guess the biggest piece I was missing was the sort on a field other
> >>> than the search field.
> >>> Once you have filtered a list of documents and then you want to sort, the
> >>> inverted index cannot be used for lookup.
> >>> You just have doc-IDs which are values in inverted index, not the keys.
> >>> Hence they cannot be "looked" up - only option is to loop through all the
> >>> entries of that key's inverted index.
> >>>
> >>> DocValues come to rescue by reducing that looping operation to a lookup
> >>> again.
> >>> Because in docValues, the key (i.e. array-index) is the document-index
> >>> and gives an O(1) lookup for any doc-ID.
> >>>
> >>> But that does not seem to be helping with sorting or faceting of any
> >>> kind.
> >>> This seems to be like a good way to speed up a stored field's retrieval.
> >>>
> >>> DocValues in the current example are:
> >>> FieldA
> >>> doc1 = 1
> >>> doc2 = 2
> >>> doc3 =
> >>>
> >>> FieldB
> >>> doc1 = 2
> >>> doc2 = 4
> >>> doc3 = 5
> >>>
> >>> FieldC
> >>> doc1 = 5
> >>> doc2 =
> >>> doc3 = 5
> >>>
> >>> So if I have to run a query:
> >>> fieldA=*&sort=fieldB asc
> >>> I will get all the documents due to filter and then I will lookup the
> >>> values of field-B from the docValues lookup.
> >>> That will give me 2,4,5
> >>> This is sorted in this case, but assume that this was not sorted.
> >>> (The docValues array is indexed by Lucene's doc-ID not the field-value
> >>> after all, right?)
> >>>
> >>> Then does Lucene/Solr still sort them like regular array of values?
> >>> That does not seem very efficient.
> >>> And it does not seem to helping with faceting, pivoting too.
> >>> What did I miss?
> >>>
> >>> Thanks
> >>> SG
> >>>
> >>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerick...@gmail.com>
> >>> wrote:
> >>>
> >>>> Here's where you're going off the rails: "I can just look at the
> >>>> map-for-field-A"
> >>>>
> >>>> As I said before, you're totally right, all the information you need
> >>>> is there. But
> >>>> you're thinking of this as though speed weren't a premium when you say.
> >>>> "I can just look". Consider that there are single replicas out there
> >>>> with 300M (or more) docs in them. "Just looking" in a list 300M items
> >>>> long 300M times (q=*:*&sort=whatever) is simply not going to be
> >>>> performant compared to 300M indexing operations which is what DV does.
> >>>>
> >>>> Faceting is much worse.
> >>>>
> >>>> Plus space is also at a premium. Java takes 40+ bytes to store the first
> >>>> character. So any Java structure you use is going to be enormous. 300M
> >>>> ints is bad enough. And if you spoof this by using ordinals as Lucene
> >>>> does, you're well on your way to reinventing docValues.
> >>>>
> >>>> Maybe this will help. Imagine you have a phone book in your hands. It
> >>>> consists of documents like this:
> >>>>
> >>>> id: something
> >>>> phone: phone number
> >>>> name: person's name
> >>>>
> >>>> For simplicity, they're both string types 'cause they sort.
> >>>>
> >>>> Let's search by phone number but sort by name, i.e.
> >>>>
> >>>> q=phone:1234*&sort=name asc
> >>>>
> >>>> I'm searching and find two docs that match. How do I know how they
> >>>> sort wrt each other?
> >>>>
> >>>> I'm searching in the phone field but I need the value for each doc
> >>>> associated with the name field. In your example I'm searching in
> >>>> map-for-fieldA but sorting in map-for-field-B
> >>>>
> >>>> To get the name value for these two docs I have to enumerate
> >>>> map-for-field-B until I find each doc and then I can get the proper
> >>>> value and know how they sort. Sure, I could do some ordering and do a
> >>>> binary search but that's still vastly slower than having a structure
> >>>> that's a simple index operation to get the value in its field.
> >>>>
> >>>> The DV structure is actually more like what's below. These structures
> >>>> are simply an array indexed by the _internal_ Lucene document id,
> >>>> which is a simple zero-based integer that contains the value
> >>>> associated with that doc for that field (I'm simplifying a bit, but
> >>>> that's conceptually the deal).
> >>>> FieldA
> >>>> doc1 = 1
> >>>> doc2 = 2
> >>>> doc3 =
> >>>>
> >>>> FieldB
> >>>> doc1 = 2
> >>>> doc2 = 4
> >>>> doc3 = 5
> >>>>
> >>>> FieldC
> >>>> doc1 = 5
> >>>> doc2 =
> >>>> doc3 = 5
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com> wrote:
> >>>>> Thanks a lot Erick and Emir.
> >>>>>
> >>>>> I am still a bit confused and an example will help me a lot.
> >>>>> Here is a little bit modified version of the same to illustrate my
> >>>>> point more clearly.
> >>>>>
> >>>>> Let us consider 3 documents - doc1, doc2 and doc3
> >>>>> Each contains upto 3 fields - A, B and C.
> >>>>> And the values for these fields are random.
> >>>>> For example:
> >>>>> doc1 = {A:1, B:2, C:5}
> >>>>> doc2 = {A:2, B:4}
> >>>>> doc3 = {B:5, C:5}
> >>>>>
> >>>>> Inverted Index for the same should be a map of:
> >>>>> Key: <value-for-each-field>
> >>>>> Value: <document-containing-that-value>
> >>>>> i.e.
> >>>>> {
> >>>>> map-for-field-A: {1: doc1, 2: doc2}
> >>>>> map-for-field-B: {2: doc1, 4: doc2, 5: doc3}
> >>>>> map-for-field-C: {5: [doc1, doc3]}
> >>>>> }
> >>>>>
> >>>>> For sorting on field A, I can just look at the map-for-field-A and sort
> >>>>> the keys (and perhaps keep it sorted too for saving the sort each time).
> >>>>> For facets on field A, I can again, just look at the map-for-field-A
> >>>>> and get counts for each value. So I will
> >>>>> get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
> >>>>> Similarly facets(Field-C) = {5:2}
> >>>>>
> >>>>> Why is this not performant? All it did was to bring one data-structure
> >>>>> into memory and if the current implementation was changed to use
> >>>>> OS-cache for the same, the pressure on the JVM would be reduced as well.
> >>>>>
> >>>>> So the point I am trying to make here is that how does the
> >>>>> data-structure of docValues differ from the inverted index I showed
> >>>>> above? And how does that structure helps it become more performant?
> >>>>> I do not want to factor in the OS-cache perspective here for the time
> >>>>> being because that could have been fixed in the regular inverted index
> >>>>> also. I just want to focus on the data-structure for now that how it is
> >>>>> different from the inverted index. Please do not say "columnar format"
> >>>>> as those 2 words really convey nothing to me.
> >>>>>
> >>>>> If you can draw me the exact "columnar format" for the above example,
> >>>>> then it would be much appreciated.
> >>>>>
> >>>>> Thanks
> >>>>> SG
> >>>>>
> >>>>> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <erickerick...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> bq: I do not see why sorting or faceting on any field A, B or C would
> >>>>>> be a problem. All the values for a field are there in one
> >>>>>> data-structure and it should be easy to sort or group-by on that.
> >>>>>>
> >>>>>> This is totally true just totally incomplete: ;)
> >>>>>>
> >>>>>> for a given field:
> >>>>>>
> >>>>>> Inverted structure (leaving out position information and the like):
> >>>>>>
> >>>>>> term1: doc1, doc37, doc95
> >>>>>> term2: doc10, doc37, doc950
> >>>>>>
> >>>>>> docValues structure (assuming multiValued):
> >>>>>>
> >>>>>> doc1: term1
> >>>>>> doc10: term2
> >>>>>> doc37: term1 term2
> >>>>>> doc95: term1
> >>>>>> doc950: term2
> >>>>>>
> >>>>>> They are used to answer two different questions.
> >>>>>>
> >>>>>> The inverted structure efficiently answers "for term1, what docs does
> >>>>>> it appear in?"
> >>>>>>
> >>>>>> The docValues structure efficiently answers "for doc1, what terms are
> >>>>>> in the field?"
> >>>>>>
> >>>>>> So imagine you have a search on term1. It's a simple iteration of the
> >>>>>> inverted structure to get my result set, namely docs 1, 37, and 95.
> >>>>>>
> >>>>>> But now I want to facet. I have to get the _values_ for my field from
> >>>>>> the entire result set in order to fill my count buckets. Using the
> >>>>>> inverted structure, I'd have to scan the entire table term-by-term
> >>>>>> and look to see if the term appeared in any of docs 1, 37, 95 and add
> >>>>>> to my total for the term. Think "table scan".
> >>>>>>
> >>>>>> Instead I use the docValues structure which is much faster, I already
> >>>>>> know all I'm interested in is these three docs, so I just read the
> >>>>>> terms in the field for each doc and add to my counts. Again, to answer
> >>>>>> this question from the wrong (in this case inverted) structure I'd
> >>>>>> have to do a table scan. Also, this would be _extremely_ expensive to
> >>>>>> do from stored fields.
> >>>>>>
> >>>>>> And it's the inverse for searching the docValues structure.
> >>>>>> In order
> >>>>>> to find which doc has term1, I'd have to examine all the terms for the
> >>>>>> field for each document in my index. Horribly painful.
> >>>>>>
> >>>>>> So yes, the information is all there in one structure or the other and
> >>>>>> you _could_ get all of it from either one. You'd also have a system
> >>>>>> that was able to serve 0.00001 QPS on a largish index.
> >>>>>>
> >>>>>> And remember that this is very simplified. If you have a complex query
> >>>>>> you need to get a result set before even considering the
> >>>>>> facet/sort/whatever question so gathering the term information as I
> >>>>>> searched wouldn't particularly work.
> >>>>>>
> >>>>>> Best,
> >>>>>> Erick
> >>>>>>
> >>>>>> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> It seems that docValues are not really explained well anywhere.
> >>>>>>> Here are 2 links that try to explain it:
> >>>>>>> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> >>>>>>> 2) https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
> >>>>>>>
> >>>>>>> And official Solr documentation that does not explain the internal
> >>>>>>> details at all:
> >>>>>>> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> >>>>>>>
> >>>>>>> The first links says that:
> >>>>>>> The row-oriented (stored fields) are
> >>>>>>> {
> >>>>>>> 'doc1': {'A':1, 'B':2, 'C':3},
> >>>>>>> 'doc2': {'A':2, 'B':3, 'C':4},
> >>>>>>> 'doc3': {'A':4, 'B':3, 'C':2}
> >>>>>>> }
> >>>>>>>
> >>>>>>> while column-oriented (docValues) are:
> >>>>>>> {
> >>>>>>> 'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >>>>>>> 'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >>>>>>> 'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >>>>>>> }
> >>>>>>>
> >>>>>>> And the second link gives an example as:
> >>>>>>> Doc values maps documents to the terms contained by the document:
> >>>>>>>
> >>>>>>> Doc   | Terms
> >>>>>>> -----------------------------------------------------------------
> >>>>>>> Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
> >>>>>>> Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
> >>>>>>> Doc_3 | dog, dogs, fox, jumped, over, quick, the
> >>>>>>> -----------------------------------------------------------------
> >>>>>>>
> >>>>>>> To me, this example is same as the row-oriented (stored fields)
> >>>>>>> format in the first link.
> >>>>>>> Which one is right?
> >>>>>>>
> >>>>>>> Also, the column-oriented (docValues) mentioned above are:
> >>>>>>> {
> >>>>>>> 'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >>>>>>> 'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >>>>>>> 'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >>>>>>> }
> >>>>>>> Isn't this what the inverted index also looks like?
> >>>>>>> Inverted index is an index of the term (A,B,C) to the document and
> >>>>>>> the position it is found in the document.
> >>>>>>>
> >>>>>>> Or is it better to say that the inverted index is of the form:
> >>>>>>> {
> >>>>>>> map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
> >>>>>>> map-for-field-B: {2: doc1, 3: [doc2,doc3]}
> >>>>>>> map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> >>>>>>> }
> >>>>>>> But even if that is true, I do not see why sorting or faceting on any
> >>>>>>> field A, B or C would be a problem.
> >>>>>>> All the values for a field are there in one data-structure and it
> >>>>>>> should be easy to sort or group-by on that.
> >>>>>>>
> >>>>>>> Can someone explain the above a bit more clearly please? A build-upon
> >>>>>>> the same example as above would be really good.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> SG
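
For anyone who wants to build upon the three-document example at the bottom of the thread, here is a rough Java sketch of field B (doc1 = 2, doc2 = 3, doc3 = 3) held both ways. It is only an illustration of the two access patterns discussed above, not Lucene's real on-disk format: the class and method names are made up, doc IDs are zero-based (0..2 for doc1..doc3), and a plain int[] stands in for the docValues column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStructuresSketch {
    // Inverted form: value -> docs containing it.
    // Good at answering "which docs have this value?".
    public static Map<Integer, List<Integer>> invert(int[] valueByDoc) {
        Map<Integer, List<Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < valueByDoc.length; docId++) {
            inverted.computeIfAbsent(valueByDoc[docId], v -> new ArrayList<>()).add(docId);
        }
        return inverted;
    }

    // Answering "what value does this doc have?" from the inverted form
    // means walking the entries until the doc turns up (the "table scan").
    public static int valueViaInverted(Map<Integer, List<Integer>> inverted, int docId) {
        for (Map.Entry<Integer, List<Integer>> entry : inverted.entrySet()) {
            if (entry.getValue().contains(docId)) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException("doc has no value: " + docId);
    }

    public static void main(String[] args) {
        int[] fieldB = {2, 3, 3}; // docValues-like column: index = doc ID
        Map<Integer, List<Integer>> inverted = invert(fieldB);

        System.out.println(inverted);                       // prints {2=[0], 3=[1, 2]}
        System.out.println(fieldB[2]);                      // direct lookup, prints 3
        System.out.println(valueViaInverted(inverted, 2));  // scan, prints 3
    }
}
```

Both structures hold the same information; they differ in which question is answered by a direct lookup and which needs a scan.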