Your questions are already more or less answered:

> 1) If the docValues are that good, can we get rid of the stored values
> altogether?

You can if you want: just configure your field with stored="false" and docValues="true". Note that you can do that only if:
* the field is not analyzed (you cannot enable docValues for an analyzed field)
* you do not care about the order of your values (values of a multi-valued field come back from docValues in sorted order, not in the order they were indexed)
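To illustrate the second condition, here is a minimal Java sketch. It is not Solr code: it assumes a multi-valued string field whose docValues behave like a sorted set (sorted and de-duplicated), while stored values keep the order in which they were indexed. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: plain Java collections standing in for Solr's storage.
class StoredVsDocValuesOrder {

    // Stored values: returned exactly as they were indexed.
    static List<String> asStored(List<String> indexed) {
        return new ArrayList<>(indexed);
    }

    // Assumption: multi-valued docValues act like a sorted set,
    // so insertion order and duplicates are not preserved.
    static List<String> asDocValues(List<String> indexed) {
        return new ArrayList<>(new TreeSet<>(indexed));
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList("zebra", "apple", "mango", "apple");
        System.out.println("stored:    " + asStored(indexed));
        System.out.println("docValues: " + asDocValues(indexed));
    }
}
```

If the original order or the duplicates matter to the application, that is a reason to keep stored="true" for the field.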

> 2) And why the docValues are not enabled by default for multi-valued fields?

Because it is overhead when it comes to indexing and it is not used in all cases - only if the field is used for faceting, sorting or in functions.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 22 Dec 2017, at 19:51, Tech Id <tech.login....@gmail.com> wrote:
>
> Very interesting discussion SG and Erick.
> I wish these details were part of the official Solr documentation as well.
> And yes, "columnar format" did not give any useful information to me either.
>
> "A good explanation increases contributions to the project as more people
> become empowered to improvise."
> - Self, LOL
>
> I was expecting the sorting, faceting, pivoting to be a bit more optimized for
> docValues, something like a pre-calculated bit of information.
> However, now it seems that the major benefit of docValues is to optimize
> the lookup time of stored fields.
> Here is the sorting function I wrote as pseudo-code from the discussion:
>
> int[] docIDs = filterDocsOnQuery(query);
> T[] docValues = loadDocValues(sortField);
> // value -> docIDs with that value; a list keeps docs that share a value
> TreeMap<T, List<Integer>> sortFieldValues = new TreeMap<>();
> for (int docId : docIDs) {
>     T val = docValues[docId];
>     sortFieldValues.computeIfAbsent(val, v -> new ArrayList<>()).add(docId);
> }
> // return docIDs grouped by value, in ascending value order
> return sortFieldValues.values();
>
> It is indeed difficult to pre-compute the sorts and facets because we do
> not know what docIDs will be returned by the filtering.
>
> Two last questions I have are:
> 1) If the docValues are that good, can we get rid of the stored values
> altogether?
> 2) And why the docValues are not enabled by default for multi-valued fields?
>
> -T
>
> On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> OK, last bit of the tutorial.
>>
>> bq: But that does not seem to be helping with sorting or faceting of any kind.
>> This seems to be like a good way to speed up a stored field's retrieval.
>>
>> These are the same thing. I have two docs. I have to know how they
>> sort. Therefore I need the value in the sort field for each. This is the
>> same thing as getting the stored value, no?
>>
>> As for facets it's the same problem. To count facet buckets I have to
>> find the values for the field for each document in the results list
>> and tally them. This is also getting the stored value, right? You're
>> asking "for the docs in my result set, how many of them have val1, how
>> many have val2, how many have val54" etc.
>>
>> And as an aside the docValues can also be used to return the stored value.
>>
>> Best,
>> Erick
>>
>> On Thu, Dec 21, 2017 at 8:23 PM, S G <sg.online.em...@gmail.com> wrote:
>>> Thank you Erick.
>>>
>>> I guess the biggest piece I was missing was the sort on a field other than
>>> the search field.
>>> Once you have filtered a list of documents and then you want to sort, the
>>> inverted index cannot be used for lookup.
>>> You just have doc-IDs, which are values in the inverted index, not the keys.
>>> Hence they cannot be "looked" up - the only option is to loop through all the
>>> entries of that key's inverted index.
>>>
>>> DocValues come to the rescue by reducing that looping operation to a lookup
>>> again.
>>> Because in docValues, the key (i.e. array-index) is the document-index and
>>> gives an O(1) lookup for any doc-ID.
>>>
>>>
>>> But that does not seem to be helping with sorting or faceting of any kind.
>>> This seems to be like a good way to speed up a stored field's retrieval.
>>>
>>> DocValues in the current example are:
>>> FieldA
>>> doc1 = 1
>>> doc2 = 2
>>> doc3 =
>>>
>>> FieldB
>>> doc1 = 2
>>> doc2 = 4
>>> doc3 = 5
>>>
>>> FieldC
>>> doc1 = 5
>>> doc2 =
>>> doc3 = 5
>>>
>>> So if I have to run a query:
>>> fieldA=*&sort=fieldB asc
>>> I will get all the documents due to filter and then I will lookup the
>>> values of field-B from the docValues lookup.
>>> That will give me 2,4,5
>>> This is sorted in this case, but assume that this was not sorted.
>>> (The docValues array is indexed by Lucene's doc-ID not the field-value
>>> after all, right?)
>>>
>>> Then does Lucene/Solr still sort them like a regular array of values?
>>> That does not seem very efficient.
>>> And it does not seem to be helping with faceting or pivoting either.
>>> What did I miss?
>>>
>>> Thanks
>>> SG
>>>
>>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Here's where you're going off the rails: "I can just look at the
>>>> map-for-field-A"
>>>>
>>>> As I said before, you're totally right, all the information you need
>>>> is there. But you're thinking of this as though speed weren't at a
>>>> premium when you say "I can just look". Consider that there are single
>>>> replicas out there with 300M (or more) docs in them. "Just looking" in
>>>> a list 300M items long 300M times (q=*:*&sort=whatever) is simply not
>>>> going to be performant compared to 300M array-indexing operations,
>>>> which is what DV does.
>>>>
>>>> Faceting is much worse.
>>>>
>>>> Plus space is also at a premium. A Java String takes 40+ bytes to store
>>>> its first character. So any Java structure you use is going to be
>>>> enormous. 300M ints is bad enough. And if you spoof this by using
>>>> ordinals as Lucene does, you're well on your way to reinventing docValues.
>>>>
>>>> Maybe this will help. Imagine you have a phone book in your hands.
>>>> It consists of documents like this:
>>>>
>>>> id: something
>>>> phone: phone number
>>>> name: person's name
>>>>
>>>> For simplicity, they're both string types 'cause they sort.
>>>>
>>>> Let's search by phone number but sort by name, i.e.
>>>>
>>>> q=phone:1234*&sort=name asc
>>>>
>>>> I'm searching and find two docs that match. How do I know how they
>>>> sort wrt each other?
>>>>
>>>> I'm searching in the phone field but I need the value for each doc
>>>> associated with the name field. In your example I'm searching in
>>>> map-for-field-A but sorting in map-for-field-B.
>>>>
>>>> To get the name value for these two docs I have to enumerate
>>>> map-for-field-B until I find each doc and then I can get the proper
>>>> value and know how they sort. Sure, I could do some ordering and do a
>>>> binary search but that's still vastly slower than having a structure
>>>> that's a simple index operation to get the value in its field.
>>>>
>>>> The DV structure is actually more like what's below. These structures
>>>> are simply an array indexed by the _internal_ Lucene document id
>>>> (a simple zero-based integer); each slot contains the value
>>>> associated with that doc for that field (I'm simplifying a bit, but
>>>> that's conceptually the deal).
>>>> FieldA
>>>> doc1 = 1
>>>> doc2 = 2
>>>> doc3 =
>>>>
>>>> FieldB
>>>> doc1 = 2
>>>> doc2 = 4
>>>> doc3 = 5
>>>>
>>>> FieldC
>>>> doc1 = 5
>>>> doc2 =
>>>> doc3 = 5
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com> wrote:
>>>>> Thanks a lot Erick and Emir.
>>>>>
>>>>> I am still a bit confused and an example will help me a lot.
>>>>> Here is a slightly modified version of the same to illustrate my point
>>>>> more clearly.
>>>>>
>>>>> Let us consider 3 documents - doc1, doc2 and doc3
>>>>> Each contains up to 3 fields - A, B and C.
>>>>> And the values for these fields are random.
>>>>> For example:
>>>>> doc1 = {A:1, B:2, C:5}
>>>>> doc2 = {A:2, B:4}
>>>>> doc3 = {B:5, C:5}
>>>>>
>>>>>
>>>>> Inverted Index for the same should be a map of:
>>>>> Key: <value-for-each-field>
>>>>> Value: <document-containing-that-value>
>>>>> i.e.
>>>>> {
>>>>> map-for-field-A: {1: doc1, 2: doc2}
>>>>> map-for-field-B: {2: doc1, 4: doc2, 5: doc3}
>>>>> map-for-field-C: {5: [doc1, doc3]}
>>>>> }
>>>>>
>>>>> For sorting on field A, I can just look at the map-for-field-A and sort
>>>>> the keys (and perhaps keep it sorted too for saving the sort each time).
>>>>> For facets on field A, I can again just look at the map-for-field-A and
>>>>> get counts for each value. So I will get facets(Field-A) = {1:1, 2:1}
>>>>> because the count for each value is 1.
>>>>> Similarly facets(Field-C) = {5:2}
>>>>>
>>>>> Why is this not performant? All it did was to bring one data-structure
>>>>> into memory and if the current implementation was changed to use the
>>>>> OS-cache for the same, the pressure on the JVM would be reduced as well.
>>>>>
>>>>> So my question is: how does the data-structure of docValues differ from
>>>>> the inverted index I showed above? And how does that structure help it
>>>>> become more performant? I do not want to factor in the OS-cache
>>>>> perspective here for the time being because that could have been fixed
>>>>> in the regular inverted index also. I just want to focus on the
>>>>> data-structure for now and how it differs from the inverted index.
>>>>> Please do not say "columnar format" as those 2 words really convey
>>>>> nothing to me.
>>>>>
>>>>> If you can draw me the exact "columnar format" for the above example,
>>>>> then it would be much appreciated.
>>>>>
>>>>> Thanks
>>>>> SG
>>>>>
>>>>> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <erickerick...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> bq: I do not see why sorting or faceting on any field A, B or C would
>>>>>> be a problem. All the values for a field are there in one
>>>>>> data-structure and it should be easy to sort or group-by on that.
>>>>>>
>>>>>> This is totally true, just totally incomplete ;)
>>>>>>
>>>>>> for a given field:
>>>>>>
>>>>>> Inverted structure (leaving out position information and the like):
>>>>>>
>>>>>> term1: doc1, doc37, doc95
>>>>>> term2: doc10, doc37, doc950
>>>>>>
>>>>>> docValues structure (assuming multiValued):
>>>>>>
>>>>>> doc1: term1
>>>>>> doc10: term2
>>>>>> doc37: term1 term2
>>>>>> doc95: term1
>>>>>> doc950: term2
>>>>>>
>>>>>> They are used to answer two different questions.
>>>>>>
>>>>>> The inverted structure efficiently answers "for term1, what docs does
>>>>>> it appear in?"
>>>>>>
>>>>>> The docValues structure efficiently answers "for doc1, what terms are
>>>>>> in the field?"
>>>>>>
>>>>>> So imagine you have a search on term1. It's a simple iteration of the
>>>>>> inverted structure to get my result set, namely docs 1, 37, and 95.
>>>>>>
>>>>>> But now I want to facet. I have to get the _values_ for my field from
>>>>>> the entire result set in order to fill my count buckets. Using the
>>>>>> inverted structure, I'd have to scan the entire table term-by-term
>>>>>> and look to see if the term appeared in any of docs 1, 37, 95 and add
>>>>>> to my total for the term. Think "table scan".
>>>>>>
>>>>>> Instead I use the docValues structure, which is much faster. I already
>>>>>> know all I'm interested in is these three docs, so I just read the
>>>>>> terms in the field for each doc and add to my counts. Again, to answer
>>>>>> this question from the wrong (in this case inverted) structure I'd
>>>>>> have to do a table scan.
>>>>>> Also, this would be _extremely_ expensive to
>>>>>> do from stored fields.
>>>>>>
>>>>>> And it's the inverse for searching the docValues structure. In order
>>>>>> to find which doc has term1, I'd have to examine all the terms for the
>>>>>> field for each document in my index. Horribly painful.
>>>>>>
>>>>>> So yes, the information is all there in one structure or the other and
>>>>>> you _could_ get all of it from either one. You'd also have a system
>>>>>> that was able to serve 0.00001 QPS on a largish index.
>>>>>>
>>>>>> And remember that this is very simplified. If you have a complex query
>>>>>> you need to get a result set before even considering the
>>>>>> facet/sort/whatever question so gathering the term information as I
>>>>>> searched wouldn't particularly work.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> It seems that docValues are not really explained well anywhere.
>>>>>>> Here are 2 links that try to explain it:
>>>>>>> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
>>>>>>> 2) https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
>>>>>>>
>>>>>>> And official Solr documentation that does not explain the internal
>>>>>>> details at all:
>>>>>>> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>>>>>>>
>>>>>>> The first link says that:
>>>>>>> The row-oriented (stored fields) are
>>>>>>> {
>>>>>>> 'doc1': {'A':1, 'B':2, 'C':3},
>>>>>>> 'doc2': {'A':2, 'B':3, 'C':4},
>>>>>>> 'doc3': {'A':4, 'B':3, 'C':2}
>>>>>>> }
>>>>>>>
>>>>>>> while column-oriented (docValues) are:
>>>>>>> {
>>>>>>> 'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>> 'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>> 'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>> }
>>>>>>>
>>>>>>> And the second link gives an example as:
>>>>>>> Doc values map documents to the terms contained by the document:
>>>>>>>
>>>>>>> Doc     | Terms
>>>>>>> -----------------------------------------------------------------
>>>>>>> Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>>>>>>> Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>>>>>>> Doc_3 | dog, dogs, fox, jumped, over, quick, the
>>>>>>> -----------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> To me, this example is the same as the row-oriented (stored fields)
>>>>>>> format in the first link.
>>>>>>> Which one is right?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Also, the column-oriented (docValues) mentioned above are:
>>>>>>> {
>>>>>>> 'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>> 'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>> 'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>> }
>>>>>>> Isn't this what the inverted index also looks like?
>>>>>>> Inverted index is an index of the term (A,B,C) to the document and the
>>>>>>> position it is found in the document.
>>>>>>>
>>>>>>>
>>>>>>> Or is it better to say that the inverted index is of the form:
>>>>>>> {
>>>>>>> map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>>>>>>> map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>>>>>>> map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
>>>>>>> }
>>>>>>> But even if that is true, I do not see why sorting or faceting on any
>>>>>>> field A, B or C would be a problem.
>>>>>>> All the values for a field are there in one data-structure and it
>>>>>>> should be easy to sort or group-by on that.
>>>>>>>
>>>>>>> Can someone explain the above a bit more clearly please? An explanation
>>>>>>> that builds upon the same example as above would be really good.
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> SG
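
As a rough sketch of the two structures discussed in this thread, the Java below models field B from the original example (doc1=2, doc2=3, doc3=3). It is a toy model under stated assumptions, not Lucene's implementation: the docValues column is a plain int[] indexed by a zero-based internal doc id, and the names DocValuesSketch, invert, sortByField and facet are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a toy model of one field, not Lucene's actual format.
class DocValuesSketch {

    // docValues view of field B, indexed by internal doc id
    // (doc1 -> 0, doc2 -> 1, doc3 -> 2).
    static final int[] FIELD_B = {2, 3, 3};

    // Inverted view: value -> doc ids containing it.
    // Efficient for "which docs have value v?".
    static Map<Integer, List<Integer>> invert(int[] column) {
        Map<Integer, List<Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < column.length; docId++) {
            inverted.computeIfAbsent(column[docId], v -> new ArrayList<>()).add(docId);
        }
        return inverted;
    }

    // Sort a result set by the field: one array lookup per doc, then a normal sort.
    static List<Integer> sortByField(List<Integer> resultDocIds, int[] column) {
        List<Integer> sorted = new ArrayList<>(resultDocIds);
        sorted.sort(Comparator.comparingInt((Integer docId) -> column[docId]));
        return sorted;
    }

    // Facet a result set: one array lookup per doc, then tally the values.
    static Map<Integer, Integer> facet(List<Integer> resultDocIds, int[] column) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int docId : resultDocIds) {
            counts.merge(column[docId], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Pretend the query matched doc3 and doc1 (internal ids 2 and 0).
        List<Integer> hits = Arrays.asList(2, 0);
        System.out.println("inverted: " + invert(FIELD_B));
        System.out.println("sorted:   " + sortByField(hits, FIELD_B));
        System.out.println("facets:   " + facet(hits, FIELD_B));
    }
}
```

The point of the sketch is the access pattern: sortByField and facet touch only the docs in the result set, whereas answering the same questions from the inverted map would mean walking every value and checking its doc list against the result set.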