Re: Confusing DocValues documentation

S G Thu, 21 Dec 2017 16:05:35 -0800

Thanks a lot Erick and Emir.

I am still a bit confused and an example will help me a lot.
Here is a little bit modified version of the same to illustrate my point
more clearly.


Let us consider 3 documents - doc1, doc2 and doc3
Each contains upto 3 fields - A, B and C.
And the values for these fields are random.
For example:
    doc1 = {A:1, B:2, C:5}
    doc2 = {A:2, B:4}
    doc3 = {B:5, C:5}


Inverted Index for the same should be a map of:
Key: <value-for-each-field>
Value: <document-containing-that-value>
i.e.
{
   map-for-field-A: {1: doc1, 2: doc2}
   map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
   map-for-field-C: {5: [doc1, doc3]}
}

For sorting on field A, I can just look at the map-for-field-A and sort the
keys (and
perhaps keep it sorted too for saving the sort each time). For facets on
field A, I can
again, just look at the map-for-field-A and get counts for each value. So I
will
get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
Similarly facets(Field-C) = {5:2}

Why is this not performant? All it did was to bring one data-structure into
memory and if
the current implementation was changed to use OS-cache for the same, the
pressure on
the JVM would be reduced as well.

So the point I am trying to make here is that how does the data-structure of
docValues differ from the inverted index I showed above? And how does that
structure helps it become more performant? I do not want to factor in the
OS-cache perspective here for the time being because that could have been
fixed in the regular inverted index also. I just want to focus on the
data-structure
for now that how it is different from the inverted index. Please do not say
"columnar format" as
those 2 words really convey nothing to me.

If you can draw me the exact "columnar format" for the above example, then
it would be much appreciated.

Thanks
SG




On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: I do not see why sorting or faceting on any field A, B or C would
> be a problem. All the values for a field are there in one
> data-structure and it should be easy to sort or group-by on that.
>
> This is totally true just totally incomplete: ;)
>
> for a given field:
>
> Inverted structure (leaving out position information and the like):
>
> term1: doc1,   doc37, doc 95
> term2: doc10, doc37, doc 950
>
> docValues structure (assuming multiValued):
>
> doc1: term1
> doc10: term2
> doc37: term1 term2
> doc95: term1
> doc950: term2
>
> They are used to answer two different questions.
>
> The inverted structure efficiently answers "for term1, what docs does
> it appear in?"
>
> The docValues structure efficiently answers "for doc1, what terms are
> in the field?"
>
> So imagine you have a search on term1. It's a simple iteration of the
> inverted structure to get my result set, namely docs 1, 37, and 95.
>
> But now I want to facet. I have to get the _values_ for my field from
> the entire result set in order to fill my count buckets. Using the
> uninverted structure, I'd have to scan the entire table term-by-term
> and look to see if the term appeared in any of docs 1, 37, 95 and add
> to my total for the term. Think "table scan".
>
> Instead I use the docValues structure which is much faster, I already
> know all I'm interested in is these three docs, so I just read the
> terms in the field for each doc and add to my counts. Again, to answer
> this question from the wrong (in this case inverted structure) I'd
> have to do a table scan. Also, this would be _extremely_ expensive to
> do from stored fields.
>
> And it's the inverse for searching the docValues structure. In order
> to find which doc has term1, I'd have to examine all the terms for the
> field for each document in my index. Horribly painful.
>
> So yes, the information is all there in one structure or the other and
> you _could_ get all of it from either one. You'd also have a system
> that was able to serve 0.00001 QPS on a largish index.
>
> And remember that this is very simplified. If you have a complex query
> you need to get a result set before even considering the
> facet/sort/whatever question so gathering the term information as I
> searched wouldn't particularly work.
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
> > Hi,
> >
> > It seems that docValues are not really explained well anywhere.
> > Here are 2 links that try to explain it:
> > 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> > 2)
> > https://www.elastic.co/guide/en/elasticsearch/guide/
> current/docvalues.html
> >
> > And official Solr documentation that does not explain the internal
> details
> > at all:
> > 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> >
> > The first links says that:
> >   The row-oriented (stored fields) are
> >   {
> >     'doc1': {'A':1, 'B':2, 'C':3},
> >     'doc2': {'A':2, 'B':3, 'C':4},
> >     'doc3': {'A':4, 'B':3, 'C':2}
> >   }
> >
> >   while column-oriented (docValues) are:
> >   {
> >     'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >     'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >     'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >   }
> >
> > And the second link gives an example as:
> > Doc values maps documents to the terms contained by the document:
> >
> >   Doc      Terms
> >   -----------------------------------------------------------------
> >   Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
> >   Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
> >   Doc_3 | dog, dogs, fox, jumped, over, quick, the
> >   -----------------------------------------------------------------
> >
> >
> > To me, this example is same as the row-oriented (stored fields) format in
> > the first link.
> > Which one is right?
> >
> >
> >
> > Also, the column-oriented (docValues) mentioned above are:
> > {
> >   'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >   'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >   'C': {'doc1':3, 'doc2':4, 'doc3':2}
> > }
> > Isn't this what the inverted index also looks like?
> > Inverted index is an index of the term (A,B,C) to the document and the
> > position it is found in the document.
> >
> >
> > Or is it better to say that the inverted index is of the form:
> > {
> >    map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
> >    map-for-field-B: {2: doc1, 3: [doc2,doc3]}
> >    map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> > }
> > But even if that is true, I do not see why sorting or faceting on any
> field
> > A, B or C would be a problem.
> > All the values for a field are there in one data-structure and it should
> be
> > easy to sort or group-by on that.
> >
> > Can someone explain the above a bit more clearly please? A build-upon the
> > same example as above would be really good.
> >
> >
> > Thanks
> > SG
>

Re: Confusing DocValues documentation

Reply via email to