Re: Confusing DocValues documentation

2018-01-02 Thread Shawn Heisey

On 12/22/2017 12:45 PM, Tech Id wrote:
> It seems that stored="false" docValues="true" is the default in Solr's
> github and the recommended way to go.

Like most things in Solr, there's no simple answer.  It depends.

For the purposes of information retrieval (not facets, grouping, or
sorting), whether you want stored or docValues will depend on a number
of factors.

Stored field data is compressed in the index.  This means that it takes
additional CPU processing to get the data from the index, but less data
must be read from disk.  DocValues is stored very differently.  With
docValues, the data is NOT compressed, and all of the values for one
field for all documents across the entire index segment are written in
one place, separately from any other field's docValues data.

If you are returning all fields for a document and there are more than a
few fields, then accessing stored data and decompressing it is probably
going to be faster than accessing docValues data.  For one thing, all
the stored data for a single document is compressed and written
together.  With docValues, each field is in a different place, so
multiple parts of the disk will need to be accessed to get results for
multiple fields of a single document.

If the index is small enough that it can easily be cached by the OS,
then docValues will probably be faster, because accessing the data will
be lightning fast and no decompression step is necessary.  But if the
index is too big to be fully cached, then only experimentation will
allow you to know which is better.
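
The layout difference described above can be pictured with a toy model. This is only a sketch under assumptions: the class name, field names and data are invented for illustration, and it ignores compression and Lucene's real on-disk format. Doc IDs are zero-based array positions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the two layouts; not Lucene's actual on-disk format. */
final class LayoutSketch {
    // Stored fields: one record per document holding all of its fields together.
    static final List<Map<String, String>> STORED = List.of(
        Map.of("id", "a", "color", "red", "size", "L"),
        Map.of("id", "b", "color", "blue", "size", "M"));

    // docValues: one column per field, each indexed by internal doc ID.
    static final Map<String, String[]> DOC_VALUES = Map.of(
        "id", new String[] {"a", "b"},
        "color", new String[] {"red", "blue"},
        "size", new String[] {"L", "M"});

    /** Returning a whole document from stored fields touches a single record. */
    static Map<String, String> fromStored(int docId) {
        return STORED.get(docId);
    }

    /** Returning a whole document from docValues touches one column per field. */
    static Map<String, String> fromDocValues(int docId, List<String> fields) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String field : fields) {
            doc.put(field, DOC_VALUES.get(field)[docId]); // one lookup per column
        }
        return doc;
    }
}
```

With many fields per document, fromDocValues performs one lookup per field in a different place each time, which is the access pattern that can make stored fields the better choice for returning whole documents.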

For facets, grouping, and/or sorting, using docValues instead of indexed
data (indexed="true") will generally offer better performance, and WILL
use less heap memory.  Frequently, deciding which way performs better
requires experimentation.  Using indexed data and a larger heap could
perform better in some situations.

For information retrieval, stored is *usually* better than docValues,
but not always.

Thanks,
Shawn

Re: Confusing DocValues documentation

2017-12-22 Thread Erick Erickson

About the docs: we recently changed the documentation to AsciiDoc format.

One of the ways to contribute is to raise a JIRA and submit a
documentation patch.
See: https://wiki.apache.org/solr/HowToContribute

It's valuable to have people who are reading the docs and trying to
understand them help update them with fresh eyes.

Best,
Erick

Re: Confusing DocValues documentation

2017-12-22 Thread Tech Id

Thanks Emir,

It seems that stored="false" docValues="true" is the default in Solr's
github and the recommended way to go.


grep "docValues=\"true\"" ./server/solr/configsets/_default/conf/managed-schema

  Point fields don't support FieldCache, so they must have docValues="true"
  if needed for sorting, faceting, functions, etc.

So all the basic field types (single- and multi-valued) have
docValues="true", and I assume stored="false" is the default.
But I do not get why the "id" field and the dynamic fields have
stored="true" in Solr 7:

grep "stored=\"true\"" ./server/solr/configsets/_default/conf/managed-schema | grep -v "\*_txt_"

That is perhaps a bug?



Booleans seem to care neither about stored nor docValues:

grep -i boolean ./server/solr/configsets/_default/conf/managed-schema

-T




Re: Confusing DocValues documentation

2017-12-22 Thread Emir Arnautović

Your questions are already more or less answered:
> 1) If the docValues are that good, can we get rid of the stored values
> altogether?
You can if you want - just configure your field with stored="false" and
docValues="true". Note that you can do that only if:
* the field is not analyzed (you cannot enable docValues for an analyzed field)
* you do not care about the order of your values (multi-valued docValues
  come back in sorted order, not in the order they were added)

> 2) And why are docValues not enabled by default for multi-valued fields?
Because they add overhead at indexing time and are not used in all
cases - only if the field is used for faceting, sorting or in functions.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



Re: Confusing DocValues documentation

2017-12-22 Thread Tech Id

Very interesting discussion SG and Erick.
I wish these details were part of the official Solr documentation as well.
And yes, "columnar format" did not give any useful information to me either.


"A good explanation increases contributions to the project as more people
become empowered to improvise."
   - Self, LOL


I was expecting sorting, faceting and pivoting to be a bit more optimized
with docValues, something like a pre-calculated bit of information.
However, now it seems that the major benefit of docValues is to optimize
the lookup time of stored fields.
Here is the sorting function I wrote as pseudo-code from the discussion:


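// Pseudo-code: filterDocsOnQuery and loadDocValues are placeholders, not Solr APIs.
int[] docIDs = filterDocsOnQuery(query);
T[] docValues = loadDocValues(sortField);   // indexed by internal doc ID
// Map each value to ALL doc IDs that carry it, so duplicate values are not lost.
TreeMap<T, List<Integer>> sortFieldValues = new TreeMap<>();
for (int docId : docIDs) {
    T val = docValues[docId];
    sortFieldValues.computeIfAbsent(val, v -> new ArrayList<>()).add(docId);
}
// return docIDs sorted by value
List<Integer> sorted = new ArrayList<>();
sortFieldValues.values().forEach(sorted::addAll);
return sorted;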


It is indeed difficult to pre-compute the sorts and facets because we do
not know what docIDs will be returned by the filtering.
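
A runnable variant of the sorting idea, as a minimal sketch: it assumes the docValues column can be modeled as a plain long[] indexed by a zero-based doc ID, which is a simplification of what Lucene does. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sorts matching doc IDs by a column of values; a toy stand-in for docValues. */
final class DocValuesSortSketch {
    /**
     * @param matchingDocIds doc IDs returned by the (hypothetical) filter step
     * @param column         value of the sort field for every doc, indexed by doc ID
     */
    static int[] sortByField(int[] matchingDocIds, long[] column) {
        return Arrays.stream(matchingDocIds)
            .boxed()
            // Each comparison is two array reads; ties are broken by doc ID.
            .sorted(Comparator.<Integer>comparingLong(id -> column[id])
                .thenComparingInt(id -> id))
            .mapToInt(Integer::intValue)
            .toArray();
    }
}
```

The sort itself still happens at query time, because the set of matching documents is only known then. What the column removes is the cost of finding each document's value.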

Two last questions I have are:
1) If the docValues are that good, can we get rid of the stored values
altogether?
2) And why are docValues not enabled by default for multi-valued fields?


-T




Re: Confusing DocValues documentation

2017-12-21 Thread Erick Erickson

OK, last bit of the tutorial.

bq: But that does not seem to be helping with sorting or faceting of any kind.
This seems to be like a good way to speed up a stored field's retrieval.

These are the same thing. I have two docs. I have to know how they
sort. Therefore I need the value in the sort field for each. This is the
same thing as getting the stored value, no?

As for facets it's the same problem. To count facet buckets I have to
find the values for the field for each document in the results list
and tally them. This is also getting the stored value, right? You're
asking "for the docs in my result set, how many of them have val1, how
many have val2, how many have val54", etc.

And as an aside the docValues can also be used to return the stored value.
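
The facet tally described above can be sketched as follows. This is an illustration only: it assumes the field's values sit in a String[] indexed by a zero-based doc ID, with null for documents that have no value, and the names are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Counts facet buckets by reading one value per result document. */
final class FacetCountSketch {
    /** column[docId] holds the field's value for that doc, or null if absent. */
    static Map<String, Integer> facetCounts(int[] resultDocIds, String[] column) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : resultDocIds) {
            String value = column[docId];          // the per-document lookup
            if (value != null) {
                counts.merge(value, 1, Integer::sum); // tally the bucket
            }
        }
        return counts;
    }
}
```

The loop does one value lookup per document in the result set, so the cost of that lookup dominates when the result set is large.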

Best,
Erick


Re: Confusing DocValues documentation

2017-12-21 Thread S G

Thank you Erick.

I guess the biggest piece I was missing was the sort on a field other than
the search field.
Once you have filtered a list of documents and then you want to sort, the
inverted index cannot be used for lookup.
You just have doc-IDs, which are the values in the inverted index, not the keys.
Hence they cannot be "looked up" - the only option is to loop through all the
entries of that field's inverted index.

DocValues come to the rescue by reducing that looping operation to a lookup
again.
Because in docValues, the key (i.e. array-index) is the document-index and
gives an O(1) lookup for any doc-ID.
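
The contrast between the two lookups can be sketched like this. It is a simplified model, not Lucene code: the inverted index is assumed to be a Map from value to a list of zero-based doc IDs, and the names are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Contrasts scanning an inverted index with reading a doc-ID-indexed array. */
final class UninvertSketch {
    /** Scans every postings list until the doc is found: the slow "looping" path. */
    static Integer valueViaInvertedIndex(Map<Integer, List<Integer>> postings, int docId) {
        for (Map.Entry<Integer, List<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) {
                return e.getKey();
            }
        }
        return null; // the doc has no value for this field
    }

    /** Builds a doc-ID-indexed array once, so later lookups are a single array read. */
    static Integer[] uninvert(Map<Integer, List<Integer>> postings, int maxDoc) {
        Integer[] column = new Integer[maxDoc];
        postings.forEach((value, docIds) -> docIds.forEach(id -> column[id] = value));
        return column;
    }
}
```

Writing the column at index time, rather than building it from the inverted index at query time, is the essence of what docValues add.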


But that does not seem to be helping with sorting or faceting of any kind.
This seems to be like a good way to speed up a stored field's retrieval.

DocValues in the current example are:
FieldA
doc1 = 1
doc2 = 2
doc3 =

FieldB
doc1 = 2
doc2 = 4
doc3 = 5

FieldC
doc1 = 5
doc2 =
doc3 = 5

So if I have to run a query:
q=fieldA:*&sort=fieldB asc
I will get the documents that pass the filter (doc1 and doc2, since doc3
has no value for FieldA) and then I will look up the values of field-B
from the docValues.
That will give me 2,4
This is sorted in this case, but assume that this was not sorted.
(The docValues array is indexed by Lucene's doc-ID not the field-value
after all, right?)

Then does Lucene/Solr still sort them like a regular array of values?
That does not seem very efficient.
And it does not seem to be helping with faceting or pivoting either.
What did I miss?

Thanks
SG






On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson 
wrote:

> Here's where you're going off the rails: "I can just look at the
> map-for-field-A"
>
> As I said before, you're totally right, all the information you need
> is there. But
> you're thinking of this as though speed weren't a premium when you say.
> "I can just look". Consider that there are single replicas out there
> with 300M (or more) docs in them. "Just looking" in a list 300M items
> long 300M times
> (q=*:*&sort=whatever) is simply not going to be performant compared to
> 300M indexing operations which is what DV does.
>
> Faceting is much worse.
>
> Plus space is also at a premium. Java takes 40+ bytes to store the first
> character. So any Java structure you use is going to be enormous. 300M ints
> is bad enough. And if you spoof this by using ordinals as Lucene does,
> you're well on your way to reinventing docValues.
>
> Maybe this will help. Imagine you have a phone book in your hands. It
> consists of documents like this:
>
> id: something
> phone: phone number
> name: person's name
>
> For simplicity, they're both string types 'cause they sort.
>
> Let's search by phone number but sort by name, i.e.
>
> q=phone:1234*&sort=name asc
>
> I'm searching and find two docs that match. How do I know how they
> sort wrt each other?
>
> I'm searching in the phone field but I need the value for each doc
> associated with the name field. In your example I'm searching in
> map-for-fieldA but sorting in map-for-field-B
>
> To get the name value for these two docs I have to enumerate
> map-for-field-B until I find each doc and then I can get the proper
> value and know how they sort. Sure, I could do some ordering and do a
> binary search but that's still vastly slower than having a structure
> that's a simple index operation to get the value in its field.
>
> The DV structure is actually more like what's below. These structures
> are simply an array indexed by the _internal_ Lucene document id,
> which is a simple zero-based integer that contains the value
> associated with that doc for that field (I'm simplifying a bit, but
> that's conceptually the deal).
> FieldA
> doc1 = 1
> doc2 = 2
> doc3 =
>
> FieldB
> doc1 = 2
> doc2 = 4
> doc3 = 5
>
> FieldC
> doc1 = 5
> doc2 =
> doc3 = 5
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 4:05 PM, S G  wrote:
> > Thanks a lot Erick and Emir.
> >
> > I am still a bit confused and an example will help me a lot.
> > Here is a little bit modified version of the same to illustrate my point
> > more clearly.
> >
> > Let us consider 3 documents - doc1, doc2 and doc3
> > Each contains up to 3 fields - A, B and C.
> > And the values for these fields are random.
> > For example:
> > doc1 = {A:1, B:2, C:5}
> > doc2 = {A:2, B:4}
> > doc3 = {B:5, C:5}
> >
> >
> > Inverted Index for the same should be a map of:
> > Key: 
> > Value: 
> > i.e.
> > {
> >    map-for-field-A: {1: doc1, 2: doc2}
> >    map-for-field-B: {2: doc1, 4: doc2, 5: doc3}
> >    map-for-field-C: {5: [doc1, doc3]}
> > }
> >
> > For sorting on field A, I can just look at the map-for-field-A and sort
> > the keys (and perhaps keep it sorted too for saving the sort each time).
> > For facets on field A, I can again, just look at the map-for-field-A and
> > get counts for each value. So I will get facets(Field-A) = {1:1, 2:1}
> > because count for each value is 1.
> > Similarly facets(Field-C) = {5:2}
> >
> > Why is this not performant? All it did was to bring one data-structure
> > into memory and if the current imp

Re: Confusing DocValues documentation

2017-12-21 Thread Erick Erickson

Here's where you're going off the rails: "I can just look at the
map-for-field-A"

As I said before, you're totally right, all the information you need
is there. But you're thinking of this as though speed weren't at a premium
when you say "I can just look". Consider that there are single replicas out
there with 300M (or more) docs in them. "Just looking" in a list 300M items
long 300M times (q=*:*&sort=whatever) is simply not going to be performant
compared to 300M array-index operations, which is what DV does.

Faceting is much worse.

Plus space is also at a premium. A Java String takes 40+ bytes to store its
first character, so any Java structure you use is going to be enormous. 300M
ints is bad enough. And if you spoof this by using ordinals as Lucene does,
you're well on your way to reinventing docValues.
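
To make the ordinal idea concrete, here is a minimal Python sketch. It is an illustration only, not Lucene's actual on-disk format; the function name `encode_ordinals`, the sample names, and the use of -1 for a missing value are all assumptions made for this example. Each distinct value is stored once in a sorted dictionary, and each document holds only a small integer.

```python
def encode_ordinals(values_by_doc):
    """Turn per-doc string values into (sorted term dictionary, per-doc ordinals).

    values_by_doc is a list indexed by a zero-based doc id; None means the
    document has no value for the field.
    """
    terms = sorted({v for v in values_by_doc if v is not None})
    ord_of = {term: i for i, term in enumerate(terms)}
    # -1 marks a missing value (an assumption of this sketch).
    ords = [ord_of[v] if v is not None else -1 for v in values_by_doc]
    return terms, ords


names = ["smith", "jones", None, "smith", "adams"]
terms, ords = encode_ordinals(names)
print(terms)  # ['adams', 'jones', 'smith']
print(ords)   # [2, 1, -1, 2, 0]
```

Because the dictionary is sorted, comparing two ordinals gives the same answer as comparing the two strings, so sorting can work on small integers instead of Java Strings.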

Maybe this will help. Imagine you have a phone book in your hands. It
consists of documents like this:

id: something
phone: phone number
name: person's name

For simplicity, phone and name are both string types 'cause they sort.

Let's search by phone number but sort by name, i.e.

q=phone:1234*&sort=name asc

I'm searching and find two docs that match. How do I know how they
sort wrt each other?

I'm searching in the phone field but I need the value for each doc
associated with the name field. In your example I'm searching in
map-for-field-A but sorting in map-for-field-B.

To get the name value for these two docs I have to enumerate
map-for-field-B until I find each doc and then I can get the proper
value and know how they sort. Sure, I could do some ordering and do a
binary search but that's still vastly slower than having a structure
that's a simple index operation to get the value in its field.
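
A rough Python sketch of that phone-book case, under stated assumptions: the three documents are made up, the matching step is a brute-force scan purely to keep the example short, and `name_from_inverted` / `name_from_dv` are hypothetical helper names. The point is only the difference in how the name is fetched for each matched doc.

```python
# Toy phone book; doc ids are zero-based integers (all data is made up).
docs = [
    {"phone": "12345", "name": "smith"},
    {"phone": "99999", "name": "adams"},
    {"phone": "12346", "name": "jones"},
]

# Inverted structure for the name field: term -> list of doc ids.
inverted_name = {}
for doc_id, doc in enumerate(docs):
    inverted_name.setdefault(doc["name"], []).append(doc_id)

# docValues-like structure for the name field: doc id -> value.
dv_name = [doc["name"] for doc in docs]


def name_from_inverted(doc_id):
    """Scan every term's posting list until the doc id turns up."""
    for term, postings in inverted_name.items():
        if doc_id in postings:
            return term
    return None


def name_from_dv(doc_id):
    """A single array index."""
    return dv_name[doc_id]


# q=phone:1234*&sort=name asc (matching is brute force here for brevity)
matches = [i for i, doc in enumerate(docs) if doc["phone"].startswith("1234")]
print(sorted(matches, key=name_from_dv))        # [2, 0]
print(sorted(matches, key=name_from_inverted))  # [2, 0]
```

Both calls give the same order, but the inverted lookup walks the term table once per matched doc, while the docValues-style lookup is one array access per matched doc.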

The DV structure is actually more like what's below. Each structure
is simply an array indexed by the _internal_ Lucene document id
(a simple zero-based integer), and each slot contains the value
associated with that doc for that field (I'm simplifying a bit, but
that's conceptually the deal).
FieldA
doc1 = 1
doc2 = 2
doc3 =

FieldB
doc1 = 2
doc2 = 4
doc3 = 5

FieldC
doc1 = 5
doc2 =
doc3 = 5
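
The same three columns can be sketched in Python as plain lists, one per field. This is a conceptual model only: doc1, doc2 and doc3 are assumed to map to internal ids 0, 1 and 2, `None` stands for a missing value, and putting docs without a value last when sorting is a choice made for this example.

```python
# doc1, doc2, doc3 become internal ids 0, 1, 2; None marks a missing value.
doc_values = {
    "A": [1, 2, None],
    "B": [2, 4, 5],
    "C": [5, None, 5],
}


def sort_by_field(doc_ids, field):
    """Sort doc ids by a field, placing docs without a value last."""
    column = doc_values[field]
    return sorted(doc_ids, key=lambda d: (column[d] is None, column[d] or 0))


def facet(doc_ids, field):
    """Count the values of a field across the given result set."""
    column = doc_values[field]
    counts = {}
    for d in doc_ids:
        value = column[d]
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return counts


print(sort_by_field([2, 1, 0], "A"))  # [0, 1, 2]
print(facet([0, 2], "C"))             # {5: 2}
```

Every lookup is `column[d]`, so the cost depends on the size of the result set rather than on the number of distinct values in the index.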

Best,
Erick

On Thu, Dec 21, 2017 at 4:05 PM, S G  wrote:
> Thanks a lot Erick and Emir.
>
> I am still a bit confused and an example will help me a lot.
> Here is a little bit modified version of the same to illustrate my point
> more clearly.
>
> Let us consider 3 documents - doc1, doc2 and doc3
> Each contains upto 3 fields - A, B and C.
> And the values for these fields are random.
> For example:
> doc1 = {A:1, B:2, C:5}
> doc2 = {A:2, B:4}
> doc3 = {B:5, C:5}
>
>
> Inverted Index for the same should be a map of:
> Key: 
> Value: 
> i.e.
> {
>map-for-field-A: {1: doc1, 2: doc2}
>map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
>map-for-field-C: {5: [doc1, doc3]}
> }
>
> For sorting on field A, I can just look at the map-for-field-A and sort the
> keys (and
> perhaps keep it sorted too for saving the sort each time). For facets on
> field A, I can
> again, just look at the map-for-field-A and get counts for each value. So I
> will
> get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
> Similarly facets(Field-C) = {5:2}
>
> Why is this not performant? All it did was to bring one data-structure into
> memory and if
> the current implementation was changed to use OS-cache for the same, the
> pressure on
> the JVM would be reduced as well.
>
> So the point I am trying to make here is that how does the data-structure of
> docValues differ from the inverted index I showed above? And how does that
> structure helps it become more performant? I do not want to factor in the
> OS-cache perspective here for the time being because that could have been
> fixed in the regular inverted index also. I just want to focus on the
> data-structure
> for now that how it is different from the inverted index. Please do not say
> "columnar format" as
> those 2 words really convey nothing to me.
>
> If you can draw me the exact "columnar format" for the above example, then
> it would be much appreciated.
>
> Thanks
> SG
>
>
>
>
> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson 
> wrote:
>
>> bq: I do not see why sorting or faceting on any field A, B or C would
>> be a problem. All the values for a field are there in one
>> data-structure and it should be easy to sort or group-by on that.
>>
>> This is totally true just totally incomplete: ;)
>>
>> for a given field:
>>
>> Inverted structure (leaving out position information and the like):
>>
>> term1: doc1,   doc37, doc 95
>> term2: doc10, doc37, doc 950
>>
>> docValues structure (assuming multiValued):
>>
>> doc1: term1
>> doc10: term2
>> doc37: term1 term2
>> doc95: term1
>> doc950: term2
>>
>> They are used to answer two different questions.
>>
>> The inverted structure efficiently answers "for term1, what docs does
>> it appear in?"
>>
>> The docValues structure efficiently answers "for doc1, what terms are
>> in the field?"
>>
>> So imagine you have a search on term1. It'

Re: Confusing DocValues documentation

2017-12-21 Thread S G

Thanks a lot Erick and Emir.

I am still a bit confused and an example will help me a lot.
Here is a slightly modified version of the same example to illustrate my
point more clearly.

Let us consider 3 documents - doc1, doc2 and doc3
Each contains up to 3 fields - A, B and C.
And the values for these fields are random.
For example:
doc1 = {A:1, B:2, C:5}
doc2 = {A:2, B:4}
doc3 = {B:5, C:5}


Inverted Index for the same should be a map of:
Key: 
Value: 
i.e.
{
   map-for-field-A: {1: doc1, 2: doc2}
   map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
   map-for-field-C: {5: [doc1, doc3]}
}

For sorting on field A, I can just look at the map-for-field-A and sort the
keys (and perhaps keep it sorted too for saving the sort each time). For
facets on field A, I can again just look at the map-for-field-A and get
counts for each value. So I will get facets(Field-A) = {1:1, 2:1} because
the count for each value is 1.
Similarly facets(Field-C) = {5:2}

Why is this not performant? All it did was to bring one data-structure into
memory, and if the current implementation was changed to use OS-cache for
the same, the pressure on the JVM would be reduced as well.

So the point I am trying to make here is: how does the data-structure of
docValues differ from the inverted index I showed above? And how does that
structure help it become more performant? I do not want to factor in the
OS-cache perspective here for the time being because that could have been
fixed in the regular inverted index also. I just want to focus on the
data-structure for now and how it is different from the inverted index.
Please do not say "columnar format" as those 2 words really convey nothing
to me.

If you can draw me the exact "columnar format" for the above example, then
it would be much appreciated.

Thanks
SG




On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson 
wrote:

> bq: I do not see why sorting or faceting on any field A, B or C would
> be a problem. All the values for a field are there in one
> data-structure and it should be easy to sort or group-by on that.
>
> This is totally true just totally incomplete: ;)
>
> for a given field:
>
> Inverted structure (leaving out position information and the like):
>
> term1: doc1,   doc37, doc 95
> term2: doc10, doc37, doc 950
>
> docValues structure (assuming multiValued):
>
> doc1: term1
> doc10: term2
> doc37: term1 term2
> doc95: term1
> doc950: term2
>
> They are used to answer two different questions.
>
> The inverted structure efficiently answers "for term1, what docs does
> it appear in?"
>
> The docValues structure efficiently answers "for doc1, what terms are
> in the field?"
>
> So imagine you have a search on term1. It's a simple iteration of the
> inverted structure to get my result set, namely docs 1, 37, and 95.
>
> But now I want to facet. I have to get the _values_ for my field from
> the entire result set in order to fill my count buckets. Using the
> uninverted structure, I'd have to scan the entire table term-by-term
> and look to see if the term appeared in any of docs 1, 37, 95 and add
> to my total for the term. Think "table scan".
>
> Instead I use the docValues structure which is much faster, I already
> know all I'm interested in is these three docs, so I just read the
> terms in the field for each doc and add to my counts. Again, to answer
> this question from the wrong (in this case inverted structure) I'd
> have to do a table scan. Also, this would be _extremely_ expensive to
> do from stored fields.
>
> And it's the inverse for searching the docValues structure. In order
> to find which doc has term1, I'd have to examine all the terms for the
> field for each document in my index. Horribly painful.
>
> So yes, the information is all there in one structure or the other and
> you _could_ get all of it from either one. You'd also have a system
> that was able to serve 0.1 QPS on a largish index.
>
> And remember that this is very simplified. If you have a complex query
> you need to get a result set before even considering the
> facet/sort/whatever question so gathering the term information as I
> searched wouldn't particularly work.
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 9:56 AM, S G  wrote:
> > Hi,
> >
> > It seems that docValues are not really explained well anywhere.
> > Here are 2 links that try to explain it:
> > 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> > 2)
> > https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
> >
> > And official Solr documentation that does not explain the internal details
> > at all:
> > 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> >
> > The first links says that:
> >   The row-oriented (stored fields) are
> >   {
> > 'doc1': {'A':1, 'B':2, 'C':3},
> > 'doc2': {'A':2, 'B':3, 'C':4},
> > 'doc3': {'A':4, 'B':3, 'C':2}
> >   }
> >
> >   while column-oriented (docValues) are:
> >   {
> > 'A': {'doc1':1, 'doc2':2, 'doc3':4},
> > 'B': {'doc

Re: Confusing DocValues documentation

2017-12-21 Thread Erick Erickson

bq: I do not see why sorting or faceting on any field A, B or C would
be a problem. All the values for a field are there in one
data-structure and it should be easy to sort or group-by on that.

This is totally true, just totally incomplete ;)

for a given field:

Inverted structure (leaving out position information and the like):

term1: doc1,  doc37, doc95
term2: doc10, doc37, doc950

docValues structure (assuming multiValued):

doc1: term1
doc10: term2
doc37: term1 term2
doc95: term1
doc950: term2

They are used to answer two different questions.

The inverted structure efficiently answers "for term1, what docs does
it appear in?"

The docValues structure efficiently answers "for doc1, what terms are
in the field?"
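
As a small illustration, both structures can be built from the same field contents. This Python sketch uses plain dicts and lists as stand-ins (an assumption; Lucene's real structures are far more compact), with the doc ids and terms taken from the example above.

```python
from collections import defaultdict

# Field contents per document (multiValued), matching the example above.
field_terms = {
    1: ["term1"],
    10: ["term2"],
    37: ["term1", "term2"],
    95: ["term1"],
    950: ["term2"],
}

# Inverted: term -> doc ids in ascending order.
inverted = defaultdict(list)
for doc_id in sorted(field_terms):
    for term in field_terms[doc_id]:
        inverted[term].append(doc_id)

# docValues-like: doc id -> sorted terms.
doc_values = {doc_id: sorted(terms) for doc_id, terms in field_terms.items()}

print(inverted["term1"])  # "for term1, what docs?"  -> [1, 37, 95]
print(doc_values[37])     # "for doc37, what terms?" -> ['term1', 'term2']
```

Each question is a single lookup in one structure and a full scan in the other.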

So imagine you have a search on term1. It's a simple iteration of the
inverted structure to get my result set, namely docs 1, 37, and 95.

But now I want to facet. I have to get the _values_ for my field from
the entire result set in order to fill my count buckets. Using the
inverted structure, I'd have to scan the entire table term-by-term
and look to see if the term appeared in any of docs 1, 37, 95 and add
to my total for the term. Think "table scan".

Instead I use the docValues structure, which is much faster. I already
know all I'm interested in is these three docs, so I just read the
terms in the field for each doc and add to my counts. Again, to answer
this question from the wrong structure (in this case the inverted one) I'd
have to do a table scan. Also, this would be _extremely_ expensive to
do from stored fields.
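
The two ways of filling the count buckets might be sketched like this in Python. The data is the five-document example from this message; the function names are hypothetical, and plain dicts stand in for the real index structures.

```python
inverted = {"term1": [1, 37, 95], "term2": [10, 37, 950]}
doc_values = {1: ["term1"], 10: ["term2"], 37: ["term1", "term2"],
              95: ["term1"], 950: ["term2"]}


def facet_from_inverted(result_set):
    """Walk every term in the index and intersect with the result set."""
    wanted = set(result_set)
    counts = {}
    for term, postings in inverted.items():  # touches the whole term table
        hits = sum(1 for doc_id in postings if doc_id in wanted)
        if hits:
            counts[term] = hits
    return counts


def facet_from_doc_values(result_set):
    """Touch only the documents in the result set."""
    counts = {}
    for doc_id in result_set:
        for term in doc_values[doc_id]:
            counts[term] = counts.get(term, 0) + 1
    return counts


result_set = [1, 37, 95]  # docs matching a search on term1
print(facet_from_inverted(result_set))    # {'term1': 3, 'term2': 1}
print(facet_from_doc_values(result_set))  # {'term1': 3, 'term2': 1}
```

The answers agree; what differs is the work. The first loop grows with the number of terms and postings in the whole index, the second only with the size of the result set.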

And it's the inverse for searching the docValues structure. In order
to find which doc has term1, I'd have to examine all the terms for the
field for each document in my index. Horribly painful.

So yes, the information is all there in one structure or the other and
you _could_ get all of it from either one. You'd also have a system
that was able to serve 0.1 QPS on a largish index.

And remember that this is very simplified. If you have a complex query,
you need to get a result set before even considering the
facet/sort/whatever question, so gathering the term information as I
searched wouldn't particularly work.

Best,
Erick

On Thu, Dec 21, 2017 at 9:56 AM, S G  wrote:
> Hi,
>
> It seems that docValues are not really explained well anywhere.
> Here are 2 links that try to explain it:
> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> 2)
> https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
>
> And official Solr documentation that does not explain the internal details
> at all:
> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>
> The first links says that:
>   The row-oriented (stored fields) are
>   {
> 'doc1': {'A':1, 'B':2, 'C':3},
> 'doc2': {'A':2, 'B':3, 'C':4},
> 'doc3': {'A':4, 'B':3, 'C':2}
>   }
>
>   while column-oriented (docValues) are:
>   {
> 'A': {'doc1':1, 'doc2':2, 'doc3':4},
> 'B': {'doc1':2, 'doc2':3, 'doc3':3},
> 'C': {'doc1':3, 'doc2':4, 'doc3':2}
>   }
>
> And the second link gives an example as:
> Doc values maps documents to the terms contained by the document:
>
>   Doc  Terms
>   -
>   Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>   Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>   Doc_3 | dog, dogs, fox, jumped, over, quick, the
>   -
>
>
> To me, this example is same as the row-oriented (stored fields) format in
> the first link.
> Which one is right?
>
>
>
> Also, the column-oriented (docValues) mentioned above are:
> {
>   'A': {'doc1':1, 'doc2':2, 'doc3':4},
>   'B': {'doc1':2, 'doc2':3, 'doc3':3},
>   'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
> Isn't this what the inverted index also looks like?
> Inverted index is an index of the term (A,B,C) to the document and the
> position it is found in the document.
>
>
> Or is it better to say that the inverted index is of the form:
> {
>map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> }
> But even if that is true, I do not see why sorting or faceting on any field
> A, B or C would be a problem.
> All the values for a field are there in one data-structure and it should be
> easy to sort or group-by on that.
>
> Can someone explain the above a bit more clearly please? A build-upon the
> same example as above would be really good.
>
>
> Thanks
> SG

Re: Confusing DocValues documentation

2017-12-21 Thread Emir Arnautović

Hi SG,
It is all OK; it's just that the notation is different. Please see inline comments.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Dec 2017, at 18:56, S G  wrote:
> 
> Hi,
> 
> It seems that docValues are not really explained well anywhere.
> Here are 2 links that try to explain it:
> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> 2)
> https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
> 
> And official Solr documentation that does not explain the internal details
> at all:
> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> 
> The first links says that:
>  The row-oriented (stored fields) are
>  {
>'doc1': {'A':1, 'B':2, 'C':3},
>'doc2': {'A':2, 'B':3, 'C':4},
>'doc3': {'A':4, 'B':3, 'C':2}
>  }
[EA] These are input documents. For completeness, it would be good if one
example were a multivalued field.

> 
>  while column-oriented (docValues) are:
>  {
>'A': {'doc1':1, 'doc2':2, 'doc3':4},
>'B': {'doc1':2, 'doc2':3, 'doc3':3},
>'C': {'doc1':3, 'doc2':4, 'doc3':2}
>  }
[EA] You can focus here on one field.

> 
> And the second link gives an example as:
> Doc values maps documents to the terms contained by the document:
> 
>  Doc  Terms
>  -
>  Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>  Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>  Doc_3 | dog, dogs, fox, jumped, over, quick, the
>  
[EA] And this is the "multiline" version of a single field with multiple values.
Note here that terms are deduplicated and sorted.

> 
> 
> To me, this example is same as the row-oriented (stored fields) format in
> the first link.
> Which one is right?
[EA] As explained earlier, this is a single-field, column-oriented structure. In
the first link's notation, row-oriented would be:
{
  'Doc_1': {'text_field': 'The quick brown fox jumped over lazy dog', 'some_other_field': …},
  'Doc_2': …
}
and column-oriented would be:
{
  'text_field': {'Doc_1': ['brown', 'dog', 'fox', …], 'Doc_2': ['brown', 'dogs', …]}
}
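
A tiny Python sketch of how the row-oriented text turns into the deduplicated, sorted term list shown in the column-oriented form. Lowercasing and splitting on whitespace is an assumption of this sketch (real analysis chains do more), and `doc_value_terms` is a made-up helper name.

```python
def doc_value_terms(text):
    """Lowercase, split on whitespace, deduplicate, and sort the terms."""
    return sorted(set(text.lower().split()))


# Row-oriented: one entry per document, holding the original field text.
rows = {
    "Doc_1": {"text_field": "The quick brown fox jumped over lazy dog"},
}

# Column-oriented: one entry per field, holding each document's terms.
columns = {
    "text_field": {
        doc_id: doc_value_terms(fields["text_field"])
        for doc_id, fields in rows.items()
    }
}

print(columns["text_field"]["Doc_1"])
# ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
```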

> 
> 
> 
> Also, the column-oriented (docValues) mentioned above are:
> {
>  'A': {'doc1':1, 'doc2':2, 'doc3':4},
>  'B': {'doc1':2, 'doc2':3, 'doc3':3},
>  'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
> Isn’t this what the inverted index also looks like?
[EA] No, the inverted index is… well… inverted :) Keys are field values and
values are doc ids.

> Inverted index is an index of the term (A,B,C) to the document and the
> position it is found in the document.
> 
> 
> Or is it better to say that the inverted index is of the form:
> {
>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> }
[EA] This is the inverted index.
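
For what it is worth, the three notations in this thread can be derived from one another mechanically. The Python sketch below does that for the A/B/C documents from the first link; nested dicts are just a stand-in for the real structures, and `to_columns` / `to_inverted` are names invented for this example.

```python
# Row-oriented (stored fields): doc -> field -> value.
rows = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}


def to_columns(rows):
    """Column-oriented (docValues-like): field -> doc -> value."""
    columns = {}
    for doc_id, fields in rows.items():
        for field, value in fields.items():
            columns.setdefault(field, {})[doc_id] = value
    return columns


def to_inverted(columns):
    """Inverted index: field -> value -> list of docs."""
    inverted = {}
    for field, by_doc in columns.items():
        for doc_id, value in by_doc.items():
            inverted.setdefault(field, {}).setdefault(value, []).append(doc_id)
    return inverted


columns = to_columns(rows)
inverted = to_inverted(columns)
print(columns["B"])   # {'doc1': 2, 'doc2': 3, 'doc3': 3}
print(inverted["B"])  # {2: ['doc1'], 3: ['doc2', 'doc3']}
```

The column form is keyed by document within each field, while the inverted form is keyed by value within each field, which is the difference the inline comments above are pointing at.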

> But even if that is true, I do not see why sorting or faceting on any field
> A, B or C would be a problem.
[EA] It is more obvious when you try with multivalued fields: imagine you want
to facet on text_field in the previous example and have matched Doc_1,
Doc_2, …, Doc_n. How would you do it with only the inverted structure? You
would have to check each term to see how many docs from the result set it
contains. And stored fields are neither deduplicated nor optimized for quick
access.
On the other hand, you can use doc values as stored fields if you can accept
that the values will be sorted.

> All the values for a field are there in one data-structure and it should be
> easy to sort or group-by on that.
> 
> Can someone explain the above a bit more clearly please? A build-upon the
> same example as above would be really good.
> 
> 
> Thanks
> SG
