Re: Howto verify that only docValues are returned
bq: Do I have an incorrect understanding of how this works?

If I take "OS disk cache" to include the OS's memory available as a result of MMapDirectory, you're spot on.

I want to quibble a bit with (1) above. If you search on a docValues=true indexed=false field it's terrible unless you have a tiny, tiny, tiny data set. Think "table scan" here.

DocValues answer "for doc X, what is the term(s) in field Y" efficiently, which is what you need for sorting, grouping and faceting since you've already answered "what doc does term X appear in" through scoring. Conceptually, docValues are just an array, indexed by the internal Lucene doc ID, that contains the value for the field. So to _search_ on it you have to examine every cell in the array. That's the "uninverted" bit you sometimes see thrown around when people discuss docValues. The inverted structure built when indexed=true is what makes answering "for term Y, what documents does it appear in" efficient.

Anyway, I think the original question is how Julian can ensure that, if the fl list specifies a field where docValues=true and stored=true, the docValues value is returned rather than the stored value.

I don't know of any way offhand either. I can say that the entire streaming world would fall completely apart, since it would slow to a complete crawl if it had to export stored fields. That's indirect evidence at best. And frankly I haven't looked at the tests for useDocValuesAsStored and the like.

Best,
Erick

On Tue, Oct 17, 2017 at 10:01 AM, Shawn Heisey wrote:
> On 10/17/2017 2:09 AM, Julian Ohrt wrote:
>> The Solr 6.6 documentation states:
>>
>> In cases where the query is returning only docValues fields performance
>> may improve since returning stored fields requires disk reads and
>> decompression whereas returning docValues fields in the fl list only
>> requires memory access.
>
> I'm curious how this guarantee (that docValues are accessed from memory not
> disk) could possibly exist. I think the only way that this could be
> guaranteed is for Lucene to keep docValues data in the heap, but using
> docValues is supposed to *reduce* heap requirements, not increase them, so I
> don't think that's going to happen. If the data's not in the heap, then
> you're reliant on the OS disk cache as to whether or not the data is in
> memory, and that would be the case either way. Do I have an incorrect
> understanding of how this works?
>
> As I understand it, the potential advantage to docValues over stored data is
> two-fold: 1) docValues are accessed differently because all the values for
> one field across the entire Lucene segment are in one place. This can be a
> good thing or a bad thing depending on the query and the data
> characteristics, and it may not be obvious which way that will go. 2)
> docValues data is not compressed, so there's less CPU required. In cases
> where OS disk caching is insufficient and the compression ratio is really
> good, stored data might actually be faster.
>
> Thanks,
> Shawn
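
The contrast Erick draws above, between a per-document array and an inverted structure, can be illustrated with a toy model. This is only a conceptual sketch in plain Python, not Lucene's actual on-disk format; the corpus and all function names are made up for illustration.

```python
from collections import defaultdict

# Toy corpus (hypothetical): position in the list is the internal doc ID,
# the entry is the value of one single-valued field.
docs = ["red", "blue", "red", "green"]

# docValues-style ("uninverted") view: an array indexed by doc ID.
doc_values = list(docs)

# Inverted view (what indexed=true conceptually builds): term -> doc IDs.
inverted = defaultdict(list)
for doc_id, term in enumerate(docs):
    inverted[term].append(doc_id)


def value_for_doc(doc_id):
    """'For doc X, what is the value?' is a single array lookup."""
    return doc_values[doc_id]


def search_via_doc_values(term):
    """Without an inverted structure, every cell must be examined."""
    return [doc_id for doc_id, value in enumerate(doc_values) if value == term]


def search_via_inverted(term):
    """'For term Y, which docs?' is a single dictionary lookup."""
    return inverted.get(term, [])
```

Both search functions return the same doc IDs; the difference is that `search_via_doc_values` touches every document, which is the "table scan" behavior described above, while `value_for_doc` is the cheap access pattern that sorting, grouping and faceting rely on.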
Re: Howto verify that only docValues are returned
On 10/17/2017 2:09 AM, Julian Ohrt wrote:
> The Solr 6.6 documentation states:
>
> In cases where the query is returning only docValues fields performance
> may improve since returning stored fields requires disk reads and
> decompression whereas returning docValues fields in the fl list only
> requires memory access.

I'm curious how this guarantee (that docValues are accessed from memory not disk) could possibly exist. I think the only way that this could be guaranteed is for Lucene to keep docValues data in the heap, but using docValues is supposed to *reduce* heap requirements, not increase them, so I don't think that's going to happen. If the data's not in the heap, then you're reliant on the OS disk cache as to whether or not the data is in memory, and that would be the case either way. Do I have an incorrect understanding of how this works?

As I understand it, the potential advantage to docValues over stored data is two-fold:

1) docValues are accessed differently because all the values for one field across the entire Lucene segment are in one place. This can be a good thing or a bad thing depending on the query and the data characteristics, and it may not be obvious which way that will go.

2) docValues data is not compressed, so there's less CPU required. In cases where OS disk caching is insufficient and the compression ratio is really good, stored data might actually be faster.

Thanks,
Shawn
Re: Howto verify that only docValues are returned
See: SOLR-8344 and the JIRAs linked for a pretty extensive discussion.

Note, you can force some of this in some other versions by specifying useDocValuesAsStored (since 5.5).

How you'd verify I'm not quite sure. That kind of information isn't put in the logs.

On Tue, Oct 17, 2017 at 1:09 AM, Julian Ohrt wrote:
> The Solr 6.6 documentation states:
>
> > In cases where the query is returning only docValues fields performance may
> > improve since returning stored fields requires disk reads and decompression
> > whereas returning docValues fields in the fl list only requires memory access.
>
> I want to use this potential performance gain. I think I set up my schema
> correctly.
>
> Is there a way to make certain that only docValues are returned by a query?
>
> I am especially concerned about fields that are docValues but with
> stored=true, for the doc also states:
>
> > Field values retrieved during search queries are typically returned from
> > stored values.
>
> Does this mean that if I have such a field, retrieving as "stored value" is
> preferred over retrieving as docValue?
>
> If so, how can I prevent this behavior?
>
> The major problem here is field "id" which is (by default) a docValue with
> stored=true.
>
> Thanks,
> James
Howto verify that only docValues are returned
The Solr 6.6 documentation states:

> In cases where the query is returning only docValues fields performance may
> improve since returning stored fields requires disk reads and decompression
> whereas returning docValues fields in the fl list only requires memory access.

I want to use this potential performance gain. I think I set up my schema correctly.

Is there a way to make certain that only docValues are returned by a query?

I am especially concerned about fields that are docValues but with stored=true, for the doc also states:

> Field values retrieved during search queries are typically returned from
> stored values.

Does this mean that if I have such a field, retrieving as "stored value" is preferred over retrieving as docValue?

If so, how can I prevent this behavior?

The major problem here is field "id" which is (by default) a docValue with stored=true.

Thanks,
James
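
One partial check that follows from the question above is to inspect the schema attributes of every field in the fl list and flag those that are both stored=true and docValues=true, since those are the ambiguous cases. The sketch below is a hypothetical helper, not part of Solr: it assumes the field definitions have already been obtained as dictionaries with explicit `name`, `stored` and `docValues` keys, and it says nothing about which source Solr actually reads at query time.

```python
def classify_fl_fields(schema_fields, fl):
    """Group the requested fl fields by how their values could be retrieved.

    schema_fields: list of dicts with explicit 'name', 'stored' and
                   'docValues' keys (an assumed, simplified schema shape).
    fl:            list of field names requested in the query.
    """
    by_name = {field["name"]: field for field in schema_fields}
    groups = {"docvalues_only": [], "stored_only": [], "both": [],
              "neither": [], "unknown": []}
    for name in fl:
        field = by_name.get(name)
        if field is None:
            groups["unknown"].append(name)  # not defined in the given schema
            continue
        stored = bool(field["stored"])
        doc_values = bool(field["docValues"])
        if stored and doc_values:
            groups["both"].append(name)  # ambiguous: either source possible
        elif doc_values:
            groups["docvalues_only"].append(name)
        elif stored:
            groups["stored_only"].append(name)
        else:
            groups["neither"].append(name)
    return groups


# Example schema (made up); "id" mirrors the stored + docValues case above.
example_schema = [
    {"name": "id", "stored": True, "docValues": True},
    {"name": "price", "stored": False, "docValues": True},
    {"name": "title", "stored": True, "docValues": False},
]
report = classify_fl_fields(example_schema, ["id", "price", "title", "color"])
```

In this example, `report["both"]` contains `id`, which is exactly the field the question is worried about. Any field that lands in `both` or `stored_only` is a candidate for the stored-field read path, so the fl list would need to avoid those fields or the schema would need changing.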