[ 
https://issues.apache.org/jira/browse/SOLR-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031910#comment-15031910
 ] 

Yonik Seeley commented on SOLR-8220:
------------------------------------

bq. From a performance perspective, reading values from DocValues always (if 
they exist) can be horrible because each field access in docvalues may need a 
random disk seek, whereas, all stored fields for a document are kept together 
and need only 1 random seek and a sequential block read.

A few points:
- stored fields also require decompression (more overhead)
- use of stored fields and docvalues at the same time is less memory efficient 
- the stored fields will also take up needed disk cache (although hopefully the 
OS will figure out which it should cache more aggressively)
- presumably one has docvalues because they need to be used, and they need to 
be fast... i.e. they already need to be cached.
- if one has a small set of fields that are normally retrieved, it seems like a 
win again.
- a *very* common case these days is that the entire index fits in memory.
- we're in the SSD era, and multiple "seeks" will still be more expensive if 
not cached, but much less so (and less so over time as non-volatile storage 
keeps improving)

It seems like this should be a big win for the common case, and the ability to 
reindex your data or change config and not have to change the clients is 
important IMO.  It's like being able to reindex a date to a trie-date and have 
the clients not care.  We can already reindex a field as docValues, and sort, 
facet, do analytics, without changing client requests.  Optimizations to field 
value retrieval (or optionally removing redundantly stored data) should be the 
same.


> Read field from docValues for non stored fields
> -----------------------------------------------
>
>                 Key: SOLR-8220
>                 URL: https://issues.apache.org/jira/browse/SOLR-8220
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Keith Laban
>         Attachments: SOLR-8220-ishan.patch, SOLR-8220-ishan.patch, 
> SOLR-8220-ishan.patch, SOLR-8220-ishan.patch, SOLR-8220.patch, 
> SOLR-8220.patch, SOLR-8220.patch, SOLR-8220.patch, SOLR-8220.patch, 
> SOLR-8220.patch, SOLR-8220.patch
>
>
> Many times a value will be both stored="true" and docValues="true" which 
> requires redundant data to be stored on disk. Since reading from docValues is 
> both efficient and a common practice (facets, analytics, streaming, etc), 
> reading values from docValues when a stored version of the field does not 
> exist would be a valuable disk usage optimization.
> The only caveat with this that I can see would be for multiValued fields as 
> they would always be returned sorted in the docValues approach. I believe 
> this is a fair compromise.
> I've done a rough implementation for this as a field transform, but I think 
> it should live closer to where stored fields are loaded in the 
> SolrIndexSearcher.
> Two open questions/observations:
> 1) There doesn't seem to be a standard way to read values for docValues; 
> facets, analytics, streaming, etc. all seem to do it their own way. Perhaps 
> some of this logic should be centralized.
> 2) What will the API behavior be? (Below is my proposed implementation)
> Parameters for fl:
> - fl="docValueField"
>   -- return the field from docValues if it is not stored but has docValues; 
> if the field is stored, return it from stored fields
> - fl="*"
>   -- return only stored fields
> - fl="+"
>    -- return stored fields and docValue fields
> 2a - would be easiest implementation and might be sufficient for a first 
> pass. 2b - is current behavior
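
The proposed fl behavior quoted above can be read as a small decision table. Below is a minimal Java sketch of that table, assuming nothing about the actual patch: FieldListSketch, FieldDef, Source, resolve and sampleSchema are hypothetical names made up for illustration and are not Solr classes or APIs. It only decides which fields would be returned and from which source; it does not read an index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldListSketch {
    /** Where a returned value would be read from. */
    public enum Source { STORED, DOC_VALUES }

    /** Hypothetical stand-in for a schema field; this is not Solr's SchemaField. */
    public static final class FieldDef {
        final String name;
        final boolean stored;
        final boolean docValues;

        public FieldDef(String name, boolean stored, boolean docValues) {
            this.name = name;
            this.stored = stored;
            this.docValues = docValues;
        }
    }

    /**
     * Applies the proposed rules to an fl value:
     *   explicit name: stored fields come from stored data; a non-stored
     *                  field with docValues comes from docValues
     *   "*":           stored fields only
     *   "+":           stored fields plus non-stored docValues fields
     */
    public static Map<String, Source> resolve(String fl, List<FieldDef> schema) {
        Set<String> tokens = new HashSet<>(Arrays.asList(fl.trim().split("[,\\s]+")));
        boolean star = tokens.contains("*");
        boolean plus = tokens.contains("+");

        Map<String, Source> out = new LinkedHashMap<>();
        for (FieldDef f : schema) {
            boolean named = tokens.contains(f.name);
            if (f.stored && (named || star || plus)) {
                out.put(f.name, Source.STORED);
            } else if (!f.stored && f.docValues && (named || plus)) {
                out.put(f.name, Source.DOC_VALUES);
            }
            // A field that is neither stored nor docValues cannot be returned.
        }
        return out;
    }

    /** Made-up example schema covering all four stored/docValues combinations. */
    public static List<FieldDef> sampleSchema() {
        List<FieldDef> schema = new ArrayList<>();
        schema.add(new FieldDef("id", true, false));
        schema.add(new FieldDef("price", false, true));
        schema.add(new FieldDef("title", true, true));
        schema.add(new FieldDef("internal", false, false));
        return schema;
    }

    public static void main(String[] args) {
        List<FieldDef> schema = sampleSchema();
        System.out.println(resolve("price", schema)); // {price=DOC_VALUES}
        System.out.println(resolve("title", schema)); // {title=STORED}
        System.out.println(resolve("*", schema));     // {id=STORED, title=STORED}
        System.out.println(resolve("+", schema));     // {id=STORED, price=DOC_VALUES, title=STORED}
    }
}
```

In this sketch a field that is both stored and docValues is always served from stored data, which matches the "if the field is stored return it from stored fields" rule in the description; whether docValues should be preferred in that case is the performance question discussed in the comment above.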



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
