[jira] [Commented] (LUCENE-8178) Bulk operations for LongValues and Sorted[Set]DocValues

Adrien Grand (JIRA) Tue, 20 Feb 2018 02:30:37 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369904#comment-16369904 ]


Adrien Grand commented on LUCENE-8178:
--------------------------------------

I understand how it can make things faster, but I'm a bit on the fence due to 
how specialized it is (it seems to focus on faceting on MatchAllDocsQuery on an 
index with not too many deletions) vs. the number of additional APIs.

> Bulk operations for LongValues and Sorted[Set]DocValues
> -------------------------------------------------------
>
>                 Key: LUCENE-8178
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8178
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.2.1
>            Reporter: Nikolay Khitrin
>            Priority: Major
>         Attachments: LUCENE-8178-for-solr.patch, LUCENE-8178.patch
>
>
> One-by-one DocValues iteration with {{advanceExact}} and
> {{nextOrd}}/{{ordValue}} is really slow for bulk operations like faceting.
> Reading and unpacking integers in blocks is substantially faster, but
> DocValues can currently be queried only one document at a time.
> To apply document-based bulk processing, {{DocIdSetIterator}} matches have to
> be split into sequential docID runs and remapped to underlying
> {{LongValues}} positions.
> After this transformation, relatively large linear scans can be performed
> over packed integers.
>  
> To do this, two new interfaces
> 1. {{LongValuesCollector}} ({{collectValue(long index, long value)}}).
> 2. {{OrdStatsCollector}} ({{collectOrd(long ord)}}, {{collectMissing(int count)}}).
> and three new functions are introduced
> 1. {{LongValues.forRange(long begin, long end, LongValuesCollector collector)}}
> 2. {{SortedDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer collector)}}
> 3. {{SortedSetDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer collector)}}
> with reference implementations.
> Optimized versions of these functions are provided for:
> 1. {{DirectReader}} for non-32/64 bits per value cases (using {{PackedInts.Decoder}}).
> 2. {{Lucene70DocValuesProducer}} {{getSorted}} and {{getSortedSet}} (both sparse and dense).
>  
> The measured Solr faceting speedup is up to 2-2.5x on a real index.
> A patch for Solr {{DocValuesFacets}} is also provided as a separate file.
>  
> Implementation notes:
>  * {{OrdStatsCollector}} does not accept a document ID because that would ruin
> performance for {{SortedSetDocValues}} due to excessive position lookups.
>  * This patch is fully compatible with the Lucene 7.0 DocValues format.
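
For readers following along, the run-splitting idea from the description can be sketched roughly as below. This is only an illustration, not the code from the attached patches: the class name {{BulkRangeSketch}}, the {{Values}} stand-in for {{LongValues}}, the helper names, and the half-open {{[begin, end)}} range convention are all assumptions. Only the {{collectValue(long index, long value)}} callback shape comes from the issue. It also assumes a dense field where docID equals the value position; the remapping needed for sparse fields is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and range semantics are assumptions, not the patch.
final class BulkRangeSketch {

  /** Callback shape taken from the issue description. */
  interface LongValuesCollector {
    void collectValue(long index, long value);
  }

  /** Minimal stand-in for Lucene's LongValues: random access by position. */
  interface Values {
    long get(long index);
  }

  /** Splits ascending doc IDs into half-open runs [begin, end) of consecutive IDs. */
  static List<long[]> splitIntoRuns(int[] sortedDocIds) {
    List<long[]> runs = new ArrayList<>();
    int i = 0;
    while (i < sortedDocIds.length) {
      int begin = sortedDocIds[i];
      int end = begin + 1;
      i++;
      // Extend the run while the next doc ID is exactly the next integer.
      while (i < sortedDocIds.length && sortedDocIds[i] == end) {
        end++;
        i++;
      }
      runs.add(new long[] {begin, end});
    }
    return runs;
  }

  /** Naive reference forRange: one lookup per position, no block decoding. */
  static void forRange(Values values, long begin, long end, LongValuesCollector collector) {
    for (long index = begin; index < end; index++) {
      collector.collectValue(index, values.get(index));
    }
  }

  /** Sums the values of all matching docs, one linear scan per run. */
  static long sumOverMatches(Values values, int[] sortedDocIds) {
    long[] sum = new long[1];
    for (long[] run : splitIntoRuns(sortedDocIds)) {
      forRange(values, run[0], run[1], (index, value) -> sum[0] += value);
    }
    return sum[0];
  }
}
```

With matches {{0, 1, 2, 5, 6, 9}} this produces three runs, so the collector is driven by three range scans instead of six independent lookups. An optimized {{forRange}} could then decode packed blocks inside each run, which is presumably where the reported gain comes from.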



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

