[ 
https://issues.apache.org/jira/browse/LUCENE-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029770#comment-14029770
 ] 

ASF subversion and git services commented on LUCENE-5748:
---------------------------------------------------------

Commit 1602286 from [~rcmuir] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1602286 ]

LUCENE-5748: Add SORTED_NUMERIC docvalues type

> SORTED_NUMERIC dv type
> ----------------------
>
>                 Key: LUCENE-5748
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5748
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Robert Muir
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5748.patch, LUCENE-5748.patch
>
>
> Currently for Strings you have SORTED and SORTED_SET, capable of single and 
> multiple values per document respectively.
> For multi-numerics, there are only a few choices:
> * encode with NumericUtils into byte[]'s and store with SORTED_SET.
> * encode yourself per-document into BINARY.
> Both of these techniques have problems: 
> SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or 
> faceting counts: most of the bloat in the "terms dict" is compressed away, 
> and it optimizes the case where the data is actually single-valued. But it 
> falls apart performance-wise if you want to do more complex stuff like Solr's 
> analytics component or Elasticsearch's aggregations: the ordinals just get in 
> your way and cause additional work, dereferencing each to a byte[] and then 
> decoding that back to a number. Worst of all, any mathematical calculations 
> are off because it discards frequency (deduplicates).
> Using your own custom encoding in BINARY removes the unnecessary ordinal 
> dereferencing, but you pay for it in compression and access: you have no real 
> choice but to do something like vInt within each byte[] for the doc, which 
> means even basic sorting (e.g. max) is slow as it's not constant time. There 
> is no chance for the codec to optimize things like dates with GCD compression 
> or optimize the single-valued case because it's just an opaque byte[].
> So I think it would be good to explore a simple long[] type that solves these 
> problems.
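
The trade-offs described above can be illustrated with a small plain-Java sketch. It does not use the Lucene API: the class and method names (MultiNumericSketch, averageDeduplicated, encodeVInts, and so on) are hypothetical and only model the three representations. A deduplicated set stands in for SORTED_SET, a vInt-packed byte[] stands in for a custom BINARY encoding, and a sorted long[] that keeps duplicates stands in for the proposed type. The vInt layout is assumed to be the usual 7-bits-per-byte scheme and is limited to non-negative values.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Lucene.
final class MultiNumericSketch {

    // SORTED_SET-like view: values are deduplicated, so frequency is lost
    // and an average computed from it is wrong when values repeat.
    static double averageDeduplicated(long[] values) {
        TreeSet<Long> unique = new TreeSet<>();
        for (long v : values) {
            unique.add(v);
        }
        long sum = 0;
        for (long v : unique) {
            sum += v;
        }
        return (double) sum / unique.size();
    }

    // long[]-like view: per-document values kept sorted, duplicates retained.
    static long[] sortedWithDuplicates(long[] values) {
        long[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    static double average(long[] sorted) {
        long sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        return (double) sum / sorted.length;
    }

    // With a sorted array, min and max are constant-time lookups.
    static long min(long[] sorted) {
        return sorted[0];
    }

    static long max(long[] sorted) {
        return sorted[sorted.length - 1];
    }

    // BINARY-like view: non-negative values packed as variable-length ints.
    static byte[] encodeVInts(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("sketch handles non-negative values only");
            }
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80)); // low 7 bits plus continuation bit
                v >>>= 7;
            }
            out.write((int) v);
        }
        return out.toByteArray();
    }

    // Finding the max requires decoding every value: linear, not constant time.
    // Returns Long.MIN_VALUE for an empty array.
    static long maxFromVInts(byte[] bytes) {
        long max = Long.MIN_VALUE;
        int pos = 0;
        while (pos < bytes.length) {
            long value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            max = Math.max(max, value);
        }
        return max;
    }

    public static void main(String[] args) {
        long[] raw = {5, 200, 5};
        long[] sorted = sortedWithDuplicates(raw);
        System.out.println("deduplicated average: " + averageDeduplicated(raw));
        System.out.println("true average:         " + average(sorted));
        System.out.println("min/max:              " + min(sorted) + "/" + max(sorted));
        System.out.println("max via vInt decode:  " + maxFromVInts(encodeVInts(raw)));
    }
}
```

For the document values 5, 200, 5 the deduplicated view averages 5 and 200 only, while the sorted array keeps both 5s and also exposes min and max as its first and last elements without any decoding.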



--
This message was sent by Atlassian JIRA
(v6.2#6252)

