[
https://issues.apache.org/jira/browse/LUCENE-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026380#comment-14026380
]
Adrien Grand commented on LUCENE-5748:
--------------------------------------
+1 I like it!
> SORTED_NUMERIC dv type
> ----------------------
>
> Key: LUCENE-5748
> URL: https://issues.apache.org/jira/browse/LUCENE-5748
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Robert Muir
> Attachments: LUCENE-5748.patch
>
>
> Currently for Strings you have SORTED and SORTED_SET, capable of single and
> multiple values per document respectively.
> For multi-numerics, there are only a few choices:
> * encode with NumericUtils into byte[]'s and store with SORTED_SET.
> * encode yourself per-document into BINARY.
> Both of these techniques have problems:
> SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or
> faceting counts: most of the bloat in the "terms dict" is compressed away,
> and it optimizes the case where the data is actually single-valued. But it
> falls apart performance-wise if you want to do more complex stuff like Solr's
> analytics component or Elasticsearch's aggregations: the ordinals just get in
> your way and cause additional work, deref'ing each to a byte[] and then
> decoding that back to a number. Worst of all, any mathematical calculations
> are off because it discards frequency (deduplicates).
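
A minimal sketch of the frequency problem described above, in plain Java with
no Lucene dependency. The class and method names are hypothetical, and a
TreeSet stands in for the set-like view that ordinals give you: once
duplicates collapse, an average computed over the surviving values is wrong.

```java
import java.util.TreeSet;

final class DedupAverageSketch {
    // Average over a set-like view: duplicates collapse, which is what
    // happens when values are only reachable through unique ordinals.
    static double averageDeduplicated(long[] values) {
        TreeSet<Long> unique = new TreeSet<>();
        for (long v : values) {
            unique.add(v);
        }
        long sum = 0;
        for (long v : unique) {
            sum += v;
        }
        return (double) sum / unique.size();
    }

    // Average over the values as the document actually holds them.
    static double averageWithDuplicates(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    public static void main(String[] args) {
        long[] prices = {10, 10, 10, 40};
        System.out.println(averageDeduplicated(prices));   // 25.0
        System.out.println(averageWithDuplicates(prices)); // 17.5
    }
}
```
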
> Using your own custom encoding in BINARY removes the unnecessary ordinal
> dereferencing, but you trade that for bad compression and access: you have no
> real choice but to do something like vInt within each byte[] for the doc, which
> means even basic sorting (e.g. max) is slow as it's not constant time. There
> is no chance for the codec to optimize things like dates with GCD compression
> or optimize the single-valued case because it's just an opaque byte[].
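
To make the access cost concrete, here is a rough sketch of a hand-rolled
per-document vInt encoding. It is written in plain Java rather than against
Lucene's own I/O classes, the names are hypothetical, and it assumes
non-negative values. Because every value has a variable width, finding the
max means decoding the whole byte[] for the document.

```java
import java.io.ByteArrayOutputStream;

final class VIntBinarySketch {
    // Encode each value with 7 payload bits per byte; a set high bit
    // means "more bytes follow". Assumes non-negative inputs.
    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.write((int) v);
        }
        return out.toByteArray();
    }

    // Finding the max is linear in the encoded length: value boundaries
    // are unknown until the preceding bytes have been decoded.
    static long max(byte[] doc) {
        long max = Long.MIN_VALUE;
        int pos = 0;
        while (pos < doc.length) {
            long value = 0;
            int shift = 0;
            byte b;
            do {
                b = doc[pos++];
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            max = Math.max(max, value);
        }
        return max;
    }

    public static void main(String[] args) {
        byte[] doc = encode(new long[] {3, 300, 70000});
        System.out.println(doc.length); // 6
        System.out.println(max(doc));   // 70000
    }
}
```
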
> So I think it would be good to explore a simple long[] type that solves these
> problems.
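
For illustration only, this is roughly what a per-document sorted long[] buys
you. It is a sketch under assumptions, not the API from the attached patch:
the class name and methods are hypothetical. Duplicates are kept, so averages
stay correct, and min/max are the first and last elements. The static helper
shows the kind of common divisor (e.g. day-granularity dates stored as
milliseconds) that a codec could factor out once it sees real numbers instead
of opaque bytes; it ignores overflow for extreme ranges.

```java
import java.util.Arrays;

final class SortedLongsSketch {
    private final long[] sorted;

    // Keeps every value, duplicates included, in ascending order.
    // Assumes at least one value per document.
    SortedLongsSketch(long... values) {
        sorted = values.clone();
        Arrays.sort(sorted);
    }

    long min() {
        return sorted[0];
    }

    long max() {
        return sorted[sorted.length - 1];
    }

    double average() {
        long sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        return (double) sum / sorted.length;
    }

    // GCD of each value's offset from the smallest value.
    static long gcdOfDeltas(long[] values) {
        long min = Long.MAX_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
        }
        long gcd = 0;
        for (long v : values) {
            gcd = gcd(gcd, v - min);
        }
        return gcd;
    }

    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    public static void main(String[] args) {
        SortedLongsSketch doc = new SortedLongsSketch(40, 10, 10, 10);
        System.out.println(doc.min() + " " + doc.max() + " " + doc.average());
    }
}
```
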