Re: "For dictionary encodings the dictionary is sorted"

Dain Sundstrom Mon, 12 Dec 2016 16:48:58 -0800

On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omal...@apache.org> wrote:
> 
>> Is it a requirement that the dictionary be sorted or a suggestion?
> 
> It is a requirement, although we can discuss weakening it.
> 
> The SargApplier doesn't currently use the sorted nature of the
> dictionaries, but it should. In particular, it should map sarg predicates
> for strings into the dictionary entries using binary search.


In that case we should definitely document the sort order for the dictionary 
items.
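
To make the dependency concrete, here is a minimal sketch of the lookup, assuming the dictionary is sorted by unsigned UTF-8 byte order. That ordering is an assumption for illustration, not what the format currently specifies, and the class and method names are hypothetical rather than taken from the ORC code. The search only finds entries if the reader compares with the same order the writer sorted with.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of ORC: binary search over a sorted dictionary.
class SortedDictionaryLookup {
    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xff) - (b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Builds a dictionary sorted with the ASSUMED order (unsigned UTF-8 bytes).
    static byte[][] sortedDictionary(String... values) {
        byte[][] dict = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            dict[i] = utf8(values[i]);
        }
        Arrays.sort(dict, SortedDictionaryLookup::compareBytes);
        return dict;
    }

    // Returns the index of key in dict, or -1 if it is absent.
    // Correct only if dict was sorted with compareBytes.
    static int find(byte[][] dict, byte[] key) {
        int lo = 0;
        int hi = dict.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareBytes(dict[mid], key);
            if (c < 0) {
                lo = mid + 1;
            } else if (c > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```

An equality predicate on a string column could then be answered with one `find` call per stripe instead of a scan of the dictionary.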

> The problem with sorting the dictionary is of course that it makes the
> writer keep all of the values deserialized until the end of the stripe.
> I've considered using a secondary stream that stores the sort order of each
> dictionary item. Thoughts?

You will need the uncompressed values in memory to perform the lookup in the 
hash table (the equals call).
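
A rough sketch of what I mean, with hypothetical names and a plain `HashMap` standing in for the writer's real dictionary structure. The values stay in memory as map keys either way, because every `add` has to compare the incoming value against a stored one. The `sortOrder` method shows what a secondary "sort order" stream might contain if the dictionary itself were kept in insertion order. Note that this sketch sorts with `String.compareTo`, which is only one possible choice of order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the ORC writer: dictionary kept in insertion order.
class InsertionOrderDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    // Returns the id for value, assigning the next free id on first sight.
    int add(String value) {
        // The lookup hashes the value and then calls equals() on a stored key,
        // so the stored keys must be available in memory.
        Integer id = ids.get(value);
        if (id == null) {
            id = values.size();
            ids.put(value, id);
            values.add(value);
        }
        return id;
    }

    // Contents of the assumed side stream: result[rank] = insertion id of the
    // entry that would sit at position rank in sorted order.
    int[] sortOrder() {
        Integer[] order = new Integer[values.size()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> values.get(a).compareTo(values.get(b)));
        int[] result = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

With this layout the row data can be written with insertion ids as values arrive, but the permutation still cannot be computed until the end of the stripe.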

>> I believe the current implementation is using Java String
> 
> No, the dictionary has always used UTF-8.

I meant that the sorting of the dictionary seems to be UTF-16 BE.  Is that not 
correct?
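
The two orders only disagree when a supplementary character (a surrogate pair in UTF-16) is compared against a BMP character in the range U+E000 to U+FFFF. A small sketch of the difference follows. The class name is made up for illustration, and `String.compareTo` is used as the UTF-16 code unit comparison.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows one pair of strings whose relative order differs
// between UTF-16 code unit order and unsigned UTF-8 byte order.
class Utf16VersusUtf8Order {
    // Compares the UTF-8 encodings of a and b as unsigned bytes.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = (x[i] & 0xff) - (y[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF61";                 // U+FF61, a single code unit
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair

        // UTF-16: 0xFF61 is greater than the high surrogate 0xD83D.
        System.out.println("UTF-16 puts U+FF61 after U+1F600: "
            + (bmp.compareTo(supplementary) > 0));

        // UTF-8: EF BD A1 sorts before F0 9F 98 80.
        System.out.println("UTF-8 puts U+FF61 before U+1F600: "
            + (compareUtf8(bmp, supplementary) < 0));
    }
}
```

Both lines print `true`, so a dictionary or a min/max statistic built under one order can be wrong when read under the other.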

>> I think this should also be documented in the statistics section which
>> also uses UTF-16 BE, which is at least consistent, but still annoying for
>> everything other than Java.
> 
> Yes, it should be documented and we should replace it with UTF-8. (Although
> changes to the serialized form are always painful.)

I think we can do something similar to the bloom filter code, where we add a 
StringUtf8Stats object and have a transition period where we can produce both.

-dain
