On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omal...@apache.org> wrote:
>
>> Is it a requirement that the dictionary be sorted or a suggestion?
>
> It is a requirement, although we can discuss weakening it.
>
> The SargApplier doesn't currently use the sorted nature of the
> dictionaries, but it should. In particular, it should map sarg predicates
> for strings into the dictionary entries using binary search.

In that case we should definitely document the sort order for the dictionary items.

> The problem with sorting the dictionary is of course that it makes the
> writer keep all of the values deserialized until the end of the stripe.
> I've considered using a secondary stream that stores the sort order of each
> dictionary item. Thoughts?

You will need the uncompressed values in memory to perform the lookup in the hash table (the equals call).

>> I believe the current implementation is using Java String
>
> No, the dictionary has always used UTF-8.

I meant that the sorting of the dictionary seems to be UTF-16 BE. Is that not correct?

>> I think this should also be documented in the statistics section which
>> also uses UTF-16 BE, which is at least consistent, but still annoying for
>> everything other than Java.
>
> Yes, it should be documented and we should replace it with UTF-8. (Although
> changes to the serialized form are always painful.)

I think we can do something similar to the bloom filter code, where we add a StringUtf8Stats object and have a transition period where we can produce both.

-dain
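
To make the sort-order concern concrete, here is a minimal sketch of why a binary search over the dictionary only works if the reader compares strings the same way the writer sorted them. It assumes, as suspected above but not confirmed, that the writer sorts with Java's String.compareTo (UTF-16 code unit order) while a non-Java reader compares raw UTF-8 bytes. The class and method names are hypothetical and do not come from the ORC code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch; none of these names exist in ORC.
final class DictionaryOrderSketch {

    // Orders strings by their UTF-8 encoding, treating each byte as unsigned.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = (x[i] & 0xff) - (y[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return x.length - y.length;
    }

    // Sorts a copy by String.compareTo, i.e. by UTF-16 code unit.
    static String[] sortedUtf16(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy by unsigned UTF-8 byte order.
    static String[] sortedUtf8(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy, DictionaryOrderSketch::compareUtf8);
        return copy;
    }

    // Binary search for a predicate literal; a negative result means "not found".
    static int find(String[] sortedDictionary, String key, Comparator<String> order) {
        return Arrays.binarySearch(sortedDictionary, key, order);
    }

    public static void main(String[] args) {
        // "a", U+FF5E (fullwidth tilde), U+1F600 (stored as a surrogate pair in Java).
        String[] values = {"a", "\uFF5E", "\uD83D\uDE00"};
        String[] utf16 = sortedUtf16(values); // a, U+1F600, U+FF5E
        String[] utf8 = sortedUtf8(values);   // a, U+FF5E, U+1F600

        // Same order for sorting and searching: the entry is found at index 1.
        System.out.println(find(utf8, "\uFF5E", DictionaryOrderSketch::compareUtf8));
        // UTF-16-sorted dictionary searched with UTF-8 comparisons: negative,
        // even though the value is present.
        System.out.println(find(utf16, "\uFF5E", DictionaryOrderSketch::compareUtf8));
    }
}
```

The two orders only disagree when a supplementary character (above U+FFFF) meets one in the U+E000 to U+FFFF range, which is why a mismatch could go unnoticed with mostly-ASCII data.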