Dain,

That is a great point. I wasn't thinking about having to implement that in C++, where it would really suck. (It was hard enough dealing with the timezones in C++, so I should know better!) I got the approach from this morning working and pushed it as a branch: https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
The heaviest hammer would be to just create a new file version, but that seems like overkill for this. (Although sooner or later we will get there, since we need new encodings for decimal.) Ok, so how about:

1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8 based bloom filters. It will be used for strings, chars, varchars, and decimal. (Hopefully all charsets use the same bytes for ASCII characters, but I don't want to find the strange exceptions.)
2. We create a new config knob that lets you write both BLOOM_FILTER and BLOOM_FILTER_UTF8 for users while they are transitioning.
3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER if it is an old file. (A rough sketch of that fallback follows the quoted message below.)

Thoughts?

.. Owen

On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <[email protected]> wrote:

> > On Sep 8, 2016, at 9:59 AM, Owen O'Malley <[email protected]> wrote:
> >
> > Ok, Prasanth found a problem with my proposed approach. In particular, the
> > old readers would misinterpret bloom filters from new files. Therefore, I'd
> > like to propose a more complicated solution:
> > 1. We extend the stripe footer or bloom filter index to record the default
> > encoding when we are writing a string or decimal bloom filter.
> > 2. When reading a bloom filter, we use the encoding if it is present.
>
> Does that mean that you always write with the platform encoding? This
> would make using bloom filters for reads in other programming languages
> difficult, because you would need to convert from UTF_8 to some arbitrary
> character encoding. This will also make using these bloom filters in
> performance-critical sections (join loops) computationally expensive, as
> you have to do a transcode.
>
> Also, I think the spec needs to be clarified. The spec does not state the
> character encoding of the bloom filters. I assumed it was UTF_8 to match
> the normal string column encoding. It looks like the spec does not
> document the meaning of "the version of the writer" and what workarounds
> are necessary (or what operating assumptions have been made). Once we have
> that, we should document that old readers assume that the platform default
> charset is consistent between readers and writers.
>
> As an alternative, for new files we could add a new stream ID, so that
> old readers skip them.
>
> -dain
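To make point 3 of the proposal above concrete, here is a minimal sketch of the reader-side preference. This is illustrative only: the class name, method name, and the enum (other than the existing BLOOM_FILTER kind) are made-up names for this email, not the actual ORC reader API.

    import java.util.List;

    public class BloomFilterStreamChooser {

      // Subset of stream kinds relevant to the proposal (illustrative enum).
      enum StreamKind {
        BLOOM_FILTER,       // existing stream; strings hashed with the writer's default charset
        BLOOM_FILTER_UTF8   // proposed stream; strings always hashed as UTF-8 bytes
      }

      // New readers prefer BLOOM_FILTER_UTF8; if the file only has the old
      // BLOOM_FILTER stream, fall back to it. Returns null when the column
      // has no bloom filter at all.
      static StreamKind chooseBloomFilterStream(List<StreamKind> columnStreams) {
        if (columnStreams.contains(StreamKind.BLOOM_FILTER_UTF8)) {
          return StreamKind.BLOOM_FILTER_UTF8;
        }
        if (columnStreams.contains(StreamKind.BLOOM_FILTER)) {
          return StreamKind.BLOOM_FILTER;
        }
        return null;
      }
    }

Under the transitional config knob in point 2, a file could contain both stream kinds for the same column, and this preference order means new readers always get the UTF-8 based filter while old readers keep using the stream they already understand.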
