Ok, Prasanth found a problem with my proposed approach. In particular, the old readers would misinterpret bloom filters from new files. Therefore, I'd like to propose a more complicated solution:

1. We extend the stripe footer or bloom filter index to record the default encoding when we are writing a string or decimal bloom filter.
2. When reading a bloom filter, we use the encoding if it is present.
3. I'd still like to bump the WriterVersion for ORC-101.
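
To make steps 1 and 2 concrete, here is a rough sketch of the read-side choice. It is illustrative Python, not ORC code: `bloom_filter_charset` and `key_bytes` are hypothetical names, and falling back to the platform default when no encoding was recorded is an assumption taken from the earlier discussion about old files.

```python
import locale
from typing import Optional


def bloom_filter_charset(recorded_encoding: Optional[str]) -> str:
    """Pick the charset used to turn a string key into bytes before hashing."""
    if recorded_encoding:
        # Step 2: the writer recorded its encoding, so the reader uses it.
        return recorded_encoding
    # Assumption: nothing recorded means an older file, so fall back to the
    # reader's platform default, as the thread suggests for old files.
    return locale.getpreferredencoding(False)


def key_bytes(value: str, recorded_encoding: Optional[str]) -> bytes:
    """Encode a predicate value the same way the writer encoded its keys."""
    return value.encode(bloom_filter_charset(recorded_encoding))


# The same string produces different bytes under different encodings, which
# is why a filter built on one machine can miss on another.
utf8_key = key_bytes("\u00e9", "UTF-8")
latin1_key = key_bytes("\u00e9", "ISO-8859-1")
print(utf8_key != latin1_key)
```

The point of recording the encoding is that the bytes fed to the hash no longer depend on where the reader happens to run.
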
Thoughts?

.. Owen

On Wed, Sep 7, 2016 at 1:08 PM, Owen O'Malley <[email protected]> wrote:

> To expand on Prasanth's answer, in ORC we have both a format version,
> which is oldest version of the reader that can read the file (eg 0.11 and
> 0.12), and the writer version, which keeps track of which version of the
> software that wrote the file denoted by the jiras where there are
> significant changes in the writer (eg. original, hive-8732, hive-4243,
> hive-12055, hive-13083, and now orc-101). The reader uses the writer
> version to work around issues like this.
>
> .. Owen
>
> On Wed, Sep 7, 2016 at 12:08 PM, Prasanth Jayachandran <[email protected]> wrote:
>
>> +1 to bump up the writer version to facilitate correct ppd for older
>> versions.
>> Alan - PPD will have to look at the writer version to detect old files.
>> Newer files will have writer version as ORC-101.
>>
>> Thanks
>> Prasanth
>>
>> On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <[email protected]>
>> wrote:
>>
>> I think using the default encoding for the old files is the best option,
>> as it will be right 99% of the time. I was wondering how the system would
>> know whether or not this was an old file.
>>
>> Alan.
>>
>> > On Sep 7, 2016, at 10:06, Owen O'Malley wrote:
>> >
>> > 4 is about when you are using the bloom filter for predicate push down. I'm
>> > saying old files should use the default encoding when checking the bloom
>> > filter. The other option is to always have the predicate push down say
>> > maybe if the file is an old one.
>> >
>> > .. Owen
>> >
>> > On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates wrote:
>> >
>> >> +1 to 1-3. On 4, what do you mean by test? Assume it’s the default
>> >> encoding and use that? Is there a versioning concept in the bloom filters
>> >> that will make it easy to determine if this is pre or post ORC-101?
>> >>
>> >> Alan.
>> >>
>> >>> On Sep 7, 2016, at 08:57, Owen O'Malley wrote:
>> >>>
>> >>> All,
>> >>> Dain Sundstrom pointed out to me in personal email that the ORC bloom
>> >>> filters are currently using the default character encoding. That makes the
>> >>> bloom filters non-portable between different computers that use different
>> >>> default encodings. I've filed ORC-101 to address it, but I want to have a
>> >>> wider discussion. I'd propose that we:
>> >>>
>> >>> 1. create a new WriterVersion for ORC-101.
>> >>> 2. move the bloom filter code from storage-api into ORC.
>> >>> 3. consistently use UTF-8 when creating new bloom filters
>> >>> 4. for ORC files older than ORC-101, test the default encoding instead of
>> >>> UTF-8
>> >>>
>> >>> Thoughts?
>> >>>
>> >>> .. Owen
