To expand on Prasanth's answer, in ORC we have both a format version, which is the oldest version of the reader that can read the file (e.g. 0.11 or 0.12), and a writer version, which records which version of the software wrote the file. Writer versions are named after the JIRAs that made significant changes to the writer (e.g. original, HIVE-8732, HIVE-4243, HIVE-12055, HIVE-13083, and now ORC-101). The reader uses the writer version to work around issues like this.
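As a rough illustration of how a reader could use the writer version here, the sketch below picks the character encoding for bloom filter checks based on the file's writer version. The enum, the `includes` helper, and `bloomFilterCharset` are hypothetical stand-ins written for this example, not the actual ORC classes; only the version names and their order come from the list above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BloomFilterCharsetSketch {

    // Hypothetical stand-in for the writer versions named above, oldest first.
    enum WriterVersion {
        ORIGINAL, HIVE_8732, HIVE_4243, HIVE_12055, HIVE_13083, ORC_101;

        // True when this version is at least as new as `other`.
        boolean includes(WriterVersion other) {
            return ordinal() >= other.ordinal();
        }
    }

    // Assumption for illustration: files written at ORC_101 or later hashed
    // strings as UTF-8; older files used the writer's default encoding, so the
    // reader's default encoding is used as the best available guess.
    static Charset bloomFilterCharset(WriterVersion fileVersion) {
        return fileVersion.includes(WriterVersion.ORC_101)
            ? StandardCharsets.UTF_8
            : Charset.defaultCharset();
    }

    public static void main(String[] args) {
        for (WriterVersion v : WriterVersion.values()) {
            System.out.println(v + " -> " + bloomFilterCharset(v));
        }
    }
}
```

The alternative mentioned in the thread, always answering "maybe" for old files, would replace the default-encoding branch with a path that skips the bloom filter check entirely.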
.. Owen

On Wed, Sep 7, 2016 at 12:08 PM, Prasanth Jayachandran <[email protected]> wrote:

> +1 to bump up the writer version to facilitate correct ppd for older
> versions.
> Alan - PPD will have to look at the writer version to detect old files.
> Newer files will have writer version as ORC-101.
>
> Thanks
> Prasanth
>
> On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <[email protected]>
> wrote:
>
> I think using the default encoding for the old files is the best option,
> as it will be right 99% of the time. I was wondering how the system would
> know whether or not this was an old file.
>
> Alan.
>
> > On Sep 7, 2016, at 10:06, Owen O'Malley wrote:
> >
> > 4 is about when you are using the bloom filter for predicate push down.
> > I'm saying old files should use the default encoding when checking the
> > bloom filter. The other option is to always have the predicate push down
> > say maybe if the file is an old one.
> >
> > .. Owen
> >
> > On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates wrote:
> >
> >> +1 to 1-3. On 4, what do you mean by test? Assume it’s the default
> >> encoding and use that? Is there a versioning concept in the bloom
> >> filters that will make it easy to determine if this is pre or post
> >> ORC-101?
> >>
> >> Alan.
> >>
> >>> On Sep 7, 2016, at 08:57, Owen O'Malley wrote:
> >>>
> >>> All,
> >>> Dain Sundstrom pointed out to me in personal email that the ORC bloom
> >>> filters are currently using the default character encoding. That makes
> >>> the bloom filters non-portable between different computers that use
> >>> different default encodings. I've filed ORC-101 to address it, but I
> >>> want to have a wider discussion. I'd propose that we:
> >>>
> >>> 1. create a new WriterVersion for ORC-101.
> >>> 2. move the bloom filter code from storage-api into ORC.
> >>> 3. consistently use UTF-8 when creating new bloom filters
> >>> 4. for ORC files older than ORC-101, test the default encoding instead
> >>>    of UTF-8
> >>>
> >>> Thoughts?
> >>>
> >>> .. Owen
