Re: Bloom filter hash broken

Owen O'Malley Thu, 08 Sep 2016 10:00:19 -0700

Ok, Prasanth found a problem with my proposed approach. In particular, the
old readers would misinterpret bloom filters from new files. Therefore, I'd
like to propose a more complicated solution:
1. We extend the stripe footer or bloom filter index to record the default
encoding when we are writing a string or decimal bloom filter.
2. When reading a bloom filter, we use the encoding if it is present.
3. I'd still like to bump the WriterVersion for ORC-101.
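
To make steps 1 and 2 concrete, here is a minimal sketch of how a reader might pick the charset for string bloom filter keys. It is not ORC code: the class name BloomFilterCharsets, both method names, and the idea that the recorded encoding arrives as a charset name (or null when absent) are assumptions made up for illustration. Falling back to the reader's default charset when nothing is recorded is also an assumption, taken from item 4 of the earlier proposal.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of the ORC API. */
final class BloomFilterCharsets {
    private BloomFilterCharsets() {}

    /**
     * Choose the charset used to turn a string key into bytes before hashing.
     *
     * @param recordedEncoding charset name the writer recorded with the bloom
     *                         filter (assumed representation), or null if the
     *                         file carries no such record
     */
    static Charset charsetFor(String recordedEncoding) {
        if (recordedEncoding != null) {
            // Step 2: use the encoding if it is present.
            return Charset.forName(recordedEncoding);
        }
        // Assumption: with no record, guess the reader's default charset.
        return Charset.defaultCharset();
    }

    /** Bytes that would be fed to the bloom filter hash for this key. */
    static byte[] keyBytes(String value, Charset charset) {
        return value.getBytes(charset);
    }

    /** Shows that one key hashes over different bytes under different charsets. */
    static boolean sameBytes(String value, Charset a, Charset b) {
        return java.util.Arrays.equals(keyBytes(value, a), keyBytes(value, b));
    }
}
```

For a non-ASCII key such as "\u00e9", keyBytes returns two bytes under StandardCharsets.UTF_8 and one byte under StandardCharsets.ISO_8859_1, so a filter built under one charset cannot be probed reliably under the other. Pure ASCII keys encode identically in both, which is why the problem is easy to miss.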


Thoughts?

.. Owen


On Wed, Sep 7, 2016 at 1:08 PM, Owen O'Malley <[email protected]> wrote:

> To expand on Prasanth's answer, in ORC we have both a format version,
> which is the oldest version of the reader that can read the file (e.g. 0.11
> and 0.12), and a writer version, which tracks which version of the software
> wrote the file. Writer versions are named after the jiras that made
> significant changes to the writer (e.g. original, hive-8732, hive-4243,
> hive-12055, hive-13083, and now orc-101). The reader uses the writer
> version to work around issues like this.
>
> .. Owen
>
> On Wed, Sep 7, 2016 at 12:08 PM, Prasanth Jayachandran <
> [email protected]> wrote:
>
>> +1 to bump up the writer version to facilitate correct ppd for older
>> versions.
>> Alan - PPD will have to look at the writer version to detect old files.
>> Newer files will have writer version as ORC-101.
>>
>> Thanks
>> Prasanth
>>
>> On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <[email protected]>
>> wrote:
>>
>> I think using the default encoding for the old files is the best option,
>> as it will be right 99% of the time.  I was wondering how the system would
>> know whether or not this was an old file.
>>
>> Alan.
>>
>> > On Sep 7, 2016, at 10:06, Owen O'Malley  wrote:
>> >
>> > 4 is about when you are using the bloom filter for predicate push down. I'm
>> > saying old files should use the default encoding when checking the bloom
>> > filter. The other option is to always have the predicate push down say
>> > maybe if the file is an old one.
>> >
>> > .. Owen
>> >
>> > On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates  wrote:
>> >
>> >> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
>> >> encoding and use that?  Is there a versioning concept in the bloom filters
>> >> that will make it easy to determine if this is pre or post ORC-101?
>> >>
>> >> Alan.
>> >>
>> >>> On Sep 7, 2016, at 08:57, Owen O'Malley  wrote:
>> >>>
>> >>> All,
>> >>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
>> >>> filters are currently using the default character encoding. That makes the
>> >>> bloom filters non-portable between different computers that use different
>> >>> default encodings. I've filed ORC-101 to address it, but I want to have a
>> >>> wider discussion. I'd propose that we:
>> >>>
>> >>> 1. create a new WriterVersion for ORC-101.
>> >>> 2. move the bloom filter code from storage-api into ORC.
>> >>> 3. consistently use UTF-8 when creating new bloom filters
>> >>> 4. for ORC files older than ORC-101, test the default encoding instead of
>> >>> UTF-8
>> >>>
>> >>> Thoughts?
>> >>>
>> >>> .. Owen
>> >>
>> >>
>
