Thanks Gabor. It's never too late to make it better. We don't have to rush it; it has already been in development for a long time. :)
The thrift file does indeed lag a bit behind the spec. As the spec defines, the bloom filter data is stored near the footer, which means we don't have to handle it like a page. Therefore, I have just opened a Jira to remove bloom_filter_page_header from the PageHeader structure, while BloomFilterHeader is kept intentionally for convenience. Since the spec and the thrift file should eventually be aligned with each other, the vote targets both of them.

On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky <[email protected]> wrote:
>
> Hi Junjie,
>
> Sorry for bringing up this a bit late but I have some problems with the
> format update. The parquet.thrift file is updated to have the bloom filters
> as a page (just as dictionaries and data pages). Meanwhile, the spec
> (BloomFilter.md) says that the bloom filter is stored near the footer. So,
> if the bloom filter is not part of the row-groups (like column indexes) I
> would not add it as a page. See the struct ColumnIndex in the thrift file.
> This struct is not referenced anywhere in it only declared. It was done
> this way because we don't parse it in the same way as we parse the pages.
>
> Currently, I am not 100% sure about the target of this vote. If it is a
> vote about adding bloom filters in general then it is a +1 (binding). If it
> is about adding the bloom filters to parquet-format as is then, it is a -1
> (binding) until we fix the issue above.
>
> Regards,
> Gabor
>
> On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <[email protected]> wrote:
>
> > +1 (non-binding)
> >
> > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <[email protected]>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <[email protected]> wrote:
> > >
> > > > Dear Parquet developers
> > > >
> > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > your time.
> > > >
> > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <[email protected]> wrote:
> > > > >
> > > > > I see, will resume this next week. Thanks.
> > > > >
> > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi <[email protected]> wrote:
> > > > > >
> > > > > > Hi Junjie,
> > > > > >
> > > > > > Since there are ongoing improvements addressing review comments, I
> > > > > > would hold off with the vote for a few more days until the
> > > > > > specification settles.
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Parquet committers and developers
> > > > > > >
> > > > > > > We are waiting for your important ballot:)
> > > > > > >
> > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > > > > official benchmark from xxhash site (http://www.xxhash.com/)
> > > > > > > > and published comparison from smhasher project
> > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > >
> > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Do you have any benchmark data to support the choice of hash
> > > > > > > > > function?
> > > > > > > > >
> > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Dear Parquet developers
> > > > > > > > > >
> > > > > > > > > > To simplify the voting, I'd like to update voting content
> > > > > > > > > > to the spec with xxHash hash strategy. Now you can reply
> > > > > > > > > > with +1 or -1.
> > > > > > > > > >
> > > > > > > > > > Thanks for your participation.
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > >
> > > > > > > > > > > Parquet Bloom filter has been developed for a while, per
> > > > > > > > > > > the discussion on the mail list, it's time to call a vote
> > > > > > > > > > > for spec to move forward. The current spec can be found at
> > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > There are some different options about the internal hash
> > > > > > > > > > > choice of Bloom filter and the PR is for that concern.
> > > > > > > > > > >
> > > > > > > > > > > So I'd like to propose to vote the spec + hash option,
> > > > > > > > > > > for example:
> > > > > > > > > > >
> > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > Please help to vote, any feedback is also welcome in the
> > > > > > > > > > > discussion thread.
> > > > > > > > > > >
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > > --
> > > > Thanks & Best Regards

--
Thanks & Best Regards
