Hi Vimal,
   Regarding your concern: if you have tried the bloom datamap, you may know 
how difficult it is to configure the bloom parameters. You never know how 
many (distinct) elements will be added to the bloom filter, because a blocklet 
is configured by size: the more bytes a row takes, the fewer rows fit in a 
blocklet. At the block level this also depends on the block size configuration. 
Also, please keep in mind the size of the bloom filter itself.
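To make the sizing problem concrete, here is a small sketch using the standard bloom filter formulas (bits m = -n·ln p / (ln 2)², hashes k = (m/n)·ln 2). This is illustration only, not CarbonData code; the element counts and false-positive targets per level below are made-up assumptions, not project defaults.

```python
import math

def bloom_parameters(n, p):
    """Estimate bloom filter size (bits) and hash count for
    n expected distinct elements and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits needed
    k = max(1, round((m / n) * math.log(2)))              # optimal hash count
    return m, k

# Hypothetical element counts per level: a coarse filter at block level,
# a tighter one at blocklet level, mirroring the multi-level idea in
# this thread. Neither n nor p here comes from CarbonData.
for level, n, p in [("block", 1_000_000, 0.05),
                    ("blocklet", 32_000, 0.01)]:
    m, k = bloom_parameters(n, p)
    print(f"{level}: {m} bits (~{m // 8 // 1024} KiB), {k} hash functions")
```

The point the formulas make is the one raised above: m scales linearly with n, and n per blocklet is unknown ahead of time because blocklets are cut by size, so any fixed configuration will be wrong for some tables.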


On 2019/11/26 08:24:33, Vimal Das Kammath <vimaldas.kamm...@gmail.com> wrote: 
> I agree with ravindra that having a bloom filter at page level would not save
> any IO. Having a bloom filter at file level makes sense, as it could help to
> prune files on the driver side. But I am concerned about the number of false
> positives that would result if we keep one bloom filter for an entire file.
> I think we need to experiment to find the ideal parameters (bloom size and
> number of hash functions) that would work effectively for a file-level
> bloom filter.
> 
> Regards,
> Vimal
> 
> On Tue, Nov 26, 2019 at 12:30 PM ravipesala <ravi.pes...@gmail.com> wrote:
> 
> > Hi Manhua,
> >
> > The main problem with this approach is that we cannot save any IO, since our
> > IO unit is the blocklet, not the page. Once the data is already in memory, I
> > really don’t think we can gain performance from a bloom filter at page
> > level. I feel the solution is efficient only if IO is saved somewhere.
> >
> > Our min/max index is efficient because it can prune files on the driver
> > side and prune blocklets and pages on the executor side. It actually
> > saves lots of IO.
> >
> > Supporting bloom at the carbondata file and index level is a better
> > approach than supporting it only at the page level. My intention is that it
> > should behave just the same as the min/max index, so that we can prune the
> > data at multiple levels.
> >
> > On the driver side, at the block level, we can use a bloom filter with
> > lower precision and fewer hash functions to control the size, since we load
> > it into memory. At the blocklet level we can increase the precision and the
> > number of hashes a little for better pruning, and at the page level we can
> > increase the precision further for much better pruning ability.
> >
> >
> > Regards,
> > Ravindra.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
> 
