Re: Bloom Filters?

Ryan Blue Tue, 05 Mar 2019 10:11:30 -0800

+Iceberg Dev List <dev@iceberg.apache.org>, the project has moved to Apache.

Hi Josh,

We have considered bloom filters. There are some cases where they could be
useful, but there are generally better ways to accomplish the same task.

I typically recommend sorting on an ID field to take advantage of the lower
and upper bounds that Iceberg already supports. In addition to the range
bounds, this also maximizes the likelihood that Parquet dictionaries are
used for encoding. When dictionaries are available, that's better than a
bloom filter because it is already generated and stored; plus it can be used
for filtering without any false positives. Sorting in Spark also handles
skew, which is a great bonus.
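
To illustrate why sorting makes the lower and upper bounds effective, here is
a minimal Python sketch. It is not Iceberg's API: `FileStats`, `write_files`,
and `files_to_scan` are hypothetical names, and the file size of 100 rows is an
arbitrary assumption. It only models per-file min/max stats and a point lookup.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Hypothetical stand-in for the per-file column bounds kept in metadata."""
    path: str
    lower: int
    upper: int


def write_files(ids: List[int], rows_per_file: int) -> List[FileStats]:
    """Split ids into files in the given order and record min/max per file."""
    files = []
    for start in range(0, len(ids), rows_per_file):
        chunk = ids[start:start + rows_per_file]
        files.append(FileStats(f"file-{start // rows_per_file}", min(chunk), max(chunk)))
    return files


def files_to_scan(files: List[FileStats], target: int) -> List[FileStats]:
    """Keep only files whose [lower, upper] range could contain the target."""
    return [f for f in files if f.lower <= target <= f.upper]


ids = list(range(1000))
shuffled = ids[:]
random.Random(0).shuffle(shuffled)

sorted_files = write_files(sorted(shuffled), rows_per_file=100)
unsorted_files = write_files(shuffled, rows_per_file=100)

sorted_hits = files_to_scan(sorted_files, 423)
unsorted_hits = files_to_scan(unsorted_files, 423)

print("files scanned when sorted:  ", len(sorted_hits))
print("files scanned when unsorted:", len(unsorted_hits))
```

With sorted input, each file covers a disjoint ID range, so the lookup is
narrowed to a single file. With unsorted input, most files span nearly the
whole ID range and the bounds prune very little.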

The use case where bloom filters can provide value is when you have a high
rate of unique values (like a UUID used to identify a record) and cannot
sort the data because of its volume and how soon it needs to be available
for downstream consumption. Iceberg also erodes this use case because you can
sort the data in the background and atomically swap the unsorted data for
sorted data. Because you can safely swap in data you've optimized, the data
is available quickly and only a small portion of the new data takes a long
time to scan through before it is optimized for reads.

I think that the remaining use case for bloom filters is a narrow one. If
you'd still like to work on it, we can think through what we would need to
add to the spec. Bloom filters are too large to add to the existing
metadata structures, like manifests, but we could add an index location for
each file that stores a bloom filter separately.
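
As a rough sketch of that last idea (nothing here is in the spec), the Python
below keeps only a pointer in the manifest entry and stores the bloom filter
itself elsewhere. The `index_location` field, the `.bloom` suffix, the
`index_store` dict standing in for object storage, and the filter sizing
(8192 bits, 4 hashes) are all assumptions made up for illustration.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


class BloomFilter:
    """Minimal bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


@dataclass
class ManifestEntry:
    """Hypothetical manifest entry: holds a pointer, not the filter itself."""
    path: str
    index_location: Optional[str] = None


# Stands in for separate storage of index files (assumption).
index_store: Dict[str, BloomFilter] = {}


def write_index(path: str, values: Iterable[str]) -> ManifestEntry:
    """Build a filter for one data file and store it outside the manifest."""
    bloom = BloomFilter(num_bits=8192, num_hashes=4)
    for value in values:
        bloom.add(value)
    location = path + ".bloom"
    index_store[location] = bloom
    return ManifestEntry(path, location)


def may_contain(entry: ManifestEntry, value: str) -> bool:
    """Return False only when the file definitely lacks the value."""
    if entry.index_location is None:
        return True  # no index available, so the file must be scanned
    return index_store[entry.index_location].might_contain(value)


record_ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, f"record-{i}")) for i in range(100)]
indexed = write_index("data/file-0.parquet", record_ids)
unindexed = ManifestEntry("data/file-1.parquet")
```

A planner would load the filter only for files that survive the lower and
upper bound check, which keeps the manifests themselves small.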

rb

On Sun, Mar 3, 2019 at 9:40 PM Joshua Hollander <jholl...@gmail.com> wrote:

> Hello, really interesting project.  Has any consideration been given to
> adding bloom filters to the column stats in the manifests?
>
> I've developed a custom metastore which stores bloom filters for pruning
> along side the lower and upper bounds.  This allows us to do reasonably
> fast needle in a haystack searches in our data lake.  I know it might be a
> bit of a unique use case as most folks are looking for aggregates and
> trends.
>
> I realize that it would likely cause a severe bloating of the manifest
> file considering bloom filter sizes for this kind of data (we use the
> scalable bloom filter variant in an attempt to mitigate this).  Any
> interest?  I'd be interested in possibly contributing to the feature if
> there was.
>
> Thanks,
> -Josh
>

-- 
Ryan Blue
Software Engineer
Netflix
