+Iceberg Dev List <dev@iceberg.apache.org>, the project has moved to Apache.
Hi Josh,

We have considered bloom filters. There are some cases where they could be useful, but there are generally better ways to accomplish the same task.

I typically recommend sorting on an ID field to take advantage of the lower and upper bounds that Iceberg already supports. In addition to tightening the range bounds, this also maximizes the likelihood that Parquet dictionaries are used for encoding. When dictionaries are available, that's better than a bloom filter because the dictionary is already generated and stored; plus it can be used for filtering without any false positives. Sorting in Spark also handles skew, which is a great bonus.

The use case where bloom filters can provide value is when you have a high rate of unique values (like a UUID used to identify a record) and cannot sort the data because of the volume and how quickly it needs to be available for downstream consumption. Iceberg also erodes this use case because you can sort the data in the background and atomically swap the unsorted data for sorted data. Because you can safely swap in data you've optimized, the data is available quickly and only a small portion of the new data takes a long time to scan through before it is optimized for reads.

I think that the remaining use case for bloom filters is a narrow one. If you'd still like to work on it, we can think through what we would need to add to the spec. Bloom filters are too large to add to the existing metadata structures, like manifests, but we could add an index location for each file that stores a bloom filter separately.

rb

On Sun, Mar 3, 2019 at 9:40 PM Joshua Hollander <jholl...@gmail.com> wrote:

> Hello, really interesting project. Has any consideration been given to
> adding bloom filters to the column stats in the manifests?
>
> I've developed a custom metastore which stores bloom filters for pruning
> alongside the lower and upper bounds. This allows us to do reasonably
> fast needle in a haystack searches in our data lake.
> I know it might be a
> bit of a unique use case as most folks are looking for aggregates and
> trends.
>
> I realize that it would likely cause a severe bloating of the manifest
> file considering bloom filter sizes for this kind of data (we use the
> scalable bloom filter variant in an attempt to mitigate this). Any
> interest? I'd be interested in possibly contributing to the feature if
> there was.
>
> Thanks,
> -Josh

--
Ryan Blue
Software Engineer
Netflix
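
To illustrate the point above about sorting on an ID field, here is a minimal Python simulation. It is not Iceberg code: the helper names (file_bounds, files_to_scan), the 100-row file size, and the random IDs are all assumptions made for illustration. It splits the same unique IDs into fixed-size "files", records each file's lower and upper bound, and counts how many files a point lookup cannot skip using those bounds alone.

```python
import random


def file_bounds(ids, rows_per_file):
    """Split ids into consecutive 'files' and return each file's (lower, upper) bounds."""
    files = [ids[i:i + rows_per_file] for i in range(0, len(ids), rows_per_file)]
    return [(min(f), max(f)) for f in files]


def files_to_scan(bounds, needle):
    """Count files whose [lower, upper] range could contain the needle."""
    return sum(1 for lower, upper in bounds if lower <= needle <= upper)


random.seed(0)
# Hypothetical workload: 10,000 unique IDs in arrival (unsorted) order.
ids = random.sample(range(1_000_000), 10_000)
needle = ids[1234]  # an ID that is known to exist

# Same data, same file size; only the write order differs.
unsorted_hits = files_to_scan(file_bounds(ids, 100), needle)
sorted_hits = files_to_scan(file_bounds(sorted(ids), 100), needle)

print("files that must be scanned, unsorted:", unsorted_hits)
print("files that must be scanned, sorted:  ", sorted_hits)
```

Because the IDs are unique, sorting before writing makes the per-file ranges disjoint, so the lookup matches exactly one file. In arrival order, each file's range tends to cover most of the ID space, so bounds prune very little; that is the situation where a bloom filter would otherwise be the fallback.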