Hi everyone,
Here is a doc for upcoming agendas and notes from the community sync:
https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=sharing
Everyone should be able to comment using that link and a google account. If
you'd like to add agenda items, please request editor access and I'll add
you. As discussed in the sync, we'll try to use that as a running log.
The notes may be a bit incomplete: I accidentally closed the text editor
where I was taking notes without saving, so the discussion below is
reconstructed from memory. Feel free to comment and fix it!
I'm also going to copy the notes below:
27 May 2020

- Agenda doc
  - Dan: Is there an agenda for this meeting?
  - Ryan: there is one in the invite, but it was empty
  - Dan: we should use a doc for notes and the next agenda
- Guava is now shaded in iceberg-bundled-guava. Please let us know if you hit problems.
- Update the build from gradle-consistent-versions
  - Ryan: We currently use gradle-consistent-versions. That’s great in some ways, but it doesn’t allow multiple versions of the same dependency in different modules, which blocks adding a Spark 3 module to master. It is also blocking the MR/Hive work because Hive uses a different Guava version. Even though we shade to avoid the conflict, the plugin can only support one version.
  - Ryan: I’ve gone through options and the best option looks like the Nebula plugins (maintained by Netflix). There is a PR open to move to these, #1067 <https://github.com/apache/iceberg/pull/1067>. Please have a look at the PR and comment!
  - Ratandeep: What are the drawbacks of the one built into gradle?
  - Ryan: It doesn’t use versions.props, so changes are larger; use is awkward because locking is a side effect of other tasks; and it is much newer and requires bumping to a new major version of gradle. Also, I can ask the Nebula team for support if we use the Nebula modules.
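For context on the single-version limitation: gradle-consistent-versions pins every dependency for all modules in one root-level versions.props file. A hypothetical fragment (coordinates and versions here are illustrative, not our actual pins):

```properties
# versions.props: one version per group:artifact, shared by every module.
# There is no per-module override, so a second Guava version for the
# Hive module (or a second Spark for a Spark 3 module) cannot be expressed.
com.google.guava:guava = 28.0-jre
org.apache.spark:spark-hive_2.11 = 2.4.4
```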
- Bloom filters for GDPR use cases
  - Miao: For GDPR data requests, we need to scan tables for specific IDs. Even when batching requests together to minimize the number of scans, this is very expensive for large tables. Matching records are usually stored in just a few files, so keeping bloom filters for ID columns reduces cost significantly.
  - Ryan: Why doesn’t partitioning help? We normally recommend bucketing or sorting by ID columns.
  - Miao: ID columns change between user schemas, and requests may use a secondary ID not used in the table layout.
  - Owen: Why do this at the table level, if bloom filters are already supported in ORC and Parquet? Doesn’t that duplicate work?
  - Miao: We didn’t want to be tied to a specific format.
  - Owen: Are bloom filters the right solution? It’s easy to misconfigure them.
  - Ryan: I agree they are easy to do incorrectly, but I think there is a good argument for this as a secondary index that is independent of the file format. Bloom filters are hard to get right, especially if you’re trying to do it while minimizing memory consumption for a columnar file format. Usually, parameters are chosen up front and might be wrong. At the table level, this could be moved offline, so indexes are maintained by a service that can do a better job choosing the right tuning parameters for each file’s bloom filter.
  - Ryan: I think we should think of this as a secondary index. We might have other techniques for secondary indexes; this one happens to use a bloom filter. We’ve had other people interested in secondary indexes, so maybe this is a good opportunity to add a way to track and maintain them.
  - Miao agreed to write up their use case and approach to start the discussion on secondary indexes.
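To make the per-file idea above concrete, here is a minimal, self-contained sketch: build a bloom filter over a file's ID column, then test candidate IDs to decide whether the file can be skipped. The class name, sizing, and hashing scheme are illustrative only, not Iceberg's implementation.

```java
import java.util.BitSet;

// Sketch of a per-file secondary index: a bloom filter over one ID column.
// A "false" answer means the file definitely lacks the ID; "true" means maybe.
class FileIdBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  FileIdBloomFilter(int numBits, int numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new BitSet(numBits);
  }

  // Derive the i-th bit index from two base hashes (double hashing).
  private int index(long id, int i) {
    int h1 = Long.hashCode(id);
    int h2 = Long.hashCode(Long.rotateLeft(id, 31) * 0x9E3779B97F4A7C15L);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  void add(long id) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(id, i));
    }
  }

  boolean mightContain(long id) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(id, i))) {
        return false; // definitely not in this file
      }
    }
    return true; // maybe in this file; must read it
  }

  public static void main(String[] args) {
    FileIdBloomFilter filter = new FileIdBloomFilter(1 << 16, 5);
    for (long id = 0; id < 1000; id++) {
      filter.add(id);
    }
    // Bloom filters have no false negatives, so inserted IDs always match.
    System.out.println(filter.mightContain(42L)); // prints true
  }
}
```

The offline-maintenance point applies here too: a service rebuilding these filters per file can size numBits and numHashes from the file's actual ID count instead of guessing up front.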
- Update on row-level deletes
  - Ryan: If you want to get involved, we’ve updated the milestone with tasks. Tasks like writing a row filter using a set of equality values should be good candidates because they can be written independently and tested in isolation before the other work is done.
  - Ryan: Another area that could use help is updating tests for sequence numbers. We’re running all of the operations tests that extend TableTestBase on both v1 and v2 tables. I’ve added a way to make assertions for v1 and v2, using V1Assert.assertEquals or V2Assert.assertEquals, so we can go back and add assertions to all of the existing tests that exercise lots of different cases. It would be great to have more help adding those sequence number assertions!
  - Ryan: In the last few weeks, we’ve added content types to metadata that track whether files contain deletes or data in the metadata tree. There’s currently an open PR, #1064 <https://github.com/apache/iceberg/pull/1064>, that adds DeleteFile and extends readers and writers to work with it. We decided last sync to store either delete files or data files in a manifest, but not both; using separate interfaces enforces this in Java. I’m also working on a branch that separates the manifests in a snapshot into delete manifests and data manifests, which will help us identify everything that needs to be updated to support delete manifests.
  - Ryan: If you’d like to help review, please speak up and we’ll tag you on issues. (Gautham Kowshik and Ryan Murray volunteered.)
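For anyone eyeing the "row filter using a set of equality values" task above, the core shape is simple: drop any row whose key matches a value from an equality-delete set. A rough sketch follows; the row-as-long[] representation and single-column key are simplifications for illustration, not Iceberg's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative equality-delete filter: rows are long[] and the first column
// is the key; any row whose key appears in the delete set is removed.
class EqualityDeleteFilter {
  static List<long[]> filter(List<long[]> rows, Set<Long> deletedIds) {
    return rows.stream()
        .filter(row -> !deletedIds.contains(row[0]))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<long[]> rows = Arrays.asList(
        new long[] {1, 100},
        new long[] {2, 200},
        new long[] {3, 300});
    Set<Long> deleted = new HashSet<>(Arrays.asList(2L));
    // Row with key 2 is deleted, leaving two rows.
    System.out.println(EqualityDeleteFilter.filter(rows, deleted).size()); // prints 2
  }
}
```

This is exactly why the task suits isolated testing: the filter only needs rows and a value set, with no dependency on the rest of the delete-file work.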
--
Ryan Blue