Hi everyone,
Here is a doc for upcoming agendas and notes from the community sync:
https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=sharing
Everyone should be able to comment using that link and a google account. If
you'd like to add agenda items, please request editor access and I'll add
you. As discussed in the sync, we'll try to use that as a running log.
The notes may be a bit incomplete: I accidentally closed the text editor
where I was taking notes without saving, so the discussion below is
reconstructed from memory. Feel free to comment and fix it!
I'm also going to copy the notes below:
27 May 2020

- Agenda doc
  - Dan: Is there an agenda for this meeting?
  - Ryan: there is one in the invite, but it was empty
  - Dan: we should use a doc for notes and the next agenda
- Guava is now shaded in iceberg-bundled-guava. Please let us know if you hit problems.
- Update the build from gradle-consistent-versions
  - Ryan: We currently use gradle-consistent-versions. That’s great in some ways, but it doesn’t allow multiple versions of the same dependency in different modules, which blocks adding a Spark 3 module to master. It is also blocking the MR/Hive work because Hive uses a different Guava version. Even though we shade to avoid the conflict, the plugin can only support one version.
  - Ryan: I’ve gone through options and the best option looks like the Nebula plugins (maintained by Netflix). There is a PR open to move to these, #1067 <https://github.com/apache/iceberg/pull/1067>. Please have a look at the PR and comment!
  - Ratandeep: What are the drawbacks of the one built into gradle?
  - Ryan: It doesn’t use versions.props, so changes are larger; use is awkward because locking is a side effect of other tasks; and it is much newer and requires bumping to a new major version of gradle. Also, I can ask the Nebula team for support if we use the Nebula modules.
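For context on the single-version limitation: gradle-consistent-versions pins every dependency for all modules in one root-level versions.props file. A hypothetical fragment (coordinates and versions here are illustrative, not our actual pins):

```properties
# versions.props: one version per group:artifact, shared by every module.
# There is no per-module override, so a second Guava version for the
# Hive module (or a second Spark for a Spark 3 module) cannot be expressed.
com.google.guava:guava = 28.0-jre
org.apache.spark:spark-hive_2.11 = 2.4.4
```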
- Bloom filters for GDPR use cases
  - Miao: For GDPR data requests, we need to scan tables for specific IDs. Even when batching requests together to minimize the number of scans, this is very expensive for large tables. Matching records are usually stored in just a few files, so keeping bloom filters for ID columns reduces cost significantly.
  - Ryan: Why doesn’t partitioning help? We normally recommend bucketing or sorting by ID columns.
  - Miao: ID columns change between user schemas, and requests may use a secondary ID not used in the table layout.
  - Owen: Why do this at the table level, if bloom filters are already supported in ORC and Parquet? Doesn’t that duplicate work?
  - Miao: We didn’t want to be tied to a specific format.
  - Owen: Are bloom filters the right solution? It’s easy to misconfigure them.
  - Ryan: I agree they are easy to do incorrectly, but I think there is a good argument for this as a secondary index that is independent of the file format. Bloom filters are hard to get right, especially if you’re trying to do it while minimizing memory consumption for a columnar file format. Usually, parameters are chosen up front and might be wrong. At the table level, this could be moved offline, so indexes are maintained by a service that can do a better job choosing the right tuning parameters for each file’s bloom filter.
  - Ryan: I think we should think of this as a secondary index. We might have other techniques for secondary indexes; this one happens to use a bloom filter. We’ve had other people interested in secondary indexes, so maybe this is a good opportunity to add a way to track and maintain them.
  - Miao agreed to write up their use case and approach to start the discussion on secondary indexes.
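To make the per-file idea above concrete, here is a minimal, self-contained sketch: build a bloom filter over a file's ID column, then test candidate IDs to decide whether the file can be skipped. The class name, sizing, and hashing scheme are illustrative only, not Iceberg's implementation.

```java
import java.util.BitSet;

// Sketch of a per-file secondary index: a bloom filter over one ID column.
// A "false" answer means the file definitely lacks the ID; "true" means maybe.
class FileIdBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  FileIdBloomFilter(int numBits, int numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new BitSet(numBits);
  }

  // Derive the i-th bit index from two base hashes (double hashing).
  private int index(long id, int i) {
    int h1 = Long.hashCode(id);
    int h2 = Long.hashCode(Long.rotateLeft(id, 31) * 0x9E3779B97F4A7C15L);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  void add(long id) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(id, i));
    }
  }

  boolean mightContain(long id) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(id, i))) {
        return false; // definitely not in this file
      }
    }
    return true; // maybe in this file; must read it
  }

  public static void main(String[] args) {
    FileIdBloomFilter filter = new FileIdBloomFilter(1 << 16, 5);
    for (long id = 0; id < 1000; id++) {
      filter.add(id);
    }
    // Bloom filters have no false negatives, so inserted IDs always match.
    System.out.println(filter.mightContain(42L)); // prints true
  }
}
```

The offline-maintenance point applies here too: a service rebuilding these filters per file can size numBits and numHashes from the file's actual ID count instead of guessing up front.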
- Update on row-level deletes
  - Ryan: If you want to get involved, we’ve updated the milestone with tasks. Tasks like writing a row filter using a set of equality values should be good candidates because they can be written independently and tested in isolation before the other work is done.
  - Ryan: Another area that could use help is updating tests for sequence numbers. We’re running all of the operations tests that extend TableTestBase on both v1 and v2 tables. I’ve added a way to make assertions for v1 and v2, using V1Assert.assertEquals or V2Assert.assertEquals, so we can go back and add assertions to all of the existing tests that exercise lots of different cases. It would be great to have more help adding those sequence number assertions!
  - Ryan: In the last few weeks, we’ve added content types to metadata that track whether files contain deletes or data in the metadata tree. There’s currently an open PR, #1064 <https://github.com/apache/iceberg/pull/1064>, that adds DeleteFile and extends readers and writers to work with it. We decided last sync to store either delete files or data files in a manifest, but not both; using separate interfaces enforces this in Java. I’m also working on a branch that separates the manifests in a snapshot into delete manifests and data manifests, which will help us identify everything that needs to be updated to support delete manifests.
  - Ryan: If you’d like to help review, please speak up and we’ll tag you on issues. (Gautham Kowshik and Ryan Murray volunteered.)
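For anyone eyeing the "row filter using a set of equality values" task above, the core shape is simple: drop any row whose key matches a value from an equality-delete set. A rough sketch follows; the row-as-long[] representation and single-column key are simplifications for illustration, not Iceberg's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative equality-delete filter: rows are long[] and the first column
// is the key; any row whose key appears in the delete set is removed.
class EqualityDeleteFilter {
  static List<long[]> filter(List<long[]> rows, Set<Long> deletedIds) {
    return rows.stream()
        .filter(row -> !deletedIds.contains(row[0]))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<long[]> rows = Arrays.asList(
        new long[] {1, 100},
        new long[] {2, 200},
        new long[] {3, 300});
    Set<Long> deleted = new HashSet<>(Arrays.asList(2L));
    // Row with key 2 is deleted, leaving two rows.
    System.out.println(EqualityDeleteFilter.filter(rows, deleted).size()); // prints 2
  }
}
```

This is exactly why the task suits isolated testing: the filter only needs rows and a value set, with no dependency on the rest of the delete-file work.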
--
Ryan Blue