Those are both really neat use cases, but the one I had in mind was what
Ryan mentioned. It's something that Hoodie apparently supports or is
building support for, and it's an important use case for the systems that
my colleagues and I are building.

There are three scenarios:

   - An Extract system that is receiving updates/deletes from a source
   system. We wish to capture them as quickly as possible and make them
   available to users without having to restate the affected data files. The
   update patterns cannot be addressed with partitioning.
   - A Transform platform that is running a graph of jobs. For some jobs
   that are rebuilt from scratch, we would like to compress the output without
   losing the history.
   - A Transform / Load system that is building tables on GCS and
   registering them in Hive for querying by Presto. This system is
   incrementally updating views, and while some of those views are
   event-oriented (with most updates clustered in recent history), some of
   them are not, and in those cases there is no partitioning algorithm that
   will prevent us from updating virtually all partitions in every update.

We have one example of an internal solution but would prefer something less
bespoke. That system works as follows:

   1. For each dataset, unique key columns are defined.
   2. Datasets are partitioned (not necessarily by anything in the key).
   3. Upserts/deletes are captured in a mutation set.
   4. The mutation set is used to update affected partitions:
      1. Identify the previous/new partition for each upserted/deleted row.
      2. Open the affected partitions, drop all rows matching an
      upserted/deleted key.
      3. Append all upserts.
      4. Write out the result.
   5. We maintain an index (effectively an Iceberg snapshot) that says
   which partitions come from where (we keep the ones that are unaffected from
   the previous dataset version and add in the updated ones).
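The partition-update steps above could be sketched roughly as follows. This is an in-memory Python sketch only, not our actual implementation; `partition_fn` and `key_fn` are hypothetical stand-ins for the dataset's partitioning and unique-key definitions, and rows are plain dicts:

```python
def apply_mutation_set(partitions, mutations, partition_fn, key_fn):
    """Apply a mutation set to a partitioned dataset.

    partitions:   dict mapping partition id -> list of rows (dicts)
    mutations:    list of ("upsert", row) or ("delete", row) tuples
    partition_fn: maps a row to its partition id (hypothetical helper)
    key_fn:       maps a row to its unique key (hypothetical helper)
    """
    # Step 1: identify the previous/new partition for each mutated row.
    mutated_keys = {key_fn(row) for _, row in mutations}
    affected = {partition_fn(row) for _, row in mutations}
    for pid, rows in partitions.items():
        if any(key_fn(r) in mutated_keys for r in rows):
            affected.add(pid)  # previous partition of a moved/deleted row

    # Unaffected partitions are carried over unchanged (this mirrors the
    # index/snapshot in step 5: keep the old files, add the updated ones).
    result = {pid: rows for pid, rows in partitions.items()
              if pid not in affected}

    # Steps 2-4: open affected partitions, drop mutated keys, append upserts.
    rewritten = {pid: [] for pid in affected}
    for pid in affected:
        for row in partitions.get(pid, []):
            if key_fn(row) not in mutated_keys:
                rewritten[pid].append(row)  # drop rows matching a mutated key
    for op, row in mutations:
        if op == "upsert":
            rewritten[partition_fn(row)].append(row)  # append all upserts
    result.update(rewritten)
    return result
```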

This data is loaded into Presto and our current plan is to update it by
registering a view in Presto that applies recent mutation sets to the
latest merged version on the fly.

So to build this in Iceberg we would likely need to extend the Table spec
with:

   - An optional unique key specification, possibly composite, naming one
   or more columns for which there is expected to be at most one row per
   unique value.
   - The ability to indicate in the snapshot that a certain set of
   manifests are "base" data while other manifests are "diffs".
   - The ability in a "diff" manifest to indicate files that contain
   "deleted" keys (or else the ability in a given row to have a special column
   that indicates that the row is a "delete" and not an "upsert")
   - "diff" manifests would need to be ordered in the snapshot (as multiple
   "diff" manifests could affect a single row and only the latest of those
   takes effect).
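To make the intended read-side semantics concrete, here is a minimal sketch of what a reader applying ordered "diff" manifests over "base" data might do. This is an illustration of the proposal only, not Iceberg API code; `key_fn` is a hypothetical helper for extracting the unique key from a row:

```python
def merge_on_read(base_rows, diffs, key_fn):
    """Merge base data with an ordered list of diffs at read time.

    base_rows: iterable of rows (dicts) from "base" manifests
    diffs:     list of diff sets, oldest first; each entry is
               ("upsert", row) or ("delete", key)
    key_fn:    maps a row to its unique key (hypothetical helper)
    """
    merged = {key_fn(r): r for r in base_rows}
    for diff in diffs:  # ordering matters: later diffs override earlier ones
        for op, payload in diff:
            if op == "upsert":
                merged[key_fn(payload)] = payload
            else:  # a "delete" entry carries only the key
                merged.pop(payload, None)
    return list(merged.values())
```

This is also roughly what the maintenance work would periodically do when collapsing diffs into a new base.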

Obviously readers would need to be updated to correctly interpret this
data. And there is all kinds of supporting work that would be required in
order to maintain these (periodically collapsing diffs into the base, etc.).

Is this something for which PRs would be accepted, assuming we take all of
the necessary steps to make sure the direction is compatible with Iceberg's
other use cases?

On Wed, Nov 28, 2018 at 1:14 PM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> I’m not sure what use case Erik is looking for, but I’ve had users that
> want to do the equivalent of HBase’s column families. They want some of the
> columns to be stored separately and then merged together on read. The
> requirements would be that there is a 1:1 mapping between rows in the
> matching files and stripes.
>
> It would look like:
>
> file1.orc: struct<name:string,email:string>
> file2.orc: struct<lastAccess:timestamp>
>
> It would let them leave the stable information and only re-write the
> second column family when the information in the mutable column family
> changes. It would also support use cases where you add data enrichment
> columns after the data has been ingested.
>
> From there it is easy to imagine having a replace operator where file2’s
> version of a column replaces file1’s version.
>
> .. Owen
>
> > On Nov 28, 2018, at 9:44 AM, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> >
> > What do you mean by merge on read?
> >
> > A few people I've talked to are interested in building delete and upsert
> > features. Those would create files that track the changes, which would be
> > merged at read time to apply them. Is that what you mean?
> >
> > rb
> >
> > On Tue, Nov 27, 2018 at 12:26 PM Erik Wright
> > <erik.wri...@shopify.com.invalid> wrote:
> >
> >> Has any consideration been given to the possibility of eventual
> >> merge-on-read support in the Iceberg table spec?
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>