My concern with the AIP is the talk of support for incremental data
pipelines.  In an incremental data pipeline, you don't think of a delta
load (let's say a collection of updated rows) as a partition.  A partition
in data is defined by a partition key, which should be an immutable field
or fields in a record.  You can't use an "updated at" field as a partition
key because then the same record can be in multiple partitions.  And it
doesn't make sense either when you think about what it would mean to
"reprocess a partition" -- the rows that were in that partition might no
longer be there.  So I think this AIP should not brand itself as any
kind of solution for incremental loads.
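
To make the "same record in multiple partitions" problem concrete, here is
a small Python sketch.  The rows, field names and the partition_by helper
are all made up for illustration; nothing here comes from the AIP itself.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group record ids by the value of the given field (hypothetical helper)."""
    partitions = defaultdict(set)
    for row in rows:
        partitions[row[key]].add(row["id"])
    return dict(partitions)


# Two delta loads (illustrative data); record 1 is updated on the second day.
load_day1 = [
    {"id": 1, "created_at": "2024-01-01", "updated_at": "2024-01-01"},
    {"id": 2, "created_at": "2024-01-01", "updated_at": "2024-01-01"},
]
load_day2 = [
    {"id": 1, "created_at": "2024-01-01", "updated_at": "2024-01-02"},
]
all_rows = load_day1 + load_day2

# Mutable key: record 1 lands in both the 01-01 and 01-02 groups.
by_updated = partition_by(all_rows, "updated_at")

# Immutable key: every record belongs to exactly one group.
by_created = partition_by(all_rows, "created_at")
```

With "updated_at" as the key, record 1 shows up under two keys, so
"reprocess the 2024-01-01 partition" has no stable meaning.  With
"created_at" each record has exactly one home.
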
If you're processing Hive partitions (by time), and those data can be
updated, you might need to reprocess the last N partitions each time.
That's a common way to handle updates.  (And maybe something that we should
consider supporting in this AIP.)  If you're doing some kind of change
tracking, you're just processing rows or new files, and it doesn't make
sense to consider those a partition.
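
The "reprocess the last N partitions" approach could look roughly like the
sketch below.  It assumes daily partitions keyed by ISO date, and the
function name and the lookback parameter are my own invention, not
anything the AIP proposes.

```python
from datetime import date, timedelta


def partitions_to_reprocess(run_date, lookback):
    """Return the `lookback` most recent daily partition keys, oldest first.

    Assumes one partition per calendar day, ending at run_date inclusive.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback - 1, -1, -1)
    ]


# Each run rebuilds the current partition plus the two before it.
keys = partitions_to_reprocess(date(2024, 1, 10), lookback=3)
```

The partition keys stay immutable here; it's just that each run covers a
sliding window of them so that late updates get picked up.
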
My suggestion would be to remove the language talking about incremental
loads from this AIP.
