+1 (binding). I think it's a big improvement, and I agree that the "incremental" wording might mislead people: we essentially always "replace" data, just at a finer granularity (the partition level), so we never "add" things incrementally.
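
To make "replace at partition level" concrete, here is a minimal Python sketch of the "reprocess the last N partitions" approach mentioned in the quoted thread below. It assumes daily partitions keyed by an immutable `event_date` field; the function names, the dict-based table, and the row layout are hypothetical illustrations, not part of the AIP or of Airflow.

```python
from datetime import date, timedelta


def last_n_partitions(run_date: date, n: int) -> list[date]:
    """Return the n daily partition keys ending at run_date, oldest first."""
    return [run_date - timedelta(days=i) for i in range(n - 1, -1, -1)]


def replace_partitions(
    table: dict[date, list[dict]],
    source_rows: list[dict],
    keys: list[date],
) -> None:
    """Overwrite each listed partition wholesale from the source rows.

    Nothing is appended: whatever the partition held before is discarded,
    so a row updated upstream appears exactly once after the rerun.
    """
    for key in keys:
        # Hypothetical row layout: every row carries an immutable "event_date".
        table[key] = [row for row in source_rows if row["event_date"] == key]


if __name__ == "__main__":
    aug1, aug2 = date(2024, 8, 1), date(2024, 8, 2)
    # The target already holds a stale version of row 1.
    table = {aug1: [{"id": 1, "event_date": aug1, "value": "old"}]}
    # The source now has an updated row 1 and a new row 2.
    source = [
        {"id": 1, "event_date": aug1, "value": "new"},
        {"id": 2, "event_date": aug2, "value": "x"},
    ]
    replace_partitions(table, source, last_n_partitions(aug2, 2))
    print(table)
```

Because the partition key never changes for a given record, rerunning the same partitions is idempotent. Keying on an "updated at" value instead would move row 1 to a different partition on every update, which is the problem raised in the thread.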

On Sun, Aug 4, 2024 at 10:02 PM Shahar Epstein <sha...@apache.org> wrote:

> +1 (binding)
>
> On Fri, Aug 2, 2024 at 10:43 PM Kaxil Naik <kaxiln...@gmail.com> wrote:
>
> > Yup, I am fine removing that language to make it explicit but leave it
> > up to TP.
> >
> > On Fri, 2 Aug 2024 at 19:56, Daniel Standish
> > <daniel.stand...@astronomer.io.invalid> wrote:
> >
> > > My concern with the AIP is the talk of support for incremental data
> > > pipelines. In an incremental data pipeline, you don't think of a
> > > delta load (let's say a collection of updated rows) as a partition.
> > > A partition in data is defined by a partition key, which should be
> > > an immutable field or fields in a record. You can't use an
> > > "updated at" field as a partition key because then the same record
> > > can be in multiple partitions. And it doesn't make sense either
> > > when you think about what it would mean to "reprocess a partition"
> > > -- the rows that were in that partition now might not be there
> > > anymore. So I think this AIP needs to not brand itself as any kind
> > > of solution for incremental loads.
> > >
> > > If you're processing Hive partitions (by time), and those data can
> > > be updated, you might need to reprocess the last N partitions each
> > > time. That's a common way to handle updates. (And maybe something
> > > that we should consider supporting in this AIP.) If you're doing
> > > some kind of change tracking, you're just processing rows or new
> > > files, and it doesn't make sense to consider those a partition.
> > >
> > > My suggestion would be to remove the language talking about
> > > incremental loads from this AIP.