I meant to do this last week but totally forgot. Since no-one really expressed any more concerns, I’ll finish the process and declare the results in a new thread.
TP

> On 5 Aug 2024, at 14:15, Ephraim Anierobi <ephraimanier...@gmail.com> wrote:
>
> +1 (binding)
>
> On Sun, 4 Aug 2024 at 21:14, Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> +1 (binding). I think it's a big improvement, and I agree the "incremental"
>> part might be misleading as we essentially always "replace" (but with finer
>> granularity - partition level) - so we never "add" things incrementally and
>> people might be misled here.
>>
>> On Sun, Aug 4, 2024 at 10:02 PM Shahar Epstein <sha...@apache.org> wrote:
>>
>>> +1 (binding)
>>>
>>> On Fri, Aug 2, 2024 at 10:43 PM Kaxil Naik <kaxiln...@gmail.com> wrote:
>>>
>>>> Yup, I am fine removing that language to make it explicit but leave it up
>>>> to TP.
>>>>
>>>> On Fri, 2 Aug 2024 at 19:56, Daniel Standish
>>>> <daniel.stand...@astronomer.io.invalid> wrote:
>>>>
>>>>> My concern with the AIP is the talk of support for incremental data
>>>>> pipelines. In an incremental data pipeline, you don't think of a delta
>>>>> load (let's say a collection of updated rows) as a partition. A partition
>>>>> in data is defined by a partition key, which should be an immutable field
>>>>> or fields in a record. You can't use an "updated at" field as a partition
>>>>> key because then the same record can be in multiple partitions. And it
>>>>> doesn't make sense either when you think about what it would mean to
>>>>> "reprocess a partition" -- the rows that were in that partition now might
>>>>> not be there anymore. So I think this AIP needs to not brand itself as any
>>>>> kind of solution for incremental loads.
>>>>> If you're processing hive partitions (by time), and those data can be
>>>>> updated, you might need to reprocess the last N partitions each time.
>>>>> That's a common way to handle updates. (And maybe something that we should
>>>>> consider supporting in this AIP.) If you're doing some kind of change
>>>>> tracking, you're just processing rows or new files, and it doesn't make
>>>>> sense to consider those a partition.
>>>>> My suggestion would be to remove the language talking about incremental
>>>>> loads from this AIP.