Yup, I am fine with removing that language to make it explicit, but I'll
leave it up to TP.

On Fri, 2 Aug 2024 at 19:56, Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> My concern with the AIP is the talk of support for incremental data
> pipelines.  In an incremental data pipeline, you don't think of a delta
> load (let's say a collection of updated rows) as a partition.  A partition
> in data is defined by a partition key, which should be an immutable field
> or fields in a record.  You can't use an "updated at" field as a partition
> key because then the same record can be in multiple partitions.  And it
> doesn't make sense either when you think about what it would mean to
> "reprocess a partition" -- the rows that were in that partition now might
> not be there anymore.  So I think this AIP needs to not brand itself as any
> kind of solution for incremental loads.
>
> If you're processing Hive partitions (by time), and those data can be
> updated, you might need to reprocess the last N partitions each time.
> That's a common way to handle updates.  (And maybe something that we should
> consider supporting in this AIP.)  If you're doing some kind of change
> tracking, you're just processing rows or new files, and it doesn't make
> sense to consider those a partition.
>
> My suggestion would be to remove the language talking about incremental
> loads from this AIP.
>
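
To make the point quoted above concrete, here is a minimal Python sketch.
Everything in it is an assumption for illustration only: the row layout, the
field names (`created`, `updated_at`) and the helpers `partition_by` and
`last_n_partitions` are hypothetical and are not part of the AIP or of
Airflow. It shows that partitioning on an immutable field gives stable
partitions, that partitioning on "updated at" moves a row between partitions
once it is edited, and what "reprocess the last N partitions" could look like
for daily partitions.

```python
from collections import defaultdict
from datetime import date, timedelta


def partition_by(rows, key):
    """Group row ids into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row["id"])
    return dict(partitions)


def last_n_partitions(run_date, n):
    """Daily partition keys to reprocess on `run_date`: the last n days, oldest first."""
    return [run_date - timedelta(days=offset) for offset in range(n - 1, -1, -1)]


# Hypothetical rows: `created` never changes, `updated_at` changes on every edit.
snapshot_1 = [
    {"id": 1, "created": date(2024, 8, 1), "updated_at": date(2024, 8, 1)},
    {"id": 2, "created": date(2024, 8, 1), "updated_at": date(2024, 8, 1)},
]
# The same table one day later, after row 1 has been edited.
snapshot_2 = [
    {"id": 1, "created": date(2024, 8, 1), "updated_at": date(2024, 8, 2)},
    {"id": 2, "created": date(2024, 8, 1), "updated_at": date(2024, 8, 1)},
]

# Partitioning on the immutable field: membership is the same in both snapshots.
stable_before = partition_by(snapshot_1, "created")
stable_after = partition_by(snapshot_2, "created")

# Partitioning on `updated_at`: row 1 has left the 2024-08-01 partition, so
# "reprocessing" that partition no longer sees the rows it originally held.
drift_before = partition_by(snapshot_1, "updated_at")
drift_after = partition_by(snapshot_2, "updated_at")

# Time-partitioned data that can be updated: reprocess a trailing window.
window = last_n_partitions(date(2024, 8, 2), 3)
```

The trailing-window helper only computes which partition keys to revisit; how
those would be expressed in the AIP is left open here.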
