Re: AIP-42: Dynamic Task Mapping

Jarek Potiuk Wed, 17 Nov 2021 13:16:25 -0800

Looks great! made my small comments in-line, but my main concerns are
already addressed !


On Wed, Nov 17, 2021 at 9:42 PM Daniel Imberman
<[email protected]> wrote:
>
> Hey everyone!
>
> Wanted to update a few additions/changes we've made and open up the thread 
> for any remaining comments/questions before we go to a vote:
>
> Considering @Jarek Potiuk's comments, we have changed the mapping_id to an 
> integer to ensure that we are within the range of MySQL's key size limits. We 
> have also designed a system that can do cartesian product joining when a user 
> gives two or more lists for processing.
>
> Please take a chance to look this over if you have any time, we will start 
> the vote tomorrow if no one else has comments or questions.
>
> Happy Airflowing!
> Daniel
>
>
>
> On Wed, Nov 10, 2021 at 3:32am, Jarek Potiuk <[email protected]> wrote:
>
> > So yes, you can use a generator literal, but it will be evaluated to 
> > completion at DAG parse time.
>
> Makes perfect sense.
>
> > 2) For the UI part - I think we should consider "filtering" from day one 
> > and first-class citizens. Filtering for "failed" tasks seems like a 
> > super-useful feature for operations people.
> >
> >
> > Yes, Brent and I have thought about that -- it's somewhat orthogonal to 
> > this AIP in that the AIP doesn't depend filtering in the UI, and there are 
> > cases where filtering would already be useful. But we will look at doing it 
> > as part of this project.
>
> Cool!
>
> > Roughly what I'm proposing is the PK on the task_instance table becomes 
> > something like (dag_run_id, task_id, mapping_index) where mapping_index is 
> > now an array index into a JSON value in a row in the new task_mapping 
> > table. And since primary key columns can't be nullable, and arrays are zero 
> > indexed, we'd have to use -1 as "not mapped" value.
>
> > Does that make any sense?
>
> Absolutely. That was very much what I also had in mind. I think
> indexing with auto-incremented id and keeping a table with the index
> array makes way more sense to be honest.
>
> It also makes it possible to implement a (potentially interesting)
> case where mapping key values will actually be multiplicated without
> adding any artificial indexing column. I am not sure how practical the
> case is (though I believe it is quite a common case), but there is a
> range of tasks that might have different output even if they have the
> same input. Any kind of randomisation might cause for example that
> the same learning task with exactly the same parameters will lead to a
> different output. And you might want to run N of such identical tasks
> and average, or somehow differently aggregate the result of those N
> tasks with identical input.
>
> It was also possible with the original design by adding an extra index
> field to JSON which would make it unique, but it was a bit "not clean"
> in the sense that it made the input "identic-ish". With the design
> where we keep TaskMapping and index it with the index in the table, we
> have a much more "clean" solution for this. You can clearly see which
> of those tasks had identical input by simply comparing the JSONS.
> Overall I think having a unique index to handle this case is generally
> better design - even if we could index JSONB columns in mysql.
>
> > There's still a bit more info to work out, such as how we find the right 
> > row in task_mapping table, as it is associated with the _parent_ task, not 
> > the mapped task itself.
> Yep.
>
> > Oh, it gets a bit more complicated actually! It is possible to do ` 
> > task.map(x=list1, y=list2)` to map over two lists at once (making a 
> > cartesian product) and in this case the mapping would be unique per 
> > _consuming_ tasks.
>
> Yeah - as usual - the more digging, the more discovered. I think
> simply each task in the group should have a unique, incremental ID
> assigned. For cartesians that would mean that we have to agree on the
> sequence of the inputs to take into account to calculate the index
> (but they simply could be alphabetically sorted). And it also means
> that we have to "fix it" at particular dag run "mapping evaluation"
> time (but that's precisely what TaskMapping table will do I
> understand).
>
> J.
>
>

Re: AIP-42: Dynamic Task Mapping

Reply via email to