<div style="font-size: 15px;"><div style="">Hey everyone!<br style=""><br
style=""></div></div><div style="font-size: 15px;"><div style="">Wanted to
share a few additions/changes we've made and open up the thread for any
remaining comments/questions before we go to a vote:<br style=""></div><div
style=""><br style=""></div><div style="">Considering <b
style="">@Jarek Potiuk's </b>comments, we have changed the mapping_id
to an integer so that we stay within MySQL's key size limits. We have
also designed a system that computes the cartesian product when a user
gives two or more lists for processing.<br style=""><br
style="">Please take a moment to look this over if you have time. We
will start the vote tomorrow if no one else has comments or questions.<br
style=""><br style="">Happy Airflowing!<br style="">Daniel</div><div
style=""><br></div></div><div id="cm_signature" style=""></div><div
style=""><br style=""></div><div style=""><br style=""></div><div
id="cm_quote_div" style="display: block;"><div style="">On Wed, Nov 10,
2021 at 3:32am, Jarek Potiuk <<a href="mailto:ja...@potiuk.com"
style="">ja...@potiuk.com</a>> wrote:<br style=""></div><blockquote
style="margin: 0px; border-left: 1px solid rgb(214, 214, 214);
padding-left: 10px;"><div style="">> So yes, you can use a generator
literal, but it will be evaluated to completion at DAG parse time.<br
style=""></div><div style=""><br style=""></div><div style="">Makes perfect
sense.<br style=""></div><div style=""><br style=""></div><div
style="">> 2) For the UI part - I think we should consider "filtering"
from day one, as a first-class citizen. Filtering for "failed" tasks seems
like a super-useful feature for operations people.<br style=""></div><div
style="">><br style=""></div><div style="">><br style=""></div><div
style="">> Yes, Brent and I have thought about that -- it's somewhat
orthogonal to this AIP in that the AIP doesn't depend on filtering in the UI,
and there are cases where filtering would already be useful. But we will
look at doing it as part of this project.<br style=""></div><div
style=""><br style=""></div><div style="">Cool!<br style=""></div><div
style=""><br style=""></div><div style="">> Roughly what I'm proposing
is the PK on the task_instance table becomes something like (dag_run_id,
task_id, mapping_index) where mapping_index is now an array index into a
JSON value in a row in the new task_mapping table. And since primary key
columns can't be nullable, and arrays are zero indexed, we'd have to use -1
as the "not mapped" value.<br style=""></div><div style=""><br
style=""></div><div style="">> Does that make any sense?<br
style=""></div><div style=""><br style=""></div><div style="">Absolutely.
That was very much what I also had in mind. I think<br style=""></div><div
style="">indexing with an auto-incremented id and keeping a table with the
index<br style=""></div><div style="">array makes way more sense to be
honest.<br style=""></div><div style=""><br style=""></div><div style="">It
also makes it possible to implement a (potentially interesting)<br
style=""></div><div style="">case where mapping key values will actually be
duplicated without<br style=""></div><div style="">adding any artificial
indexing column. I am not sure how practical the<br style=""></div><div
style="">case is (though I believe it is quite a common case), but there is
a<br style=""></div><div style="">range of tasks that might have different
output even if they have the<br style=""></div><div style="">same input.
Any kind of randomisation might mean, for example, that<br
style=""></div><div style="">the same learning task with exactly the same
parameters will lead to a<br style=""></div><div style="">different output.
And you might want to run N of such identical tasks<br style=""></div><div
style="">and average, or otherwise aggregate, the results of those
N<br style=""></div><div style="">tasks with identical input.<br
style=""></div><div style=""><br style=""></div><div style="">It was also
possible with the original design by adding an extra index<br
style=""></div><div style="">field to the JSON to make it unique, but
it was a bit "not clean"<br style=""></div><div style="">in the sense that
it made the input only "identical-ish". With the design<br style=""></div><div
style="">where we keep TaskMapping and index it with the index in the
table, we<br style=""></div><div style="">have a much more "clean" solution
for this. You can clearly see which<br style=""></div><div style="">of
those tasks had identical input by simply comparing the JSON values.<br
style=""></div><div style="">Overall I think having a unique index to
handle this case is generally<br style=""></div><div style="">a better design
- even if we could index JSONB columns in MySQL.<br style=""></div><div
style=""><br style=""></div><div style="">> There's still a bit more
info to work out, such as how we find the right row in the task_mapping table,
as it is associated with the _parent_ task, not the mapped task itself.<br
style=""></div><div style="">Yep.<br style=""></div><div style=""><br
style=""></div><div style="">> Oh, it gets a bit more complicated
actually! It is possible to do `task.map(x=list1, y=list2)`
to map over two lists at once (making a cartesian product) and in this case
the mapping would be unique per _consuming_ task.<br style=""></div><div
style=""><br style=""></div><div style="">Yeah - as usual - the more
digging, the more discovered. I think<br style=""></div><div
style="">simply each task in the group should have a unique, incremental
ID<br style=""></div><div style="">assigned. For cartesians that would mean
that we have to agree on the<br style=""></div><div style="">sequence of
the inputs to take into account to calculate the index<br
style=""></div><div style="">(but they simply could be alphabetically
sorted). And it also means<br style=""></div><div style="">that we have to
"fix it" at a particular DAG run's "mapping evaluation"<br style=""></div><div
style="">time (but that's precisely what the TaskMapping table will do, as I<br
style=""></div><div style="">understand it).<br style=""></div><div
style=""><br style=""></div><div style="">J.<br
style=""></div></blockquote></div><div style=""><br style=""></div>
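<div style="">To make the cartesian-product indexing discussed above a bit more concrete, here is a rough sketch. It is only an illustration under assumptions: expand_mapping and UNMAPPED_INDEX are hypothetical names, not part of Airflow or the AIP, and sorting the input names alphabetically is just the option floated in the thread for agreeing on a stable order.</div>

```python
from itertools import product

# Sentinel floated in the thread for tasks that are not mapped at all
# (primary key columns cannot be NULL, and array indexes start at 0).
UNMAPPED_INDEX = -1


def expand_mapping(**inputs):
    """Return one kwargs dict per mapped task.

    The position of each dict in the returned list plays the role of
    the mapping index. Hypothetical helper, for illustration only.
    """
    # Sort the input names so the index order does not depend on the
    # order in which the keyword arguments were written.
    names = sorted(inputs)
    combos = product(*(inputs[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Two lists give 3 * 2 = 6 mapped tasks, indexed 0..5.
mapping = expand_mapping(y=["a", "b"], x=[1, 2, 3])
for index, kwargs in enumerate(mapping):
    print(index, kwargs)
```

<div style="">Because the names are sorted first, expand_mapping(x=..., y=...) and expand_mapping(y=..., x=...) yield the same index for the same combination, which is the property needed if the result is stored once per DAG run and referenced by integer index.</div>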