Re: AIP-42: Dynamic Task Mapping

Daniel Imberman Wed, 17 Nov 2021 12:42:27 -0800

<div style="font-size: 15px;"><div style="">Hey everyone!<br style=""><br style=""></div></div>
<div style="font-size: 15px;"><div style="">Wanted to update everyone on a few additions/changes we've made and open up the thread for any remaining comments/questions before we go to a vote:<br style=""></div><div style=""><br style=""></div><div style="">Considering <b style="">@Jarek Potiuk's</b> comments, we have changed the mapping_id to an integer to ensure that we stay within MySQL's key size limits. We have also designed a system that can do cartesian product joining when a user gives two or more lists for processing.<br style=""><br style="">Please take a moment to look this over if you have time; we will start the vote tomorrow if no one else has comments or questions.<br style=""><br style="">Happy Airflowing!<br style="">Daniel</div><div style=""><br></div></div>
<div id="cm_signature" style=""></div><div style=""><br style=""></div><div style=""><br style=""></div>
<div id="cm_quote_div" style="display: block;"><div style="">On Wed, Nov 10, 2021 at 3:32am, Jarek Potiuk <<a href="mailto:ja...@potiuk.com" style="">ja...@potiuk.com</a>> wrote:<br style=""></div>
<blockquote style="margin: 0px; border-left: 1px solid rgb(214, 214, 214); padding-left: 10px;">
<div style="">> So yes, you can use a generator literal, but it will be evaluated to completion at DAG parse time.<br style=""></div>
<div style=""><br style=""></div>
<div style="">Makes perfect sense.<br style=""></div>
<div style=""><br style=""></div>
<div style="">> 2) For the UI part - I think we should consider "filtering" a first-class citizen from day one. Filtering for "failed" tasks seems like a super-useful feature for operations people.<br style=""></div>
<div style="">><br style=""></div>
<div style="">><br style=""></div>
<div style="">> Yes, Brent and I have thought about that -- it's somewhat orthogonal to this AIP in that the AIP doesn't depend on filtering in the UI, and there are cases where filtering would already be useful. But we will look at doing it as part of this project.<br style=""></div>
<div style=""><br style=""></div>
<div style="">Cool!<br style=""></div>
<div style=""><br style=""></div>
<div style="">> Roughly what I'm proposing is the PK on the task_instance table becomes something like (dag_run_id, task_id, mapping_index) where mapping_index is now an array index into a JSON value in a row in the new task_mapping table. And since primary key columns can't be nullable, and arrays are zero indexed, we'd have to use -1 as the "not mapped" value.<br style=""></div>
<div style=""><br style=""></div>
<div style="">> Does that make any sense?<br style=""></div>
<div style=""><br style=""></div>
<div style="">Absolutely. That was very much what I also had in mind. I think indexing with an auto-incremented id and keeping a table with the index array makes way more sense, to be honest.<br style=""></div>
<div style=""><br style=""></div>
<div style="">It also makes it possible to implement a (potentially interesting) case where mapping key values are actually duplicated, without adding any artificial indexing column. I am not sure how practical the case is (though I believe it is quite a common one), but there is a range of tasks that might have different output even if they have the same input. Any kind of randomisation might mean, for example, that the same learning task with exactly the same parameters leads to a different output. And you might want to run N such identical tasks and average, or otherwise aggregate, the results of those N tasks with identical input.<br style=""></div>
<div style=""><br style=""></div>
<div style="">It was also possible with the original design by adding an extra index field to the JSON to make it unique, but that was a bit "not clean" in the sense that it made the input "identic-ish". With the design where we keep TaskMapping and index it with the index in the table, we have a much "cleaner" solution for this. You can clearly see which of those tasks had identical input by simply comparing the JSON values. Overall I think having a unique index to handle this case is generally better design, even if we could index JSONB columns in MySQL.<br style=""></div>
<div style=""><br style=""></div>
<div style="">> There's still a bit more info to work out, such as how we find the right row in the task_mapping table, as it is associated with the _parent_ task, not the mapped task itself.<br style=""></div>
<div style="">Yep.<br style=""></div>
<div style=""><br style=""></div>
<div style="">> Oh, it gets a bit more complicated actually! It is possible to do `task.map(x=list1, y=list2)` to map over two lists at once (making a cartesian product) and in this case the mapping would be unique per _consuming_ task.<br style=""></div>
<div style=""><br style=""></div>
<div style="">Yeah - as usual - the more digging, the more discovered. I think each task in the group should simply have a unique, incremental ID assigned. For cartesians that would mean that we have to agree on the order in which the inputs are taken into account when calculating the index (but they could simply be sorted alphabetically). And it also means that we have to "fix it" at a particular DAG run's "mapping evaluation" time (but, as I understand it, that's precisely what the TaskMapping table will do).<br style=""></div>
<div style=""><br style=""></div>
<div style="">J.<br style=""></div>
</blockquote></div><div style=""><br style=""></div>
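To make the indexing idea from the thread concrete, here is a minimal sketch of how an incremental index could be assigned to the cartesian product of two or more mapped lists, with the input names sorted alphabetically so the index does not depend on the order the arguments were written in. This is only an illustration under stated assumptions: `expand_mapping` and `NOT_MAPPED` are hypothetical names, not part of the AIP or of Airflow's API, and the real design stores the evaluated mapping in a table instead of computing it in memory.

```python
from itertools import product
from typing import Any, Dict, List, Sequence

# Sentinel discussed in the thread: primary key columns cannot be NULL,
# so an unmapped task instance would carry -1. (Hypothetical constant name.)
NOT_MAPPED = -1


def expand_mapping(**mapped: Sequence[Any]) -> List[Dict[str, Any]]:
    """Return one kwargs dict per mapped task (hypothetical helper).

    The position of each dict in the returned list plays the role of the
    incremental mapping index. Argument names are sorted alphabetically
    first, so the same inputs always produce the same index assignment.
    """
    names = sorted(mapped)
    combos = product(*(mapped[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Keyword order differs, but the resulting index assignment is identical.
rows = expand_mapping(y=["a", "b"], x=[1, 2, 3])
same_rows = expand_mapping(x=[1, 2, 3], y=["a", "b"])

for mapping_index, kwargs in enumerate(rows):
    print(mapping_index, kwargs)
```

Because the names are sorted before the product is taken, `x` varies slowest and `y` fastest here, whichever way the call was written. Fixing that list once per DAG run is the role the thread assigns to the TaskMapping table.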
