<div style="font-size: 15px;"><div style="">Hey everyone!<br style=""><br style=""></div></div><div style="font-size: 15px;"><div style="">I wanted to share a few additions and changes we've made, and to open the thread for any remaining comments or questions before we go to a vote:<br style=""></div><div style=""><br style=""></div><div style="">Following&nbsp;<b style="">@Jarek Potiuk's&nbsp;</b>comments, we have changed the mapping_id to an integer so that the key stays within MySQL's key size limits. We have also designed a system that builds the cartesian product when a user passes two or more lists for processing.<br style=""><br style="">Please take a look if you have time; we will start the vote tomorrow if no one else has comments or questions.<br style=""><br style="">Happy Airflowing!<br style="">Daniel</div><div style=""><br></div></div><div id="cm_signature" style=""></div><div style=""><br style=""></div><div style=""><br style=""></div><div id="cm_quote_div" style="display: block;"><div style="">On Wed, Nov 10, 2021 at 3:32am, Jarek Potiuk &lt;<a href="mailto:ja...@potiuk.com" style="">ja...@potiuk.com</a>&gt; wrote:<br style=""></div><blockquote style="margin: 0px; border-left: 1px solid rgb(214, 214, 214); padding-left: 10px;"><div style="">&gt; So yes, you can use a generator literal, but it will be evaluated to completion at DAG parse time.<br style=""></div><div style=""><br style=""></div><div style="">Makes perfect sense.<br style=""></div><div style=""><br style=""></div><div style="">&gt; 2) For the UI part - I think we should consider "filtering" a first-class citizen from day one. 
Filtering for "failed" tasks seems like a super-useful feature for operations people.<br style=""></div><div style="">&gt;<br style=""></div><div style="">&gt;<br style=""></div><div style="">&gt; Yes, Brent and I have thought about that -- it's somewhat orthogonal to this AIP in that the AIP doesn't depend on filtering in the UI, and there are cases where filtering would already be useful. But we will look at doing it as part of this project.<br style=""></div><div style=""><br style=""></div><div style="">Cool!<br style=""></div><div style=""><br style=""></div><div style="">&gt; Roughly what I'm proposing is the PK on the task_instance table becomes something like (dag_run_id, task_id, mapping_index) where mapping_index is now an array index into a JSON value in a row in the new task_mapping table. And since primary key columns can't be nullable, and arrays are zero-indexed, we'd have to use -1 as the "not mapped" value.<br style=""></div><div style=""><br style=""></div><div style="">&gt; Does that make any sense?<br style=""></div><div style=""><br style=""></div><div style="">Absolutely. That was very much what I also had in mind. I think<br style=""></div><div style="">indexing with an auto-incremented id and keeping a table with the index<br style=""></div><div style="">array makes way more sense to be honest.<br style=""></div><div style=""><br style=""></div><div style="">It also makes it possible to implement a (potentially interesting)<br style=""></div><div style="">case where mapping key values are actually duplicated without<br style=""></div><div style="">adding any artificial indexing column. I am not sure how practical the<br style=""></div><div style="">case is (though I believe it is quite a common case), but there is a<br style=""></div><div style="">range of tasks that might have different output even if they have the<br style=""></div><div style="">same input. 
Any kind of randomisation might mean, for example, that<br style=""></div><div style="">the same learning task with exactly the same parameters leads to a<br style=""></div><div style="">different output. And you might want to run N such identical tasks<br style=""></div><div style="">and average, or otherwise aggregate, the results of those N<br style=""></div><div style="">tasks with identical input.<br style=""></div><div style=""><br style=""></div><div style="">It was also possible with the original design by adding an extra index<br style=""></div><div style="">field to the JSON to make it unique, but it was a bit "not clean"<br style=""></div><div style="">in the sense that it made the input "identic-ish". With the design<br style=""></div><div style="">where we keep TaskMapping and index it with the index in the table, we<br style=""></div><div style="">have a much "cleaner" solution for this. You can clearly see which<br style=""></div><div style="">of those tasks had identical input by simply comparing the JSON values.<br style=""></div><div style="">Overall I think having a unique index to handle this case is generally<br style=""></div><div style="">better design - even if we could index JSONB columns in MySQL.<br style=""></div><div style=""><br style=""></div><div style="">&gt; There's still a bit more info to work out, such as how we find the right row in the task_mapping table, as it is associated with the _parent_ task, not the mapped task itself.<br style=""></div><div style="">Yep.<br style=""></div><div style=""><br style=""></div><div style="">&gt; Oh, it gets a bit more complicated actually! 
It is possible to do `task.map(x=list1, y=list2)` to map over two lists at once (making a cartesian product) and in this case the mapping would be unique per _consuming_ task.<br style=""></div><div style=""><br style=""></div><div style="">Yeah - as usual - the more digging, the more discovered. I think<br style=""></div><div style="">each task in the group should simply have a unique, incremental ID<br style=""></div><div style="">assigned. For cartesians that would mean that we have to agree on the<br style=""></div><div style="">sequence of the inputs to take into account to calculate the index<br style=""></div><div style="">(but they could simply be alphabetically sorted). And it also means<br style=""></div><div style="">that we have to "fix it" at a particular dag run's "mapping evaluation"<br style=""></div><div style="">time (but that's precisely what the TaskMapping table will do, as I<br style=""></div><div style="">understand it).<br style=""></div><div style=""><br style=""></div><div style="">J.<br style=""></div></blockquote></div><div style=""><br style=""></div>
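<div style="">As a rough illustration of the indexing idea discussed in the thread, here is a minimal Python sketch that assigns an incremental integer index to each element of a cartesian product, with the argument names sorted alphabetically so the order is deterministic. The names <code>expand_mapping</code> and <code>NOT_MAPPED</code> are hypothetical and only for illustration; this is not Airflow's API and not necessarily how the AIP will be implemented.<br style=""></div>

```python
from itertools import product

# Sentinel from the thread: primary key columns cannot be NULL and arrays are
# zero-indexed, so -1 stands for "not mapped". The name is hypothetical.
NOT_MAPPED = -1


def expand_mapping(**mapped_kwargs):
    """Return (mapping_index, kwargs) pairs for the cartesian product.

    Hypothetical helper, not part of Airflow.
    """
    if not mapped_kwargs:
        # A task with nothing to map over gets the "not mapped" index.
        return [(NOT_MAPPED, {})]

    # Sort argument names alphabetically so every evaluation agrees on the
    # sequence of inputs used to calculate the index.
    names = sorted(mapped_kwargs)

    # Materialise each input; a generator would be consumed completely here.
    values = [list(mapped_kwargs[name]) for name in names]

    # Enumerate the product so each combination gets an incremental index.
    return [
        (index, dict(zip(names, combo)))
        for index, combo in enumerate(product(*values))
    ]


rows = expand_mapping(y=["a", "b"], x=[1, 2])
for mapping_index, kwargs in rows:
    print(mapping_index, kwargs)
```

<div style="">Because the index is just a position in the enumerated product, two combinations with identical input values still receive distinct indexes, which matches the "N identical tasks" case raised above.<br style=""></div>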
