Looks great! made my small comments in-line, but my main concerns are already addressed !
On Wed, Nov 17, 2021 at 9:42 PM Daniel Imberman <daniel.imber...@gmail.com> wrote: > > Hey everyone! > > Wanted to update a few additions/changes we've made and open up the thread > for any remaining comments/questions before we go to a vote: > > Considering @Jarek Potiuk's comments, we have changed the mapping_id to an > integer to ensure that we are within the range of MySQL's key size limits. We > have also designed a system that can do cartesian product joining when a user > gives two or more lists for processing. > > Please take a chance to look this over if you have any time, we will start > the vote tomorrow if no one else has comments or questions. > > Happy Airflowing! > Daniel > > > > On Wed, Nov 10, 2021 at 3:32am, Jarek Potiuk <ja...@potiuk.com> wrote: > > > So yes, you can use a generator literal, but it will be evaluated to > > completion at DAG parse time. > > Makes perfect sense. > > > 2) For the UI part - I think we should consider "filtering" from day one > > and first-class citizens. Filtering for "failed" tasks seems like a > > super-useful feature for operations people. > > > > > > Yes, Brent and I have thought about that -- it's somewhat orthogonal to > > this AIP in that the AIP doesn't depend filtering in the UI, and there are > > cases where filtering would already be useful. But we will look at doing it > > as part of this project. > > Cool! > > > Roughly what I'm proposing is the PK on the task_instance table becomes > > something like (dag_run_id, task_id, mapping_index) where mapping_index is > > now an array index into a JSON value in a row in the new task_mapping > > table. And since primary key columns can't be nullable, and arrays are zero > > indexed, we'd have to use -1 as "not mapped" value. > > > Does that make any sense? > > Absolutely. That was very much what I also had in mind. I think > indexing with auto-incremented id and keeping a table with the index > array makes way more sense to be honest. > > It also makes it possible to implement a (potentially interesting) > case where mapping key values will actually be multiplicated without > adding any artificial indexing column. I am not sure how practical the > case is (though I believe it is quite a common case), but there is a > range of tasks that might have different output even if they have the > same input. Any kind of randomisation might cause for example that > the same learning task with exactly the same parameters will lead to a > different output. And you might want to run N of such identical tasks > and average, or somehow differently aggregate the result of those N > tasks with identical input. > > It was also possible with the original design by adding an extra index > field to JSON which would make it unique, but it was a bit "not clean" > in the sense that it made the input "identic-ish". With the design > where we keep TaskMapping and index it with the index in the table, we > have a much more "clean" solution for this. You can clearly see which > of those tasks had identical input by simply comparing the JSONS. > Overall I think having a unique index to handle this case is generally > better design - even if we could index JSONB columns in mysql. > > > There's still a bit more info to work out, such as how we find the right > > row in task_mapping table, as it is associated with the _parent_ task, not > > the mapped task itself. > Yep. > > > Oh, it gets a bit more complicated actually! It is possible to do ` > > task.map(x=list1, y=list2)` to map over two lists at once (making a > > cartesian product) and in this case the mapping would be unique per > > _consuming_ tasks. > > Yeah - as usual - the more digging, the more discovered. I think > simply each task in the group should have a unique, incremental ID > assigned. For cartesians that would mean that we have to agree on the > sequence of the inputs to take into account to calculate the index > (but they simply could be alphabetically sorted). And it also means > that we have to "fix it" at particular dag run "mapping evaluation" > time (but that's precisely what TaskMapping table will do I > understand). > > J. > >