Someone pointed me at "behavior testing" for Python using the behave package:
https://pythonhosted.org/behave/tutorial.html

Basically, you specify test cases in natural language, corresponding to agile
user stories. It's as close to TDD as you can get. In more elaborate scenarios,
you can mix small datasets, return values etc. into the test case description
itself, so it reads like a user story annotated with data examples:

  Scenario: some scenario
    Given a set of specific users
      | name      | department  |
      | Barry     | Beer Cans   |
      | Pudey     | Silly Walks |
      | Two-Lumps | Silly Walks |
    When we count the number of people in each department
    Then we will find two people in "Silly Walks"
    But we will find one person in "Beer Cans"

I'm wondering whether behavior testing could be suitable for testing workflows.
In the considerations, I've talked about files that an instrumented hook in a
test flow could pick up. Maybe wiring them into the text of the test case, a la
this behavior testing, can make things a lot more readable.

Does anyone have experience with behavior testing and can shine a bright light
on how this works out in practice, and if/how it contributes to the bottom line
of higher quality?

Rgds,

Gerard

On Wed, May 10, 2017 at 8:27 AM, Gerard Toonstra <[email protected]> wrote:

> Hi Laura,
>
> Yes, testing hooks and operators is about the basic behavior of those, so
> you look for infrastructural issues. The idea is to have sound, robust
> components as a result of that testing, which behave as you'd predict in
> all circumstances. This would also consider issues like returning values
> from operators that end up in xcom values, for example, ways operators
> interact with other operators, errors in connecting to systems, errors
> raised from operators, etc. I'm not 100% sure about the scope yet, i.e.
> whether it's needed to physically connect to other systems, because you
> could claim it's the responsibility of the library you're using to make
> sure it can do this.
>
> For testing workflows, I think it suffices to instrument hooks to return
> data from flat files or throw exceptions. So it's a test script of some
> kind that decides and tests what happens in specific situations of data
> availability / exceptions / system availability. Inspecting the database
> can help with that. It really helps to reduce the dependency of a CI suite
> on your other infrastructure, because if one system is down, you may
> potentially not be able to deploy software for hotfixes, etc. Even more
> importantly, the effort to load data into all your systems to prepare for
> workflow testing is quite large, and simpler ways to pass data around
> significantly reduce that testing effort. If testing is "too difficult"
> and takes too much effort, you'll see cases where it's bypassed because of
> delivery pressure. So the idea is to make it as easy as possible. Maybe it
> can also check the queries that operators send to other systems, to
> confirm the Jinja templating works ok.
> (Eventually, many test suites around the world set up like this
> contribute to the quality of airflow itself.)
>
> For business testing, I'm suggesting making it part of the actual DAG you
> run in production and running these checks on a daily basis. So it's not a
> test suite you run when everything has finished; it's a check after each
> significant operation to confirm your code did what you expected it to do.
> The definition of how to check for that is vague though, because this is
> highly contextual and you don't necessarily know if your code runs until
> you deploy. So it may be best to recognize another level here:
>
> 1. Ensure that your business code runs ("it compiles"). So there should be
> at least one test to confirm queries do run? Unless there is sufficient
> abstraction within the operator such that it only deals with
> parametrization, in which case it may not be necessary.
>
> I strongly believe in being as rigorous as you can in input validation,
> and again, I think it's easier to run many cases using flat files rather
> than setting up target systems in such a state that you get the right
> check done; so it's mostly about effort again. 3rd party APIs don't even
> allow you to set these up, unless it's been done by some excellent
> developers with special IDs. But yeah, 3rd party APIs always give
> surprises and that has to be resolved as you go along.
>
> 2. The rest is just running final checks in production daily in your
> regular DAG. Assuming that the underlying infrastructure/platform code is
> ok, your components are robust, you did all the input validation checks
> that are necessary, and your code compiles, you're mostly dealing with
> potentially really weird values (but not falling outside the current
> validation boundaries), data volumes, dropped records, etc., which you can
> correlate somehow with associated data or compare against history.
> Datadog for example offers a monitoring service where you can check with a
> SARIMA model whether the calculated avg margin/invoice has an expected
> value. BAs often use a number of checks to validate the quality of data
> for a given day before they begin, and I'm referring to such checks done
> automatically on a daily basis.
>
> Rgds,
>
> Gerard
>
>
> On Tue, May 9, 2017 at 9:46 PM, Arthur Wiedmer <[email protected]>
> wrote:
>
>> Hi,
>>
>> I would love to see if we can contribute some of the work we have done
>> internally at Airbnb to support some testing of DAGs. We have a long way
>> to go though :)
>>
>> Best,
>> Arthur
>>
>> On Tue, May 9, 2017 at 12:34 PM, Sam Elamin <[email protected]>
>> wrote:
>>
>> > Thanks Gerard and Laura, I have created an email thread as agreed in
>> > the call, so let's take the discussion there. If anyone else is
>> > interested in helping us build this library, please do get in touch!
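The daily "compare against history" check Gerard mentions for production DAGs
needs no vendor service to sketch. The metric name, numbers, and 3-sigma
threshold below are illustrative assumptions, not anything Datadog-specific
(Datadog's SARIMA-based anomaly detection is a more sophisticated version of
the same idea):

```python
# Sketch of a daily data-quality check: compare today's calculated value
# against recent history and fail the task if it deviates too far.
# Thresholds and the "avg margin/invoice" metric are illustrative.
from statistics import mean, stdev


def check_against_history(today_value, history, max_sigma=3.0):
    """True if today's value is within max_sigma std devs of history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today_value == mu
    return abs(today_value - mu) <= max_sigma * sigma


def validate_avg_margin(today_value, history):
    """Raise so the DAG task fails (and alerts) when the check trips.

    In a DAG this would run right after the aggregation step, e.g. as a
    PythonOperator callable.
    """
    if not check_against_history(today_value, history):
        raise ValueError(
            f"avg margin/invoice {today_value} deviates from history"
        )
```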
>> >
>> > On Tue, May 9, 2017 at 5:40 PM, Laura Lorenz <[email protected]>
>> > wrote:
>> >
>> > > Good points @Gerard. I think the distinctions you make between
>> > > different testing considerations could help us focus our efforts.
>> > > Here's my 2 cents on the buckets you describe; I'm wondering if any
>> > > of these use cases align with anyone else and can help narrow our
>> > > scope, and if I understood you right @Gerard:
>> > >
>> > > Regarding platform code: For our own platform code (i.e. custom
>> > > Operators and Hooks), we have our CI platform running unittests on
>> > > their construction and, in the case of hooks, integration tests on
>> > > connectivity. The latter involves us setting up test integration
>> > > services (i.e. a test MySQL process) which we start up as docker
>> > > containers, and we flip our airflow's configuration to point at them
>> > > during testing using environment variables. It seems from a browse of
>> > > airflow's tests that operators and hooks are mostly unittested, with
>> > > the integrations mocked or skipped (e.g.
>> > > https://github.com/apache/incubator-airflow/blob/master/tests/contrib/hooks/test_jira_hook.py#L40-L41
>> > > or
>> > > https://github.com/apache/incubator-airflow/blob/master/tests/contrib/hooks/test_sqoop_hook.py#L123-L125).
>> > > If the hook is using some other, well-tested library to actually
>> > > establish the connection, the case can probably be made that custom
>> > > operator and hook authors don't need integration tests; since the
>> > > normal unittest library is enough to handle these, that might not
>> > > need to be in scope for a new testing library.
>> > >
>> > > Regarding data manipulation functions of the business code:
>> > > For us, we run tests on each operator in each DAG on CI, seeded with
>> > > test input data, asserted against known output data, all of which we
>> > > have compiled over time to represent different edge cases we expect
>> > > or have seen. So this is a test at the level of the operator as
>> > > described in a given DAG. Because we only describe edge cases we have
>> > > seen or can predict, it's a very reactive way to handle testing at
>> > > this level.
>> > >
>> > > If I understand your idea right, another way to test (or at least,
>> > > surface errors) at this level is: given you have a DAG that is
>> > > resilient against arbitrary data failures, your DAG should include a
>> > > validation task/report at its end, or a test suite should run daily
>> > > against the production error log for that DAG that surfaces errors
>> > > your business code encountered on production data. I think this is
>> > > really interesting and reminds me of an airflow video I saw once
>> > > (can't remember who gave the talk) on a DAG whose last task
>> > > self-reported error counts and rows lost. If implemented as a test
>> > > suite you would run against production, this might be a direction we
>> > > would want a testing library to go in.
>> > >
>> > > Regarding the workflow correctness of the business code:
>> > > What we set out to do on our side was a hybrid version of your items
>> > > 1 and 2, which we call "end-to-end tests": to call a whole DAG
>> > > against 'real' existing systems (though really they are test docker
>> > > containers of the processes we need (MySQL and Neo4j specifically)
>> > > that we use environment variables to switch our airflow to use when
>> > > instantiating hooks etc.), seeded with test input files for services
>> > > that are hard to set up (i.e. third party APIs we ingest data from).
>> > > Since the whole DAG is seeded with known input data, this gives us a
>> > > way to compare the last output of a DAG to a known file, so that if
>> > > any workflow change OR business logic in the middle affected the
>> > > final output, we would know as part of our test suite instead of when
>> > > production breaks. In other words, a way to test a regression of the
>> > > whole DAG. So this is the framework we were thinking needed to be
>> > > created, and it is a direction we could go with a testing library as
>> > > well.
>> > >
>> > > This doesn't get to your point of determining which workflow was
>> > > used, which is interesting, just not a use case we have encountered
>> > > yet (we only have deterministic DAGs). In my mind, in this case we
>> > > would want a testing suite to be able to more or less turn some DAGs
>> > > "on" against seeded input data and mocked or test integration
>> > > services, let a scheduler go at it, and then check the metadata
>> > > database for what workflow happened (and, if we had test integration
>> > > services, maybe also check the output against the known output for
>> > > the seeded input). I can definitely see your suggestion of developing
>> > > instrumentation to inspect a followed workflow as a useful addition a
>> > > testing library could include.
>> > >
>> > > To some degree our end-to-end DAG tests overlap in our workflow with
>> > > your point 3 (UAT environment), but we've found that more useful for
>> > > testing whether "wild data" causes uncaught exceptions or any
>> > > integration errors with difficult-to-mock third party services, not
>> > > DAG-level logic regressions, since the input data is unknown and thus
>> > > we can't compare to a known output in this case, depending instead on
>> > > a fallible human QA or just accepting the DAG running with no
>> > > exceptions as passing UAT.
>> > >
>> > > Laura
>> > >
>> > > On Tue, May 9, 2017 at 2:15 AM, Gerard Toonstra <[email protected]>
>> > > wrote:
>> > >
>> > > > Very interesting video. I was unable to take part. I watched only
>> > > > part of it for now. Let us know where the discussion is being moved
>> > > > to.
>> > > >
>> > > > The confluence does indeed seem to be the place to put final
>> > > > conclusions and thoughts.
>> > > >
>> > > > For airflow, I like to make a distinction between "platform" and
>> > > > "business" code. The platform code is the hooks and operators, and
>> > > > it provides the capabilities of what your ETL system can do. You'll
>> > > > test this code with a lot of thoroughness, such that each component
>> > > > behaves how you'd expect, judging from the constructor interface.
>> > > > Any abstractions in there (like copying files to GCS) should be
>> > > > kept as hidden as possible (retries, etc).
>> > > >
>> > > > The "business" code is what runs on a daily basis. This can be
>> > > > divided into another two concerns for testing:
>> > > >
>> > > > 1. The workflow: the code between the data manipulation functions
>> > > > that decides which operators get called.
>> > > > 2. The data manipulation functions.
>> > > >
>> > > >
>> > > > I think it's good practice to run tests on "2" on a daily basis and
>> > > > not just once on CI. The reason is that there are too many
>> > > > unforeseen circumstances where data can get into a bad state. So
>> > > > such tests shouldn't run once in a highly controlled environment
>> > > > like CI, but run daily in a less predictable environment like
>> > > > production, where all kinds of weird things can happen, which
>> > > > you'll be able to catch with proper checks in place.
>> > > > Even if the checks are too rigorous, you can skip them and improve
>> > > > on them, so that they fit what goes on in your environment to the
>> > > > best of your ability.
>> > > >
>> > > >
>> > > > Which mostly leaves testing workflow correctness and platform code.
>> > > > What I had intended to do was:
>> > > >
>> > > > 1. Test the platform code against real existing systems (or maybe
>> > > > docker containers), to test their behavior in success and failure
>> > > > conditions.
>> > > > 2. Create workflow scripts for testing the workflow; this probably
>> > > > requires some specific changes in hooks, which wouldn't call out to
>> > > > other systems, but would just pick up small files you prepare from
>> > > > a testing repo and pass them around. The test script could also
>> > > > simulate unavailability, etc.
>> > > > This relieves you of a huge responsibility of setting up systems
>> > > > and docker containers and loading them with data. Airflow sets up
>> > > > pretty quickly as a docker container and you can also start up a
>> > > > sample database with that. Afterwards, from a test script, you can
>> > > > check which workflow was followed by inspecting the database, so
>> > > > develop some instrumentation for that.
>> > > > 3. Test the data manipulation in a UAT environment, mirroring the
>> > > > runs in production to some extent. That would be a place to verify
>> > > > that the data comes out correctly and also show people what kind of
>> > > > monitoring is in place to double-check that.
>> > > >
>> > > >
>> > > > On Tue, May 9, 2017 at 1:14 AM, Arnie Salazar <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Scratch that. I see the whole video now.
>> > > > >
>> > > > > On Mon, May 8, 2017 at 3:33 PM Arnie Salazar <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > Thanks Sam!
>> > > > > >
>> > > > > > Is there a part 2 to the video?
>> > > > > > If not, can you post the "next steps" notes you took whenever
>> > > > > > you have a chance?
>> > > > > >
>> > > > > > Cheers,
>> > > > > > Arnie
>> > > > > >
>> > > > > > On Mon, May 8, 2017 at 3:08 PM Sam Elamin <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Hi Folks
>> > > > > >>
>> > > > > >> For those of you who missed it, you can catch the discussion
>> > > > > >> from the link on this tweet
>> > > > > >> <https://twitter.com/samelamin/status/861703888298225670>
>> > > > > >>
>> > > > > >> Please do share and feel free to get involved, as the more
>> > > > > >> feedback we get, the better the library we create will be :)
>> > > > > >>
>> > > > > >> Regards
>> > > > > >> Sam
>> > > > > >>
>> > > > > >> On Mon, May 8, 2017 at 9:43 PM, Sam Elamin <[email protected]>
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >> > A bit late notice, but the call is happening today at 9:15
>> > > > > >> > UTC, so in about 30 mins or so.
>> > > > > >> >
>> > > > > >> > It will be recorded, but if anyone would like to join in on
>> > > > > >> > the discussion, the hangout link is
>> > > > > >> > https://hangouts.google.com/hangouts/_/mbkr6xassnahjjonpuvrirxbnae
>> > > > > >> >
>> > > > > >> > Regards
>> > > > > >> > Sam
>> > > > > >> >
>> > > > > >> > On Fri, 5 May 2017 at 21:35, Ali Uz <[email protected]> wrote:
>> > > > > >> >
>> > > > > >> >> I am also very interested in seeing how this turns out.
>> > > > > >> >> Even though we don't have a testing framework in place on
>> > > > > >> >> the project I am working on, I would very much like to
>> > > > > >> >> contribute to some general framework for testing DAGs.
>> > > > > >> >>
>> > > > > >> >> As of now we are just implementing dummy tasks that test
>> > > > > >> >> our actual tasks and verify that the given input produces
>> > > > > >> >> the expected output. Nothing crazy and certainly not
>> > > > > >> >> flexible in the long run.
>> > > > > >> >>
>> > > > > >> >>
>> > > > > >> >> On Fri, 5 May 2017 at 22:59, Sam Elamin <[email protected]>
>> > > > > >> >> wrote:
>> > > > > >> >>
>> > > > > >> >> > Haha yes Scott, you are in!
>> > > > > >> >> > On Fri, 5 May 2017 at 20:07, Scott Halgrim <[email protected]>
>> > > > > >> >> > wrote:
>> > > > > >> >> >
>> > > > > >> >> > > Sounds A+ to me. By "both of you" did you include me?
>> > > > > >> >> > > My first response was just to your email address.
>> > > > > >> >> > >
>> > > > > >> >> > > On May 5, 2017, 11:58 AM -0700, Sam Elamin <[email protected]>,
>> > > > > >> >> > > wrote:
>> > > > > >> >> > > > Ok sounds great folks
>> > > > > >> >> > > >
>> > > > > >> >> > > > Thanks for the detailed response Laura! I'll invite
>> > > > > >> >> > > > both of you to the group if you are happy, and we can
>> > > > > >> >> > > > schedule a call for next week?
>> > > > > >> >> > > >
>> > > > > >> >> > > > How does that sound?
>> > > > > >> >> > > > On Fri, 5 May 2017 at 17:41, Laura Lorenz <[email protected]>
>> > > > > >> >> > > wrote:
>> > > > > >> >> > > >
>> > > > > >> >> > > > > We do! We developed our own little in-house DAG
>> > > > > >> >> > > > > test framework, which we could share insights on /
>> > > > > >> >> > > > > would love to hear what other folks are up to.
>> > > > > >> >> > > > > Basically we mock a DAG's input data, use the
>> > > > > >> >> > > > > BackfillJob API directly to call a DAG in a test,
>> > > > > >> >> > > > > and compare its outputs to the intended result
>> > > > > >> >> > > > > given the inputs. We use docker/docker-compose to
>> > > > > >> >> > > > > manage services, and split our dev and test stack
>> > > > > >> >> > > > > locally so that the tests have their own scheduler
>> > > > > >> >> > > > > and metadata database, and so that our CI tool
>> > > > > >> >> > > > > knows how to construct the test stack as well.
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > We co-opted the BackfillJob API for our own
>> > > > > >> >> > > > > purposes here, but it seemed overly complicated and
>> > > > > >> >> > > > > fragile to start and interact with our own
>> > > > > >> >> > > > > in-test-process executor like we saw in a few of
>> > > > > >> >> > > > > the tests in the Airflow test suite. So I'd be
>> > > > > >> >> > > > > really interested in finding a way to streamline
>> > > > > >> >> > > > > how to describe a test executor for both the
>> > > > > >> >> > > > > Airflow test suite and people's own DAG testing,
>> > > > > >> >> > > > > and make that a first-class type of API.
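On the "check what the test run actually did" idea that recurs in this thread
(Gerard's database inspection, Laura's metadata-database check after a
backfill), the verification step is straightforward SQL. A sketch, with an
in-memory sqlite database standing in for the Airflow metadata DB and a
`task_instance` table modeled on Airflow's (dag_id, task_id, state columns);
the helper names are hypothetical:

```python
# Sketch: after a test backfill, inspect the metadata database to see
# which tasks ran and in what state. sqlite stands in for the real
# metadata DB; task_instance is modeled on Airflow's table of task runs.
import sqlite3


def tasks_by_state(conn, dag_id):
    """Map task_id -> state for one DAG, as recorded in task_instance."""
    rows = conn.execute(
        "SELECT task_id, state FROM task_instance WHERE dag_id = ?",
        (dag_id,),
    )
    return dict(rows.fetchall())


def assert_workflow_followed(conn, dag_id, expected_success):
    """Check that exactly the expected tasks of the DAG succeeded."""
    states = tasks_by_state(conn, dag_id)
    ran = {task for task, state in states.items() if state == "success"}
    assert ran == set(expected_success), f"workflow diverged: ran {sorted(ran)}"
```

For non-deterministic DAGs (branching), the same check expresses "which
workflow was followed" by listing the branch you expected to succeed.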
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > Laura
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > On Fri, May 5, 2017 at 11:46 AM, Sam Elamin <[email protected]>
>> > > > > >> >> > > > > wrote:
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > > Hi All
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > A few people in the Spark community are
>> > > > > >> >> > > > > > interested in writing a testing library for
>> > > > > >> >> > > > > > Airflow. We would love anyone who uses Airflow
>> > > > > >> >> > > > > > heavily in production to be involved.
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > At the moment (AFAIK) testing your DAGs is a bit
>> > > > > >> >> > > > > > of a pain, especially if you want to run them in
>> > > > > >> >> > > > > > a CI server.
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > Is anyone interested in being involved in the
>> > > > > >> >> > > > > > discussion?
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > Kind Regards
>> > > > > >> >> > > > > > Sam
