Intro
=====

This is a more technical follow-up to the points given in a previous
thread.  Because that thread and the current N(ext) Runner documentation
form a good context for this proposal, I encourage everyone to read them
first:
https://www.redhat.com/archives/avocado-devel/2020-May/msg00009.html

https://avocado-framework.readthedocs.io/en/79.0/future/core/nrunner.html

The N(ext) Runner allows for greater flexibility than the current
runner, so to be effective in delivering the N(ext) Runner for general
usage, we must define the bare minimum that still needs to be
implemented.

Basic Job and Task execution
============================

A Task, within the context of the N(ext) Runner, is described as "one
specific instance/occurrence of the execution of a runnable with its
respective runner".  A Task is a very important building block for an
Avocado Job, and running an Avocado Job means, to a large extent,
running a number of Tasks.

The Tasks that need to be executed in a Job are created during the
``create_test_suite()`` phase:

https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.job.Job.create_test_suite

And are kept in the Job's ``test_suite`` attribute:

https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.job.Job.test_suite

Running the tests, then, happens during the ``run_tests()`` phase:

https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.job.Job.run_tests

During the ``run_tests()`` phase, a plugin that runs test suites on a
job is chosen, based on the ``run.test_runner`` configuration.  The
current "work in progress" implementation for the N(ext) Runner can be
activated by setting that configuration key to ``nrunner``, which can
easily be done on the command line too::

  avocado run --test-runner=nrunner /bin/true

A general rule for measuring the quality and completeness of the
``nrunner`` implementation is to run the same jobs with the current
runner, and compare its behavior and output with that of the
``nrunner``.  From here on, we'll call this simply the "nrunner plugin".
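As a toy illustration of that selection mechanism, the dispatch on the
``run.test_runner`` configuration key can be modeled as a plain mapping
from key values to runner callables.  The functions and mapping below
are hypothetical stand-ins for illustration only, not Avocado's actual
plugin dispatch code:

```python
# Hypothetical sketch: choosing a test runner implementation based on
# the "run.test_runner" configuration key.  The plugin names mirror the
# values used in the text ("runner" and "nrunner"); everything else is
# illustrative.

def run_with_current_runner(suite):
    return "current runner executed %d test(s)" % len(suite)

def run_with_nrunner(suite):
    return "nrunner plugin executed %d test(s)" % len(suite)

RUNNER_PLUGINS = {
    "runner": run_with_current_runner,
    "nrunner": run_with_nrunner,
}

def run_tests(config, suite):
    # the current runner remains the default when the key is not set
    plugin = RUNNER_PLUGINS[config.get("run.test_runner", "runner")]
    return plugin(suite)
```

With such a mapping, ``run_tests({"run.test_runner": "nrunner"}, suite)``
would pick the nrunner plugin, while an empty configuration would fall
back to the current runner.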
Known issues and limitations of the current implementation
==========================================================

Different Test IDs
------------------

When running the same tests with the current runner and with the
nrunner plugin, the Test IDs are different::

  $ avocado run --test-runner=runner --json=- -- /bin/true /bin/false /bin/uname | grep \"id\"
            "id": "1-/bin/true",
            "id": "2-/bin/false",
            "id": "3-/bin/uname",

  $ avocado run --test-runner=nrunner --json=- -- /bin/true /bin/false /bin/uname | grep \"id\"
            "id": "1-1-/bin/true",
            "id": "2-2-/bin/false",
            "id": "3-3-/bin/uname",

The goal is to make the IDs the same.

Inability to run Tasks other than exec, exec-test, python-unittest (and noop)
-----------------------------------------------------------------------------

The current implementation of the nrunner plugin is based on the fact
that Tasks are already present in the ``test_suite`` job attribute, and
that running Tasks can be (but shouldn't always be) a matter of
iterating over the result of their ``run()`` method.  This is part of
the actual code::

  for status in task.run():
      result_dispatcher.map_method('test_progress', False)
      statuses.append(status)

The problem here is that only the Python classes implemented in the
core ``avocado.core.nrunner`` module can be used, that is, those
registered at:

https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.nrunner.RUNNERS_REGISTRY_PYTHON_CLASS

The goal is to have all other Python classes that inherit from
``avocado.core.nrunner.BaseRunner`` available in such a registry.

Inability to run Tasks with Spawners
------------------------------------

While the ``avocado nrun`` command makes use of the Spawners, the
current implementation of the nrunner plugin described earlier calls a
Task's ``run()`` method directly, and clearly doesn't use spawners.

The goal here is to leverage spawners so that other isolation models
(or execution environments, depending on how you look at processes,
containers, etc.) are supported.
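To illustrate the registry goal described above, here is a minimal
sketch of a kind-to-class runner registry in which any
``BaseRunner``-like subclass can register itself.  The decorator, class
names, and lookup helper are assumptions made for illustration; the
real structure is the ``RUNNERS_REGISTRY_PYTHON_CLASS`` linked earlier:

```python
# Sketch of a registry mapping task kinds to runner classes, so that
# any runner subclass (not only the ones in a core module) can be
# found.  All names here are illustrative, not Avocado's actual API.

RUNNERS_REGISTRY = {}

def register_runner(kind):
    """Class decorator registering a runner class under a task kind."""
    def wrapper(klass):
        RUNNERS_REGISTRY[kind] = klass
        return klass
    return wrapper

class BaseRunner:
    def __init__(self, runnable):
        self.runnable = runnable

    def run(self):
        raise NotImplementedError

@register_runner("noop")
class NoOpRunner(BaseRunner):
    """A runner that does nothing and reports a passing result."""
    def run(self):
        yield {"status": "finished", "result": "pass"}

def runner_for(kind, runnable):
    """Look up a registered runner class and instantiate it."""
    return RUNNERS_REGISTRY[kind](runnable)
```

With this shape, third-party runner classes only need to register
themselves under their kind to become runnable, which is the gist of
the goal stated above.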
Unoptimized execution of Tasks (extra serialization/deserialization)
--------------------------------------------------------------------

At this time, the nrunner plugin runs a Task directly through its
``run()`` method.  Besides the earlier point of not supporting other
isolation models/execution environments (that means not using
spawners), there's an extra layer of work happening when running a task
which is most often not necessary: turning a Task instance into a
command line, and within its execution, turning it into a Task instance
again.

The goal is to support an optimized execution of the tasks, without
having to turn them into command lines, and back into Task instances.
The idea is already present in the spawning method definitions:

https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.spawners.html#avocado.core.spawners.common.SpawnMethod.PYTHON_CLASS

And a PoC on top of the ``nrun`` command was implemented here:

https://github.com/avocado-framework/avocado/pull/3766/commits/ae57ee78df7f2935e40394cdfc72a34b458cdcef

Proposal
========

Besides the known limitations listed previously, there are others that
will appear along the way, and certainly some new challenges as we
solve them.  The goal of this proposal is to attempt to identify those
challenges, and lay out a plan that can be tackled by the Avocado
team/community, and not by a single person.

Task execution coordination goals
---------------------------------

As stated earlier, to run a job, tasks must be executed.  Differently
from the current runner, the N(ext) Runner architecture allows those to
be executed in a much more decoupled way.  This characteristic will be
maintained, but it needs to be adapted into the current Job execution.

From a high level view, the nrunner plugin needs to:

1. Break apart from the "one at a time" Task execution model that it
   currently employs;

2. Check if a Task can be executed, that is, if its requirements can be
   fulfilled (the most basic requirement for a task is a matching
   runner);

3. Prepare for the execution of a task, such as the fulfillment of
   extra task requirements.  The requirements resolver is one
   component, if not the only one, that should be given a chance to act
   here;

4. Execute a task in the prepared environment;

5. Monitor the execution of a task (from an external PoV);

6. Collect the status messages that tasks will send;

   a. Forward the status messages to the appropriate job components,
      such as the result plugins.

   b. Depending on the content of the messages, such as the ones
      containing "status: started" or "status: finished", interfere in
      the Task execution status, and consequently, in the Job execution
      status.

7. Verify, warn the user about, and attempt to clean up stray tasks.
   This may be necessary, for instance, if a Task on a container seems
   to be stuck and the container can not be destroyed.  The same
   applies to processes in some kind of uninterruptible sleep.

Parallelization
---------------

Because the N(ext) Runner features allow for parallel execution of
tasks, all other aspects of task execution coordination (fulfilling
requirements, collecting results, etc.) should not block each other.

There are a number of strategies for concurrent programming in Python
these days, and the ``avocado nrun`` command currently makes use of
asyncio to have coroutines that spawn tasks and collect results
concurrently (in a cooperative model).  The actual language or library
features used are, IMO, less important than the end result.

Suggested terminology
---------------------

Task execution has been requested
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A Task whose execution was requested by the user.  All of the tasks on
a Job's ``test_suite`` attribute are requested tasks.
If a software component deals with this type of task, it's advisable
that it refers to ``TASK_REQUESTED`` or ``requested_tasks`` or a
similar name that links to this definition.

Task is being triaged
~~~~~~~~~~~~~~~~~~~~~

The details of the task are being analyzed, including, and most
importantly, the ability of the system to *attempt* to fulfill its
requirements.  A task leaves triage and is either considered
"discarded" or proceeds to be prepared and then executed.

If a software component deals with this type of task, for instance if a
"task scheduler" is looking for runners matching the Task's kind, it
should keep it under a ``tasks_under_triage`` structure, or mark the
tasks as ``UNDER_TRIAGE`` or ``TRIAGING`` or a similar name that links
to this definition.

Task is being prepared
~~~~~~~~~~~~~~~~~~~~~~

Task has left triage and has not been discarded, that is, it's a
candidate to be set up and, if that goes well, executed.  The
requirements for a task are being prepared in its respective isolation
model/execution environment, that is, the spawner it'll be executed
with is known, and the setup actions will be visible to the task.

If a software component deals with this type of task, for instance the
implementation of the resolution of specific requirements, it should
keep it under a ``tasks_preparing`` structure or mark the tasks as
``PREPARING`` or a similar name that links to this definition.

Task is ready to be started
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Task has been prepared successfully, and can now be executed.

If a software component deals with this type of task, it should keep it
under a ``tasks_ready`` structure or mark the tasks as ``READY`` or a
similar name that links to this definition.

Task is being started
~~~~~~~~~~~~~~~~~~~~~

A hopefully short lived state, in which a task that is ready to be
started (see previous point) will be given to the respective spawner to
be started.
If a software component deals with this type of task, it should keep it
under a ``tasks_starting`` structure or mark the tasks as ``STARTING``
or a similar name that links to this definition.

The spawner should know if the starting of the task succeeded or
failed, and the task should be categorized accordingly.

Task has been started
~~~~~~~~~~~~~~~~~~~~~

A task was successfully started by a spawner.  Note that it does *not*
mean that the test that the task runner (say, an "avocado-runner-$kind
task-run" command) will run has already been started.  That will be
signalled by a "status: started" kind of message.

If a software component deals with this type of task, it should keep it
under a ``tasks_started`` structure or mark the tasks as ``STARTED`` or
a similar name that links to this definition.

Task has failed to start
~~~~~~~~~~~~~~~~~~~~~~~~

Quite self-explanatory.  If the spawner failed to start a task, it
should be kept under a ``tasks_failed_to_start`` structure or be marked
as ``FAILED_TO_START`` or a similar name that links to this definition.

Task is finished
~~~~~~~~~~~~~~~~

This means that the task has started, and is now finished.  There's no
associated meaning here about the pass/fail output of the test payload
executed by the task.  It should be kept under a ``tasks_finished``
structure or be marked as ``FINISHED`` or a similar name that links to
this definition.

Task has been interrupted
~~~~~~~~~~~~~~~~~~~~~~~~~

This means that the task has started, but has not finished and is past
due.  It should be kept under a ``tasks_interrupted`` structure or be
marked as ``INTERRUPTED`` or a similar name that links to this
definition.

Task workflow
-------------

A task will usually be created from a Runnable.  A Runnable will, in
turn, almost always be created as part of the ``avocado.core.resolver``
module.
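Before walking through an example, note that the states suggested above
map naturally onto an enumeration with a small set of valid
transitions.  The following is only a sketch of that terminology, not
existing Avocado code:

```python
# Sketch: the suggested task states as an Enum, with the transitions
# implied by the definitions above.  Names and structure are
# illustrative, not Avocado's implementation.
import enum

class TaskState(enum.Enum):
    REQUESTED = "requested"
    TRIAGING = "triaging"
    PREPARING = "preparing"
    READY = "ready"
    STARTING = "starting"
    STARTED = "started"
    FAILED_TO_START = "failed_to_start"
    FINISHED = "finished"
    INTERRUPTED = "interrupted"

# Allowed transitions (a triaged task may also be discarded entirely)
TRANSITIONS = {
    TaskState.REQUESTED: {TaskState.TRIAGING},
    TaskState.TRIAGING: {TaskState.PREPARING, TaskState.READY},
    TaskState.PREPARING: {TaskState.READY},
    TaskState.READY: {TaskState.STARTING},
    TaskState.STARTING: {TaskState.STARTED, TaskState.FAILED_TO_START},
    TaskState.STARTED: {TaskState.FINISHED, TaskState.INTERRUPTED},
}

def advance(state, new_state):
    """Validate and perform a state transition."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError("invalid transition: %s -> %s" % (state, new_state))
    return new_state
```

A triaged task that needs no preparation may go straight to ``READY``,
which is why ``TRIAGING`` has two possible destinations in this sketch.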
Let's consider the following output of a resolution::

  +--------------------------------------+
  | ReferenceResolution #1               |
  +--------------------------------------+
  | Reference: test.py                   |
  | Result: SUCCESS                      |
  | +----------------------------------+ |
  | | Resolution #1 (Runnable):        | |
  | | - kind: python-unittest          | |
  | | - uri: test.py:Test.test_1       | |
  | | - requirements:                  | |
  | |   + file: mylib.py               | |
  | |   + package: gcc                 | |
  | |   + package: libc-devel          | |
  | +----------------------------------+ |
  | +----------------------------------+ |
  | | Resolution #2 (Runnable):        | |
  | | - kind: python-unittest          | |
  | | - uri: test.py:Test.test_2       | |
  | | - requirements:                  | |
  | |   + file: mylib.py               | |
  | +----------------------------------+ |
  +--------------------------------------+

The two Runnables here will be transformed into Tasks.  The process
usually includes adding an identification (I) and a status URI (II)::

  +----------------------------------+ +----------------------------------+
  | Resolution #1 (Runnable):        | | Resolution #2 (Runnable):        |
  | - kind: python-unittest          | | - kind: python-unittest          |
  | - uri: test.py:Test.test_1       | | - uri: test.py:Test.test_2       |
  | - requirements:                  | | - requirements:                  |
  |   + file: mylib.py               | |   + file: mylib.py               |
  |   + package: gcc                 | +----------------------------------+
  |   + package: libc-devel          |                  ||
  +----------------------------------+                  ||
                  ||                                    ||
                  ||                                    ||
                  \/                                    \/
  +----------------------------------+ +----------------------------------+
  | Task #1:                         | | Task #2:                         |
  | - id: 1-test.py:Test.test_1   (I)| | - id: 2-test.py:Test.test_2   (I)|
  | - kind: python-unittest          | | - kind: python-unittest          |
  | - uri: test.py:Test.test_1       | | - uri: test.py:Test.test_2       |
  | - requirements:                  | | - requirements:                  |
  |   + file: mylib.py               | |   + file: mylib.py               |
  |   + package: gcc                 | | - status uris:                   |
  |   + package: libc-devel          | |   + 127.0.0.1:8080           (II)|
  | - status uris:                   | +----------------------------------+
  |   + 127.0.0.1:8080           (II)|
  +----------------------------------+

In the end, a job will contain a ``test_suite`` with "Task #1" and
"Task #2", meaning that the execution of both tasks was requested by
the Job owner::

  +---------------------------------------------------------------------------+
  | REQUESTED                                                                 |
  +---------------------------------------------------------------------------+
  | +----------------------------------+ +----------------------------------+ |
  | | Task #1:                         | | Task #2:                         | |
  | | - id: 1-test.py:Test.test_1      | | - id: 2-test.py:Test.test_2      | |
  | | - kind: python-unittest          | | - kind: python-unittest          | |
  | | - uri: test.py:Test.test_1       | | - uri: test.py:Test.test_2       | |
  | | - requirements:                  | | - requirements:                  | |
  | |   + file: mylib.py               | |   + file: mylib.py               | |
  | |   + package: gcc                 | | - status uris:                   | |
  | |   + package: libc-devel          | |   + 127.0.0.1:8080               | |
  | | - status uris:                   | +----------------------------------+ |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

These tasks will now be triaged.  A suitable implementation will move
those tasks to a ``tasks_under_triage`` queue, mark them as
``UNDER_TRIAGE``, or use some other strategy to differentiate the tasks
at this stage::

  +---------------------------------------------------------------------------+
  | UNDER_TRIAGE                                                              |
  +---------------------------------------------------------------------------+
  | +----------------------------------+ +----------------------------------+ |
  | | Task #1:                         | | Task #2:                         | |
  | | - id: 1-test.py:Test.test_1      | | - id: 2-test.py:Test.test_2      | |
  | | - kind: python-unittest          | | - kind: python-unittest          | |
  | | - uri: test.py:Test.test_1       | | - uri: test.py:Test.test_2       | |
  | | - requirements:                  | | - requirements:                  | |
  | |   + file: mylib.py               | |   + file: mylib.py               | |
  | |   + package: gcc                 | | - status uris:                   | |
  | |   + package: libc-devel          | |   + 127.0.0.1:8080               | |
  | | - status uris:                   | +----------------------------------+ |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

Iteration I
~~~~~~~~~~~

Task #1 is selected on the first iteration, and it's found that:

1. A suitable runner for tasks of kind ``python-unittest`` exists;

2. The ``mylib.py`` requirement is already present in the current
   environment;

3. The ``gcc`` and ``libc-devel`` packages are not installed in the
   current environment;

4. The system is capable of *attempting* to fulfill "package" types of
   requirements.

Task #1 will then be prepared.  No further action is performed on the
first iteration, because no other relevant state exists (Task #2, the
only other requested task, has not progressed beyond its initial
stage)::

  +---------------------------------------------------------------------------+
  | UNDER_TRIAGE                                                              |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #2:                         |                                      |
  | | - id: 2-test.py:Test.test_2      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_2       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | PREPARING                                                                 |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #1:                         |                                      |
  | | - id: 1-test.py:Test.test_1      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_1       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | |   + package: gcc                 |                                      |
  | |   + package: libc-devel          |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

Iteration II
~~~~~~~~~~~~

On the second iteration, Task #2 is selected, and it's found that:

1. A suitable runner for tasks of kind ``python-unittest`` exists;

2. The ``mylib.py`` requirement is already present in the current
   environment.

Task #2 is now ready to be started.  Possibly concurrently, Task #1,
selected as the single entry being prepared, is having its requirements
prepared::

  +---------------------------------------------------------------------------+
  | UNDER_TRIAGE                                                              |
  +---------------------------------------------------------------------------+
  |                                                                           |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | READY                                                                     |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #2:                         |                                      |
  | | - id: 2-test.py:Test.test_2      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_2       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | PREPARING                                                                 |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #1:                         |                                      |
  | | - id: 1-test.py:Test.test_1      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_1       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | |   + package: gcc                 |                                      |
  | |   + package: libc-devel          |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

Iteration III
~~~~~~~~~~~~~

On the third iteration, there are no tasks left under triage, so the
action is now limited to the tasks being prepared and ready to be
started.
Supposing that the "status uri" 127.0.0.1:8080 was set by the job as
its internal status server, the status server must be started before
any task, to avoid any status message being lost.  At this stage,
Task #2 is started, and Task #1 is now ready::

  +---------------------------------------------------------------------------+
  | UNDER_TRIAGE                                                              |
  +---------------------------------------------------------------------------+
  |                                                                           |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | STARTED                                                                   |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #2:                         |                                      |
  | | - id: 2-test.py:Test.test_2      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_2       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | READY                                                                     |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #1:                         |                                      |
  | | - id: 1-test.py:Test.test_1      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_1       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | |   + package: gcc                 |                                      |
  | |   + package: libc-devel          |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | STATUS SERVER "127.0.0.1:8080"                                            |
  +---------------------------------------------------------------------------+
  | Status Messages: []                                                       |
  +---------------------------------------------------------------------------+

Iteration IV
~~~~~~~~~~~~

On the fourth iteration, Task #1 is started::

  +---------------------------------------------------------------------------+
  | STARTED                                                                   |
  +---------------------------------------------------------------------------+
  | +----------------------------------+ +----------------------------------+ |
  | | Task #1:                         | | Task #2:                         | |
  | | - id: 1-test.py:Test.test_1      | | - id: 2-test.py:Test.test_2      | |
  | | - kind: python-unittest          | | - kind: python-unittest          | |
  | | - uri: test.py:Test.test_1       | | - uri: test.py:Test.test_2       | |
  | | - requirements:                  | | - requirements:                  | |
  | |   + file: mylib.py               | |   + file: mylib.py               | |
  | |   + package: gcc                 | | - status uris:                   | |
  | |   + package: libc-devel          | |   + 127.0.0.1:8080               | |
  | | - status uris:                   | +----------------------------------+ |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | STATUS SERVER "127.0.0.1:8080"                                            |
  +---------------------------------------------------------------------------+
  | Status Messages:                                                          |
  | - {id: 2-test.py:Test.test_2, status: started}                            |
  +---------------------------------------------------------------------------+

Note: the ideal level of parallelization is still to be defined, that
is, it may be that triaging, preparing and starting tasks all run
concurrently.  An initial implementation that, on each iteration, looks
at all Task states and attempts to advance them further, blocking other
Tasks as little as possible, should be acceptable.
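The per-iteration behavior described in the note above could be
sketched as a function that keeps tasks in per-state buckets and tries
to advance each bucket by one step.  The bucket names follow the
suggested terminology; the predicates and overall structure are
illustrative assumptions, not Avocado's implementation:

```python
# Sketch: one scheduling iteration over per-state task buckets.  The
# predicates (needs_preparation, preparation_done, spawn) stand in for
# the real requirement checks and spawner calls.

def iterate(buckets, needs_preparation, preparation_done, spawn):
    """Advance every task by at most one state.

    Buckets are processed from the most advanced state backwards, so a
    task moves a single step per call, as in the iterations above.
    """
    # start: hand ready tasks over to the spawner
    for task in buckets["ready"][:]:
        buckets["ready"].remove(task)
        dest = "started" if spawn(task) else "failed_to_start"
        buckets[dest].append(task)
    # preparation: tasks whose setup completed become ready
    for task in buckets["preparing"][:]:
        if preparation_done(task):
            buckets["preparing"].remove(task)
            buckets["ready"].append(task)
    # triage: decide whether a task needs preparation or is ready
    for task in buckets["under_triage"][:]:
        buckets["under_triage"].remove(task)
        dest = "preparing" if needs_preparation(task) else "ready"
        buckets[dest].append(task)
```

Calling ``iterate()`` repeatedly on two tasks, one with extra package
requirements and one without, reproduces the UNDER_TRIAGE to
PREPARING/READY to STARTED progression of Iterations I through IV.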
Iteration V
~~~~~~~~~~~

On the fifth iteration, the spawner reports that Task #2 is not alive
anymore, and the status server has received a message about it (and
also a message about Task #1 having started)::

  +---------------------------------------------------------------------------+
  | STATUS SERVER "127.0.0.1:8080"                                            |
  +---------------------------------------------------------------------------+
  | Status Messages:                                                          |
  | - {id: 2-test.py:Test.test_2, status: started}                            |
  | - {id: 1-test.py:Test.test_1, status: started}                            |
  | - {id: 2-test.py:Test.test_2, status: finished, result: pass}             |
  +---------------------------------------------------------------------------+

Because of that, Task #2 is now considered ``FINISHED``::

  +---------------------------------------------------------------------------+
  | FINISHED                                                                  |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #2:                         |                                      |
  | | - id: 2-test.py:Test.test_2      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_2       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

And Task #1 is still a ``STARTED`` task::

  +---------------------------------------------------------------------------+
  | STARTED                                                                   |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #1:                         |                                      |
  | | - id: 1-test.py:Test.test_1      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_1       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | |   + package: gcc                 |                                      |
  | |   + package: libc-devel          |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

Final Iteration
~~~~~~~~~~~~~~~

After a number of iterations with no status changes, and because of a
timeout implementation at the job level, it's decided that Task #1 is
not to be waited on.  The spawner continues to report that Task #1 is
alive (from its PoV), but no further status message has been received.
Provided the spawner has support for that, it may attempt to clean up
the task (such as destroying a container or killing a process).  In the
end, we're left with::

  +---------------------------------------------------------------------------+
  | STATUS SERVER "127.0.0.1:8080"                                            |
  +---------------------------------------------------------------------------+
  | Status Messages:                                                          |
  | - {id: 2-test.py:Test.test_2, status: started}                            |
  | - {id: 1-test.py:Test.test_1, status: started}                            |
  | - {id: 2-test.py:Test.test_2, status: finished, result: pass}             |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | FINISHED                                                                  |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #2:                         |                                      |
  | | - id: 2-test.py:Test.test_2      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_2       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+

  +---------------------------------------------------------------------------+
  | INTERRUPTED                                                               |
  +---------------------------------------------------------------------------+
  | +----------------------------------+                                      |
  | | Task #1:                         |                                      |
  | | - id: 1-test.py:Test.test_1      |                                      |
  | | - kind: python-unittest          |                                      |
  | | - uri: test.py:Test.test_1       |                                      |
  | | - requirements:                  |                                      |
  | |   + file: mylib.py               |                                      |
  | |   + package: gcc                 |                                      |
  | |   + package: libc-devel          |                                      |
  | | - status uris:                   |                                      |
  | |   + 127.0.0.1:8080               |                                      |
  | +----------------------------------+                                      |
  +---------------------------------------------------------------------------+
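The timeout decision in the final iteration could be sketched as a
check over the started tasks, comparing the time of each task's last
received status message against a job-level timeout.  The function,
its arguments, and the bucket layout are hypothetical; only the
decision logic comes from the text:

```python
# Sketch: move started tasks with no recent status update to
# INTERRUPTED.  "now" can be injected for testing; by default a
# monotonic clock is used.
import time

def check_interruptions(buckets, last_status_time, timeout, now=None):
    """Mark past-due STARTED tasks as INTERRUPTED."""
    now = time.monotonic() if now is None else now
    for task_id in buckets["started"][:]:
        if now - last_status_time.get(task_id, now) > timeout:
            buckets["started"].remove(task_id)
            buckets["interrupted"].append(task_id)
            # here a capable spawner could also attempt cleanup, such
            # as destroying a container or killing a process
```

Note that the spawner still reporting the task as alive doesn't change
the decision: it's the absence of status messages within the timeout
that drives the interruption.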
Tallying results
~~~~~~~~~~~~~~~~

The nrunner plugin should be able to provide meaningful results to the
Job, and consequently to the user, based on the information available
at the final iteration.  Notice that some information, such as the
``PASS`` for the first test, will come from the "result" given in a
status message sent by the task itself.  Some other status, such as the
``INTERRUPTED`` status for the second test, will not come from a
received status message, but from the actual management of the task
execution.  It's expected that other information will also have to be
inferred, and "filled in", by the nrunner plugin implementation.

In the end, it's expected that results similar to these would be
presented::

  JOB ID     : f59bd40b8ac905864c4558dc02b6177d4f422ca3
  JOB LOG    : /home/cleber/avocado/job-results/job-2020-05-20T17.58-f59bd40/job.log
   (1/2) tests.py:Test.test_2: PASS (2.56 s)
   (2/2) tests.py:Test.test_1: INTERRUPT (900 s)
  RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 1 | CANCEL 0
  JOB TIME   : 0.19 s
  JOB HTML   : /home/cleber/avocado/job-results/job-2020-05-20T17.58-f59bd40/results.html

Notice how Task #2 shows up before Task #1, because it was both started
first and finished earlier.  There may be issues associated with the
current UI to be dealt with regarding out of order task status updates.

Summary
=======

This proposal contains a number of items that can become GitHub issues
at this stage.  It also contains a general explanation of what I
believe are the crucial missing features to make the N(ext) Runner
implementation available to the general public.

Feedback is highly appreciated, and it's expected that this document
will evolve into a better version, and possibly become a formal Blue
Print.

Thanks,
- Cleber.