Re: New ExecDB

2017-01-12 Thread Josef Skladanka
There's not been a huge amount of effort put into this - I've had other
priorities ever since, but I can get back to it if you feel it's the time
to do it. The only code working in that direction is here:
https://bitbucket.org/fedoraqa/execdb/branch/feature/pony where I basically
only started removing the tight coupling between execdb and buildbot, and
then went on trying to figure out what's in this thread.

On Tue, Jan 10, 2017 at 6:57 AM, Tim Flink  wrote:

> On Fri, 21 Oct 2016 13:16:04 +0200
> Josef Skladanka  wrote:
>
> > So, after a long discussion, we arrived at this solution.
> >
> > We will clearly split up the "who to notify" part and the "should we
> > re-schedule" part of the proposal. The party to notify will be stored
> > in the `notify` field, with `taskotron, task, unknown` options.
> > Initially, any crashes in the `shell` or `python` directives, during
> > formula parsing, and when installing the packages specified in the
> > formula's environment will be sent to the task maintainers, and every
> > other crash to the taskotron maintainers. That covers what I initially
> > wanted from the multiple crashed states.
> >
> > On top of that, we feel that having information on "what went
> > wrong" is important, and we'd like to have as much detail as
> > possible, but on the other hand we don't want the re-scheduling logic
> > to be too complicated. We agreed on using a `cause` field, with
> > `minion, task, network, libtaskotron, unknown` options, and storing
> > any other details in a key-value store. We will likely just
> > re-schedule any crashed task anyway at the beginning, but this
> > allows us to hoard some data and make more informed decisions later
> > on. On top of that, the `fatal` flag can be set to say that it is
> > not necessary to reschedule, as the crash is unlikely to be fixed by
> > that.
> >
> > This allows us to keep the re-scheduling logic rather simple, and most
> > importantly decoupled from the parts that just report what went wrong.
>
> How far did you end up getting on this?
>
> Tim
>


Re: New ExecDB

2017-01-09 Thread Tim Flink
On Fri, 21 Oct 2016 13:16:04 +0200
Josef Skladanka  wrote:

> So, after a long discussion, we arrived at this solution.
> 
> We will clearly split up the "who to notify" part and the "should we
> re-schedule" part of the proposal. The party to notify will be stored
> in the `notify` field, with `taskotron, task, unknown` options.
> Initially, any crashes in the `shell` or `python` directives, during
> formula parsing, and when installing the packages specified in the
> formula's environment will be sent to the task maintainers, and every
> other crash to the taskotron maintainers. That covers what I initially
> wanted from the multiple crashed states.
> 
> On top of that, we feel that having information on "what went
> wrong" is important, and we'd like to have as much detail as
> possible, but on the other hand we don't want the re-scheduling logic
> to be too complicated. We agreed on using a `cause` field, with
> `minion, task, network, libtaskotron, unknown` options, and storing
> any other details in a key-value store. We will likely just
> re-schedule any crashed task anyway at the beginning, but this
> allows us to hoard some data and make more informed decisions later
> on. On top of that, the `fatal` flag can be set to say that it is
> not necessary to reschedule, as the crash is unlikely to be fixed by
> that.
> 
> This allows us to keep the re-scheduling logic rather simple, and most
> importantly decoupled from the parts that just report what went wrong.

How far did you end up getting on this?

Tim




Re: New ExecDB

2016-10-21 Thread Josef Skladanka
So, after a long discussion, we arrived at this solution.

We will clearly split up the "who to notify" part and the "should we
re-schedule" part of the proposal. The party to notify will be stored in
the `notify` field, with `taskotron, task, unknown` options. Initially, any
crashes in the `shell` or `python` directives, during formula parsing, and
when installing the packages specified in the formula's environment will be
sent to the task maintainers, and every other crash to the taskotron
maintainers. That covers what I initially wanted from the multiple crashed
states.

On top of that, we feel that having information on "what went wrong" is
important, and we'd like to have as much detail as possible, but on the
other hand we don't want the re-scheduling logic to be too complicated. We
agreed on using a `cause` field, with `minion, task, network, libtaskotron,
unknown` options, and storing any other details in a key-value store. We
will likely just re-schedule any crashed task anyway at the beginning, but
this allows us to hoard some data and make more informed decisions later
on. On top of that, the `fatal` flag can be set to say that it is not
necessary to reschedule, as the crash is unlikely to be fixed by that.

This allows us to keep the re-scheduling logic rather simple, and most
importantly decoupled from the parts that just report what went wrong.
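
To make the agreed-upon fields concrete, a crashed job record under this
scheme could look roughly like the following sketch (only `state`, `notify`,
`cause`, `fatal` and the keyvals come from the discussion above; everything
else is illustrative):

    crashed_job = {
        "uuid": "...",        # assigned by ExecDB at Trigger time
        "state": "CRASHED",
        "notify": "task",     # taskotron | task | unknown
        "cause": "network",   # minion | task | network | libtaskotron | unknown
        "fatal": False,       # True = rescheduling is unlikely to help
        "keyvals": {
            "step": "MINION_CREATION",
            "reason": "TIMED_OUT",
        },
    }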


Re: New ExecDB

2016-10-12 Thread Kamil Paral
> On Tue, Oct 11, 2016 at 1:14 PM, Kamil Paral < kpa...@redhat.com > wrote:
>
> > > Proposal looks good to me, I don't have any strong objections.
> > >
> > > 1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
> > > 2. I think that having enum values in details in crash structure would
> > > be better, but I don't have strong opinion either way.
> >
> > For consistency checking, yes. But it's somewhat inflexible. If the need
> > arises, I imagine the detail string can be in json format (or
> > semicolon-separated keyvals or something) and we can store several useful
> > properties in there, not just one.
>
> I'd rather do the key-value thing as we do in ResultsDB than storing plain
> JSON. Yes, the new Postgres can do it (and can also search it to some
> extent), but it is not all-mighty and has its own problems.

Sure, that would be even better (though you didn't sound like you supported
that idea a few days back). So having "state", "blame" and "keyvals" would
work well, in my opinion.

> > E.g. not only that Koji call failed, but what was its HTTP error code. Or
> > not that dnf install failed, but also whether it was the infamous "no more
> > mirror to try" error or a dependency error. I don't want to misuse that to
> > store loads of data, but this could be useful to track specific issues we
> > currently have a hard time tracking (e.g. our still existing depcheck
> > issue, that happens only rarely and it's difficult for us to get a list of
> > tasks affected by it). With this, we could add a flag "this is related to
> > problem XYZ that we're trying to solve".
> 

> I probably understand what you want, but I'd rather have a specified set of
> values, which will/can be acted upon. Maybe changing the structure to
> `{state, blame, cause, details}`, where the `cause` is still an enum of
> known values but details is freeform, but strictly used for humans?

Hm, I can't actually imagine any case where freeform text *strictly* for
humans would be all that useful. But I can imagine cases where the field is useful for 
humans *and* machines. As an example, we currently have several known causes 
why minions are not properly created. Linking to backingstores might fail, the 
libvirt VM creation or startup might fail, and we might time out while waiting 
for the machine. Those are 3 different reasons for the same overall problem in 
"MINION_CREATION". So, I imagine we can store it like this: 

state: CRASHED
blame: TASKOTRON
keyvals:
    step: MINION_CREATION
    reason: TIMED_OUT

or like this: 

state: CRASHED
blame: TASKOTRON
cause: MINION_CREATION
keyvals:
    reason: TIMED_OUT

All of that is human readable and machine parseable. The top-level keys are
enums in the database, the keyvals are defined in libtaskotron and the
database doesn't care. Of course, keyvals are keyvals, so we can easily add
something like "human_description: foo" if we need it for some use cases, or
even "time_elapsed: 70". We can easily ignore the ones we don't care about,
or use them only in specific cases. They are flexible and can be easily
extended.

And now we finally get to the point. When Tim asks "do you know how often
backing store linking fails during minion creation?" (as he really did in
T833), we can say "here's a simple query to find out!". Or, if it is not
implemented yet, we can say "let's add a new 'reason: BACKINGSTORE_LINKING'
and let's see in a month!". Because currently we can't really say anything,
and the best thing to do is to grep all logs or make some custom hack on our
servers.
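
As a concrete illustration of the kind of query this enables - assuming a
plain job table plus a job_keyval (job_id, key, value) table, which is an
assumption about the eventual schema, not something that exists yet -
counting those failures could be as simple as:

    # Hypothetical example: count how often minion creation crashed because
    # backing store linking failed. Table and column names are made up.
    import psycopg2

    conn = psycopg2.connect(dbname="execdb")   # assumed connection settings
    cur = conn.cursor()
    cur.execute("""
        SELECT count(DISTINCT j.id)
          FROM job j
          JOIN job_keyval step ON step.job_id = j.id
               AND step.key = 'step' AND step.value = 'MINION_CREATION'
          JOIN job_keyval reason ON reason.job_id = j.id
               AND reason.key = 'reason'
               AND reason.value = 'BACKINGSTORE_LINKING'
         WHERE j.state = 'CRASHED'
    """)
    print(cur.fetchone()[0])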

> So we can "CRASHED->THIRDPARTY->UNKNOWN->"text of the exception" for example,

I understand it's just an example, but in this case it would probably be
CRASHED->UNKNOWN->UNKNOWN, because if you don't know the cause, you can't
blame anybody.

> or "CRASHED->TASKOTRON->NETWORK->"dnf - no more mirrors to try".

And here the blame would be THIRDPARTY, since it's NETWORK :-) 

I'd prefer keyvals over this, because you can't fit anything useful for
querying into the free-form details field (or, as you said, you'd rather
avoid that). The fact that I know it crashed because of the network might
not be enough information to debug issues, decide rescheduling, etc.

> I'd rather act on a known set of values, then have code like:

You probably meant 'than', right? I'm a bit confused here. 

> if ('dnf' in detail and 'no more mirrors' in detail) or ('DNF' in detail and
> 'could not connect' in detail)

Does the example with keyvals I posted above look better? I'd say using
keyvals with known enum values (defined in libtaskotron) is definitely more
reliable and maintainable than matching free-form text like above.

> in the end, it is almost the same, because there will be problems with
> classifying the errors, and the more layers we add, the harder it gets - that
> is the reason I initially only wanted to do the {state, blame} thing. B

Re: New ExecDB

2016-10-12 Thread Josef Skladanka
On Tue, Oct 11, 2016 at 1:14 PM, Kamil Paral  wrote:

> Proposal looks good to me, I don't have any strong objections.
>
> 1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
> 2. I think that having enum values in details in crash structure would be
> better, but I don't have strong opinion either way.
>
>
> For consistency checking, yes. But it's somewhat inflexible. If the need
> arises, I imagine the detail string can be in json format (or
> semicolon-separated keyvals or something) and we can store several useful
> properties in there, not just one.
>


I'd rather do the key-value thing as we do in ResultsDB than storing plain
JSON. Yes, the new Postgres can do it (and can also search it to some
extent), but it is not all-mighty and has its own problems.
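
For illustration, the "key-value thing" could look roughly like the
extra-data handling in ResultsDB - a sketch assuming an SQLAlchemy-based
schema; none of these names are final:

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import relationship

    Base = declarative_base()

    class Job(Base):
        __tablename__ = 'job'
        id = Column(Integer, primary_key=True)
        uuid = Column(String, unique=True, nullable=False)
        state = Column(String, nullable=False)  # SCHEDULED/RUNNING/FINISHED/CRASHED
        blame = Column(String, nullable=True)   # enum-ish column, see discussion
        keyvals = relationship('JobKeyVal', backref='job')

    class JobKeyVal(Base):
        __tablename__ = 'job_keyval'
        id = Column(Integer, primary_key=True)
        job_id = Column(Integer, ForeignKey('job.id'), nullable=False)
        key = Column(String, nullable=False)    # e.g. 'step', 'reason'
        value = Column(String, nullable=False)  # e.g. 'MINION_CREATION'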



> E.g. not only that Koji call failed, but what was its HTTP error code. Or
> not that dnf install failed, but also whether it was the infamous "no more
> mirror to try" error or a dependency error. I don't want to misuse that to
> store loads of data, but this could be useful to track specific issues we
> currently have a hard time tracking (e.g. our still existing depcheck issue,
> that happens only rarely and it's difficult for us to get a list of tasks
> affected by it). With this, we could add a flag "this is related to problem
> XYZ that we're trying to solve".
>
>
I probably understand what you want, but I'd rather have a specified set
of values, which will/can be acted upon. Maybe changing the structure to
`{state, blame, cause, details}`, where the `cause` is still an enum of
known values but details is freeform, but strictly used for humans? So we
can "CRASHED->THIRDPARTY->UNKNOWN->"text of the exception" for example, or
"CRASHED->TASKOTRON->NETWORK->"dnf - no more mirrors to try".

I'd rather act on a known set of values, then have code like:

if ('dnf' in detail and 'no more mirrors' in detail) or ('DNF' in
detail and 'could not connect' in detail)

In the end, it is almost the same, because there will be problems with
classifying the errors, and the more layers we add, the harder it gets -
that is the reason I initially only wanted to do the {state, blame} thing.
But I feel that just state and blame is not enough information for us to
act upon - e.g. to decide when to automatically reschedule and when not -
and I'm afraid that with the exploded complexity of the 'crashed states'
the code for handling the "should we reschedule" decisions will be awful.
Notifying the right party is fine (that is what blame gives us), but this is
IMO what we should focus on a bit.
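
To make the contrast concrete, acting on a known set of values could keep the
re-scheduling decision to a few lines - a sketch only, using the enum-style
values discussed in this thread (the function and the exact value set are
hypothetical):

    # Causes that usually point at transient problems worth an automatic retry.
    RESCHEDULE_CAUSES = {'network', 'minion'}

    def should_reschedule(state, cause, fatal=False):
        """Decide whether a crashed job should be automatically re-scheduled."""
        if state != 'CRASHED' or fatal:
            return False
        return cause in RESCHEDULE_CAUSES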

Tim, do you have any comments?


Re: New ExecDB

2016-10-11 Thread Kamil Paral
> Proposal looks good to me, I don't have any strong objections.

> 1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
> 2. I think that having enum values in details in crash structure would be
> better, but I don't have strong opinion either way.

For consistency checking, yes. But it's somewhat inflexible. If the need 
arises, I imagine the detail string can be in json format (or 
semicolon-separated keyvals or something) and we can store several useful 
properties in there, not just one. E.g. not only that Koji call failed, but 
what was its HTTP error code. Or not that dnf install failed, but also whether 
it was the infamous "no more mirror to try" error or a dependency error. I 
don't want to misuse that to store loads of data, but this could be useful to 
track specific issues we currently have a hard time tracking (e.g. our still 
existing depcheck issue, that happens only rarely and it's difficult for us to 
get a list of tasks affected by it). With this, we could add a flag "this is 
related to problem XYZ that we're trying to solve". 


Re: New ExecDB

2016-10-11 Thread Kamil Paral
> With ResultsDB and Trigger rewrite done, I'd like to get started on ExecDB.

> The current ExecDB is more of a tech-preview, that was to show that it's
> possible to consume the push notifications from Buildbot. The thing is, that
> the code doing it is quite a mess (mostly because the notifications are
> quite a mess), and it's directly tied not only to Buildbot, but quite
> probably to the one version of Buildbot we currently use.
> I'd like to change the process to a style, where ExecDB provides an API, and
> Buildbot (or possibly any other execution tool we use in the future) will
> just use that to switch the execution states.

> ExecDB should be the hub, in which we can go to search for execution state
> and statistics of our jobs/tasks. The execution is tied together via UUID,
> provided by ExecDB at Trigger time. The UUID is passed around through all
> the stack, from Trigger to ResultsDB.

> The process, as I envision it, is:
> 1) Trigger consumes FedMsg
> 2) Trigger creates a new Job in ExecDB, storing data like FedMsg message id,
> and other relevant information (to make rescheduling possible)
> 3) ExecDB provides the UUID, marks the Job as SCHEDULED and Trigger then
> passes the UUID, along with other data, to Buildbot.
> 4) Buildbot runs runtask (sets ExecDB job to RUNNING)
> 5) Libtaskotron is provided the UUID, so it can then be used to report
> results to ResultsDB.
> 6) Libtaskotron reports to ResultsDB, using the UUID as the Group UUID.
> 7) Libtaskotron ends, creating a status file in a known location
> 8) The status file contains machine-parsable information about the runtask
> execution - either "OK" or a description of "Fault" (network failed, package
> to be installed did not exist, koji did not respond... you name it)
> 9) Buildbot parses the status file, and reports back to ExecDB, marking the
> Job either as FINISHED or CRASHED (+details)

> This will need changes in Buildbot steps - a step that switches the job to
> RUNNING at the beginning, and a step that handles the FINISHED/CRASHED
> switch. The way I see it, this can be done via a simple CURL or HTTPie call
> from the command line. No big issue here.

> We should make sure that ExecDB stores data that:
> 1) show the execution state
> 2) allow job re-scheduling
> 3) describe the reason the Job CRASHED

> 1 is obviously the state. 2 I think can be satisfied by storing the Fedmsg
> Message ID and/or the Trigger-parsed data, which are passed to Buildbot.
> Here I'd like to focus on 3:

> My initial idea was to have SCHEDULED, RUNNING, FINISHED states, and four
> crashed states, to describe where the fault was:
> - CRASHED_TASKOTRON for when the error is on "our" side (minion could not be
> started, git repo with task not cloned...)
> - CRASHED_TASK to use when there's an unhandled exception in the Task code
> - CRASHED_RESOURCES when network is down, etc
> - CRASHED_OTHER whenever we are not sure

> The point of the crashed "classes" is to be able to act on different kinds
> of crashes - notify the right party, or even automatically reschedule the
> job in the case of a network failure, for example.

> After talking this through with Kamil, I'd rather do something slightly
> different. There would only be one CRASHED state, but the job would contain
> additional information to
> - find the right person to notify
> - get more information about the cause of the failure
> To do this, we came up with a structure like this:
> {state: CRASHED, blame: [TASKOTRON, TASK, UNIVERSE], details: "free-text-ish
> description"}

> The "blame" classes are self-describing, although I'd love to have a better
> name for "UNIVERSE".

I was thinking about this and what about "blame: THIRD_PARTY" (or THIRDPARTY)?
I think that best describes the distinction between us (taskotron authors),
them (task authors) and anyone else (servers, networks, etc.).

I'd also like to add "blame: UNKNOWN" to distinguish third parties we can
identify (koji, bodhi) from errors where we have no idea what caused them.
This will allow us to more easily spot new or infrequent crashes.
Alternatively, the "blame" field can be null/none, which can have the same
meaning. But "unknown" is probably more descriptive (and "none" can be
converted to "unknown" when saving this to the database).
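
A possible spelling of these blame classes, just to make the discussion
concrete (the final value set was still open at this point, so treat this as
a sketch):

    from enum import Enum

    class Blame(Enum):
        TASKOTRON = 'taskotron'    # error on our side (minion, git clone, ...)
        TASK = 'task'              # unhandled exception in the task code
        THIRDPARTY = 'thirdparty'  # koji, bodhi, network, other services
        UNKNOWN = 'unknown'        # we have no idea what caused it

    def normalize_blame(value):
        """Map a missing/None blame onto UNKNOWN when saving to the database."""
        return Blame(value) if value else Blame.UNKNOWN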

> We might want to add more, should it make sense, but my main focus is to find
> the right party to notify.
> The "details" field will contain the actual cause of the failure (in the case
> we know it), and although I have it marked as free-text, I'd like to have a
> set of values defined in docs, to keep things consistent.

> Doing this, we could record that "Koji failed, timed out" (and blame
> UNIVERSE, and possibly reschedule) or "DNF failed, package not found" (blame
> TASK if it was in the formula, and notify the task maintainer), or "Minion
> creation failed" (and blame TASKOTRON, notify us, I guess).

> Implementing the crash classification will obviously take some time, but it
> can be gr

Re: New ExecDB

2016-10-11 Thread Jan Sedlak
Proposal looks good to me, I don't have any strong objections.

1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
2. I think that having enum values in details in crash structure would be
better, but I don't have strong opinion either way.

Jan


New ExecDB

2016-10-10 Thread Josef Skladanka
With the ResultsDB and Trigger rewrites done, I'd like to get started on
ExecDB.

The current ExecDB is more of a tech preview that was meant to show that it's
possible to consume the push notifications from Buildbot. The thing is that
the code doing it is quite a mess (mostly because the notifications are quite
a mess), and it's directly tied not only to Buildbot, but quite probably to
the one version of Buildbot we currently use.
I'd like to change the process to a style where ExecDB provides an API, and
Buildbot (or possibly any other execution tool we use in the future) will
just use that to switch the execution states.

ExecDB should be the hub in which we can search for the execution state and
statistics of our jobs/tasks. The execution is tied together via a UUID,
provided by ExecDB at Trigger time. The UUID is passed around through the
whole stack, from Trigger to ResultsDB.

The process, as I envision it, is:
1) Trigger consumes FedMsg
2) Trigger creates a new Job in ExecDB, storing data like the FedMsg message
id and other relevant information (to make rescheduling possible)
3) ExecDB provides the UUID, marks the Job as SCHEDULED, and Trigger then
passes the UUID, along with other data, to Buildbot.
4) Buildbot runs runtask (sets the ExecDB job to RUNNING)
5) Libtaskotron is provided the UUID, so it can then be used to report
results to ResultsDB.
6) Libtaskotron reports to ResultsDB, using the UUID as the Group UUID.
7) Libtaskotron ends, creating a status file in a known location
8) The status file contains machine-parsable information about the runtask
execution - either "OK" or a description of the "Fault" (network failed,
package to be installed did not exist, koji did not respond... you name it);
see the sketch after this list
9) Buildbot parses the status file, and reports back to ExecDB, marking the
Job either as FINISHED or CRASHED (+details)
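
The exact format of the status file is not pinned down anywhere yet; one
possible shape, sketched as the JSON that libtaskotron could write (the file
path and field names are assumptions):

    import json

    # Successful run:
    status_ok = {"status": "OK"}

    # Crashed run, with enough detail for ExecDB to classify it:
    status_crashed = {
        "status": "FAULT",
        "blame": "thirdparty",
        "details": "koji call timed out after 120s",
    }

    # Written to the "known location" from step 7 (path is made up here):
    with open("/var/tmp/taskotron/status.json", "w") as status_file:
        json.dump(status_crashed, status_file)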

This will need changes in the Buildbot steps - a step that switches the job
to RUNNING at the beginning, and a step that handles the FINISHED/CRASHED
switch. The way I see it, this can be done via a simple curl or HTTPie call
from the command line. No big issue here.
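
In practice the Buildbot steps would presumably just shell out to curl or
HTTPie; expressed in Python with requests, the two calls would amount to
something like this (the endpoint paths and payloads are assumptions - the
ExecDB API described here does not exist yet):

    import requests

    EXECDB_URL = "http://execdb.example.org"   # placeholder

    def mark_running(uuid):
        # step at the beginning of the build
        requests.post("%s/jobs/%s" % (EXECDB_URL, uuid),
                      json={"state": "RUNNING"})

    def mark_finished(uuid, status):
        # final step; `status` is the parsed content of the status file
        state = "FINISHED" if status.get("status") == "OK" else "CRASHED"
        requests.post("%s/jobs/%s" % (EXECDB_URL, uuid),
                      json={"state": state, "details": status})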

We should make sure that ExecDB stores data that:
1) show the execution state
2) allow job re-scheduling
3) describe the reason the Job CRASHED

1 is obviously the state. 2 I think can be satisfied by storing the Fedmsg
Message ID and/or the Trigger-parsed data, which are passed to Buildbot.
Here I'd like to focus on 3:

My initial idea was to have SCHEDULED, RUNNING, FINISHED states, and four
crashed states, to describe where the fault was:
 - CRASHED_TASKOTRON for when the error is on "our" side (minion could not
be started, git repo with task not cloned...)
 - CRASHED_TASK to use when there's an unhandled exception in the Task code
 - CRASHED_RESOURCES when network is down, etc
 - CRASHED_OTHER whenever we are not sure

The point of the crashed "classes" is to be able to act on different kinds
of crashes - notify the right party, or even automatically reschedule the
job in the case of a network failure, for example.

After talking this through with Kamil, I'd rather do something slightly
different. There would only be one CRASHED state, but the job would contain
additional information to
 - find the right person to notify
 - get more information about the cause of the failure
To do this, we came up with a structure like this:
  {state: CRASHED, blame: [TASKOTRON, TASK, UNIVERSE], details:
"free-text-ish description"}

The "blame" classes are self-describing, although I'd love to have a better
name for "UNIVERSE". We might want to add more, should it make sense, but
my main focus is to find the right party to notify.
The "details" field will contain the actual cause of the failure (in the
case we know it), and although I have it marked as free-text, I'd like to
have a set of values defined in docs, to keep things consistent.

Doing this, we could record that "Koji failed, timed out" (and blame
UNIVERSE, and possibly reschedule) or "DNF failed, package not found"
(blame TASK if it was in the formula, and notify the task maintainer), or
"Minion creation failed" (and blame TASKOTRON, notify us, I guess).

Implementing the crash classification will obviously take some time, but it
can be gradual, and we can start handling the "well known" failures soon,
for the bigger gain (kparal had some examples, IIRC).

So - what do you think about it? Is it a good idea? Do you feel like there
should be more blame targets (I can't really imagine there being fewer), like
NETWORK, for example - and if so, why, and which? How about the details -
should we go with a pre-defined set of values (because enums are better than
free-text, but adding more would mean DB changes), or is free-text + docs
fine? Or do you see some other, better solution?

joza