On Sun, 31 Jul 2016 22:00:11 +0100, John Hardin <[email protected]> wrote:
Folks:
It looks like we failed to get a successful weekly masscheck again,
even though if you check the counts today they are above the thresholds.
I suspect this is happening due to some results being submitted "late".
I think we might want to look into making a change to the masscheck
timing rules, specifically: the cutoff for having enough corpora to run
the scoring and produce a rules update is not a specific time, but is
instead related to the following masscheck run.
Change is good. I must admit it's getting a bit frustrating seeing these
runs result in nothing at all.
In other words:
There is still a cutoff time for the masscheck run, but it only means
"the scoring won't start prior to this time."
If the corpora are above the thresholds when this time is reached, the
scoring and update process commences immediately.
If not, that doesn't mean we've missed an update, at least not yet.
If another result set comes in for that pass, and that result set pushes
it over the thresholds, then we can start the scoring and rule
generation process.
The actual hard cutoff for pass X would be sometime after pass X+1
starts. Perhaps if the cutoff time for pass X+1 is reached and pass X is
still waiting, then we give up on pass X.
This way, a late result set that satisfies the thresholds will just
delay the rule generation, not prevent it.
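To check I've understood the proposal, here is a minimal sketch of the
decision as I read it. It is not how the existing masscheck scripts work;
the names (Action, PassState, decide) and the threshold values are made up
for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # keep collecting results for pass X
    START_SCORING = "start"  # thresholds met: run scoring and rule generation
    GIVE_UP = "give_up"      # pass X+1 reached its cutoff while X was waiting


@dataclass
class PassState:
    cutoff: datetime        # scoring for this pass never starts before this
    next_cutoff: datetime   # cutoff of the following pass (the hard limit)
    ham: int = 0            # ham messages submitted so far
    spam: int = 0           # spam messages submitted so far


# Hypothetical threshold values, for illustration only.
HAM_THRESHOLD = 150_000
SPAM_THRESHOLD = 150_000


def decide(state: PassState, now: datetime) -> Action:
    """Decide what to do with a pass that has not started scoring yet."""
    if now < state.cutoff:
        # The cutoff only means "scoring won't start prior to this time".
        return Action.WAIT
    if now >= state.next_cutoff:
        # Pass X+1 has reached its cutoff and pass X is still waiting.
        return Action.GIVE_UP
    if state.ham >= HAM_THRESHOLD and state.spam >= SPAM_THRESHOLD:
        return Action.START_SCORING
    # Below thresholds: a late result set may still push us over.
    return Action.WAIT
```

The idea would be to call decide() at the cutoff and again whenever a new
result set arrives, so a late submission delays the update rather than
cancelling it.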
I like that there's an effort to still push the updates out as early in
the day as possible with this system. The simplest option is no doubt to
just delay the score generation, even to the point of giving a whole 24
hours if need be, as it would at least result in something fairly
reliable. This seems a good hybrid approach though.
This can use some refinement:
Some good thoughts, but ones that I fear may prove an obstacle to getting
a change in place. Perhaps things for a wishlist instead?
If we've started scoring and another result set for that pass comes in,
do we incorporate that into the score generation? We probably should;
the decision could be based on when the delayed results come in (we
don't want to keep resetting the scoring process and collide with the
following pass) and how large the new results are (we might want to
ignore a late small result set, but incorporate a late large result set).
As it stands I'm inclined to take the route that anything submitted after
the run has started gets lost - this is no different to the current
situation (as I understand it anyway) so it's not penalising anyone, but
it also doesn't grant further concessions. Adding in new results just
seems a way to potentially further delay an already delayed process.
Much as the additional data is beneficial, it seems like added complexity
for no gain. Given how tight the ham threshold is most days (there are a
lot of days in the 140k-150k region), a large result set is unlikely to
arrive after the threshold has been met anyway; it's far more likely to be
the trigger. If we start dividing large and small, we need to pick a point
and draw a line, and we potentially discourage submissions from people who
feel they aren't important enough.
I'd also note that when you look at the uploads you have people like axb
who submit multiple times in small groups - that is always an option to
people if they feel something is important enough to beat the threshold.
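The rule I'm suggesting for late results is simple enough to write down.
This is only a sketch: accept_result and its parameters are hypothetical
names, not anything in the current tooling.

```python
from datetime import datetime
from typing import Optional


def accept_result(submitted_at: datetime,
                  scoring_started_at: Optional[datetime]) -> bool:
    """Return True if a result set still counts towards the pass.

    Anything arriving once scoring has started is dropped, whatever its
    size, so the scoring process is never reset by a late submission.
    """
    if scoring_started_at is None:
        # Scoring has not started yet, so every submission counts.
        return True
    return submitted_at < scoring_started_at
```

There is no size test anywhere in it, which is the point: nobody has to
decide where "large" ends and "small" begins.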
If we're still running a score generation for pass X and pass X+1 has
reached its cutoff and has enough corpora to satisfy the thresholds and
immediately start the scoring process, do we give up on processing pass
X? I would think yes.
I don't know how long the process takes, but if we never start scoring a
pass once the next day's start point has arrived, I would assume the runs
would never overlap. I could be wrong, but a hard cutoff that can't
overlap the next day's start seems likely to be simpler. At some point we
need to give up hope on a day's results anyway, so that may be the
guideline for when that time is.