On Thu, Sep 10, 2009 at 01:33:44PM +0100, Michael Hanselmann wrote:
> Hi Iustin
> 
> Thank you for looking into this!
> 
> 2009/9/8 Iustin Pop <[email protected]>:
> > My temporary patch series does:
> >  - 1) convert the cancel algorithm (for being processed jobs) to
> >    instead add the job id to a queue-level "jobs to be canceled" dict;
> >    this won't be replicated nor saved to disk, which means that a crash
> >    of masterd before jobs in status 'canceling' are actually canceled
> >    by their threads will forget the cancelling status
> 
> I'm a bit worried about keeping the should-be-cancelled-status in
> memory. It means queue operations via luxi are not permanent. Have you
> considered (re-?)introducing a per-job lock?

A little bit orthogonal - I was rather thinking that we could simply
create and replicate a job-NNNN.canceling file.

But speaking of the per-job lock, we would need to look into how it
would interact with this proposal. Last time I tried per-job locks,
the speedup was nothing exciting.
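To make the in-memory variant concrete, here is roughly what I mean by
the queue-level "jobs to be canceled" structure (a sketch only; class
and method names are illustrative, not the actual Ganeti code):

```python
import threading

class JobQueue:
  """Sketch of an in-memory cancel-request set.

  The set is neither saved to disk nor replicated, so a masterd
  crash before the worker thread acts on a request forgets the
  'canceling' status -- exactly the trade-off discussed above.
  """
  def __init__(self):
    self._lock = threading.Lock()
    self._cancel_requests = set()  # job ids, memory only

  def CancelJob(self, job_id):
    # called from a luxi client thread
    with self._lock:
      self._cancel_requests.add(job_id)

  def ShouldCancel(self, job_id):
    # polled by the worker thread between opcodes
    with self._lock:
      return job_id in self._cancel_requests
```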

> >  - 2) after step 1, only the worker thread for a running jobs will
> >    update it, and thus we can release the job queue lock during job
> >    updates (both save to disk and replicate to other MCs)
> 
> There should be some assertions in the code to make sure we don't
> accidentally update it from somewhere else (or move the update method
> to the queued job).

Of course ☺

My current patch adds an 'owner' field to the job, and the UpdateJob
code tests that either the current thread is the owner of the job (i.e.
we are a worker) or the job has no owner (and that means we do some
queries/read-only stuff on the job from a luxi client thread).

> >  - 3) while at it, convert the job status into a non-persistent
> >    attribute on the _QueuedJob object, and thus allow other code to
> >    simply read the attribute instead of calling CalcStatus; since this
> >    is a single attribute, we don't need a lock while reading it (but if
> >    one reads other attributes, then yes the lock is needed)
> 
> At the same time we must make sure to update the status attribute only
> after all updates to the opcodes are done. Thinking about it: Without
> locking there can be inconsistencies between job and opcodes (e.g.
> when running _QueuedJob.MarkUnfinishedOps). Do you have any solutions
> to this?

Not sure if I understand correctly, but I'll try: the job status
update is always done like this:

  job.status = job.CalcStatus()

so yes, it needs the opcodes to be correctly set up beforehand, and
it's always done after modifying the opcodes/an opcode.

Anyway, the main point is that we can take the lock for any kind of
attribute access, but we shouldn't hold it during job replication. My
patch does take the lock in WaitForJobChanges, for example, since
there we read multiple attributes of the job.
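The distinction is roughly this (a sketch with made-up attribute
names, not the actual WaitForJobChanges code):

```python
import threading

class JobView:
  """Sketch: multi-attribute reads take the lock for consistency;
  a single-attribute read can skip it."""
  def __init__(self):
    self._lock = threading.Lock()
    self.status = "running"
    self.log_serial = 0

  def GetChanges(self):
    # reading several attributes: hold the lock so the values are
    # mutually consistent (no update can interleave between reads)
    with self._lock:
      return (self.status, self.log_serial)

  def GetStatus(self):
    # a single attribute read is atomic in CPython, so no lock
    return self.status
```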

iustin
