Tim Peters <t...@python.org> added the comment:

> I am slightly confused about what .prepare() should do. Why
> is this step necessary?

To say "I'm done adding edges".  Any call to add() after prepare() should raise 
an exception.  Likewise, e.g., any call to get_ready() before prepare() should 
raise an exception.  In a bog-standard topsort implementation, saving for each 
node a sequence of successors and a count of predecessors, this is also the 
time to collect all the nodes that have no predecessors (the first batch 
returned by get_ready()).

Much the same could be done without prepare() by get_ready() making a special 
case out of the first time it's called.  That's more magical, though.  "I'm 
done adding edges" is utterly non-magical.

> - Why we need the .done() method here? Why not instead make get_ready()
> simply a generator so you can just write
>
>    for node in self.get_ready():

The point of done() is to enable maximum exploitation of parallelism.  As 
already sketched, if a user doesn't care about that, fine, a different method 
(like static_order()) can generate all the nodes in _some_ static topsort 
order, with no need for done().

But suppose a user does care about parallelism.  Consider graph

A -> B
A -> C
A -> D
B -> D

Output A B C D is a topsort, but useless unless the user is content to "do" one 
node at a time.

Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... 
something iterable).  A is handed out to worker processes/threads, but 
get_ready() will return an empty iterable until done(A) is called.  Indeed, if 
"doing" A fails, it's NOT the case that anything else can ever be started.

If/when "doing A" succeeds, then done(A) is called, and the next get_ready() 
returns [B, C].  Those can be done in parallel, but D can't be started until 
done(B) is called.  done(B) may or may not be called before done(C) is called - 
the topsort itself has no way to know in advance, nor _should_ it impose its 
own view of that.  Note that D does not depend on C, so there's no need to wait 
for _both_ in [B, C] to finish.  It's necessary and sufficient that B be marked 
done() for D to be ready to go.

> It seems that the .done() is very tight to use this API as a "task
> scheduler" but maybe I am doing something here in my understanding
> of the API.

done() says nothing about how the user "should" schedule work items, but 
instead allows get_ready() to return all the work items whose predecessors have 
been marked done() (but haven't already been passed out by get_ready()).  
That's the maximum set of nodes that _can_ be worked on at the time.  The 
topsort class itself has no policy about how or when they "should" be worked 
on, get_ready() is just identifying all the possibilities that exist.  Which is 
impossible to know unless the class is also told which nodes it already passed 
out have finished - the purpose of done().

is_active() eventually returns False when all the nodes passed out by 
get_ready() have been marked done(), _and_ there are no more nodes ready to 
pass out.  At that point, there's a cycle in the input relations if and only if 
there's some node get_ready() never passed out.

In my prototype implementation, that's another thing prepare() does:  checks 
for a cycle, and raises CycleError if there is one.  The user can catch & 
ignore that if they like, and continue calling get_ready() and done() until no 
more progress can be made.  I think it's more likely, though, that the user 
would stare at the cycle attached to the CycleError instance, do whatever it 
takes to break the cycle, and start over again.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue17005>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to