Tim Peters <t...@python.org> added the comment:
> I am slightly confused about what .prepare() should do. Why > is this step necessary? To say "I'm done adding edges". Any call to add() after prepare() should raise an exception. Likewise, e.g., any call to get_ready() before prepare() should raise an exception. In a bog-standard topsort implementation, saving for each node a sequence of successors and a count of predecessors, this is also the time to collect all the nodes that have no predecessors (the first batch returned by get_ready()). Much the same could be done without prepare() by get_ready() making a special case out of the first time it's called. That's more magical, though. "I'm done adding edges" is utterly non-magical. > - Why we need the .done() method here? Why not instead make get_ready() > simply a generator so you can just write > > for node in self.get_ready(): The point of done() is to enable maximum exploitation of parallelism. As already sketched, if a user doesn't care about that, fine, a different method (like static_order()) can generate all the nodes in _some_ static topsort order, with no need for done(). But suppose a user does care about parallelism. Consider graph A -> B A -> C A -> D B -> D Output A B C D is a topsort, but useless unless the user is content to "do" one node at a time. Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... something iterable). A is handed out to worker processes/threads, but get_ready() will return an empty iterable until done(A) is called. Indeed, if "doing" A fails, it's NOT the case that anything else can ever be started. If/when "doing A" succeeds, then done(A) is called, and the next get_ready() returns [B, C]. Those can be done in parallel, but D can't be started until done(B) is called. done(B) may or may not be called before done(C) is called - the topsort itself has no way to know in advance, nor _should_ it impose its own view of that. Note that D does not depend on C, so there's no need to wait for _both_ in [B, C] to finish. It's necessary and sufficient that B be marked done() for D to be ready to go. > It seems that the .done() is very tight to use this API as a "task > scheduler" but maybe I am doing something here in my understanding > of the API. done() says nothing about how the user "should" schedule work items, but instead allows get_ready() to return all the work items whose predecessors have been marked done() (but haven't already been passed out by get_ready()). That's the maximum set of nodes that _can_ be worked on at the time. The topsort class itself has no policy about how or when they "should" be worked on, get_ready() is just identifying all the possibilities that exist. Which is impossible to know unless the class is also told which nodes it already passed out have finished - the purpose of done(). is_active() eventually returns False when all the nodes passed out by get_ready() have been marked done(), _and_ there are no more nodes ready to pass out. At that point, there's a cycle in the input relations if and only if there's some node get_ready() never passed out. In my prototype implementation, that's another thing prepare() does: checks for a cycle, and raises CycleError if there is one. The user can catch & ignore that if they like, and continue calling get_ready() and done() until no more progress can be made. I think it's more likely, though, that the user would stare at the cycle attached to the CycleError instance, do whatever it takes to break the cycle, and start over again. ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue17005> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com