[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-29 Thread Paddy McCarthy
Hi Larry, sets (and mappings, i.e. dicts) have a common functionality among
many languages and libraries that does not include any ordering. Already, in
CPython, there is a need to somehow indicate that insertion ordering is
being relied on in dicts, or to use OrderedDict. I am quite happy with keeping
sets as they are and coding any extra functionality required on top of this.
This aids me, as I work in many languages that have their own sets and
mappings, usually without any insertion ordering.

Cheers, Paddy.


On Sun, Dec 29, 2019, 12:07 AM Larry Hastings  wrote:

>
> On 12/27/19 7:44 PM, Tim Peters wrote:
>
> [Nick Coghlan  ]
>
> I took Larry's request a slightly different way: he has a use case where
> he wants order preservation (so built in sets aren't good), but combined
> with low cost duplicate identification and elimination and removal of
> arbitrary elements (so lists and collections.deque aren't good). Organising
> a work queue that way seems common enough ...
>
>
> Is it?  I confess I haven't thought of a plausible use case.  Larry
> didn't really explain his problem, just suggested that an ordered set
> would be "a solution" to it.
>
>
> Here is the original description of my problem, from the original email in
> this thread.  I considered this an adequate explanation of my problem at
> the time.
>
>
> On 12/15/19 6:48 PM, Larry Hastings wrote:
>
> I do have a use case for this. In one project I maintain a "ready" list of
> jobs; I need to iterate over it, but I also want fast lookup because I
> sometimes remove jobs when they're subsequently marked "not ready".  The
> existing set object would work fine here, except that it doesn't maintain
> insertion order.  That means multiple runs of the program with the same
> inputs may result in running jobs in different orders, and this instability
> makes debugging more difficult.  I've therefore switched from a set to a
> dict with all keys mapped to None, which provides all the set-like
> functionality I need.
>
>
> In subsequent emails I've clarified that my workloads are small
> enough--and computers are fast enough--that almost any data structure would
> work fine for me here, e.g. a list.  I don't think my needs should drive
> the decision making process regardless.  I only described my problem to get
> the conversation rolling.
>
> I opine that anybody who iterates over set objects and has bugs in their
> code would appreciate set objects maintaining insertion order.  I suspect
> it would make their debugging easier, because given identical inputs their
> set iteration would behave identically, thus making their bugs that much
> more stable.  That's as much use case as I have for the feature.
>
>
> */arry*


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-29 Thread Larry Hastings


On 12/28/19 6:24 PM, Tim Peters wrote:

[Larry]

Here is the original description of my problem, from the original email in
this thread.  I considered this an adequate explanation of my problem
at the time.

I do have a use case for this. In one project I maintain a "ready" list of
jobs; I need to iterate over it, but I also want fast lookup because I
sometimes remove jobs when they're subsequently marked "not ready".

Which is a description not of "the problem", but of the operations you
want to perform.  It's the first time in my life I've heard of a
"ready list of jobs" where the programmer sometimes decides later "oh,
no - this one and that one aren't ready after all" ;-)



It's a lightweight abstract dependency graph.  Its nodes are opaque, 
only required to be hashable.  And it doesn't require that you give it 
all the nodes in strict dependency order.


When you add a node, you can also optionally specify dependencies, and 
those dependencies aren't required to be nodes that the graph has 
previously seen.  So you can add a node A that depends on B and C, 
without showing it B or C first.  When that happens it creates 
placeholder nodes for B and C, and naturally these nodes have no 
dependencies.  You can then later inform the graph that node B has 
dependencies on X Y and Z.


(My specific use case is a package manager.  Packages are identified 
with unique strings.  So the nodes I give the graph are simply 
those package names.  It's pretty common for me to get a package with 
dependencies on packages I haven't seen yet.  Designing the graph to 
create placeholders let me skip making two passes over the package list, 
pre-creating the package objects, etc etc.  This made the code simpler 
and presumably faster.)


The graph API maintains an externally-visible "ready" iterable of 
nodes.  And since B can go from ready to not-ready, it can get added to 
"ready" and subsequently removed.


Also, when a node is satisfied, you simply remove it from the graph.  If 
A depends on B and C, and they all represent jobs, and B and C have no 
dependencies, they'll be in the "ready" iterable. You iterate over 
"ready", and execute those jobs, and assuming they are successful you 
then remove them from the graph.  When A's dependencies are all 
satisfied, it'll get added to the "ready" iterable.  And naturally when 
B and C were removed from the graph, they were removed from the "ready" 
iterable too.


Thus it's this "ready" collection that I want to a) be iterable, b) be 
stable, and c) support fast removal.  I add and remove nodes from that 
set a lot, which I realized would perturb the iteration order from run 
to run, etc etc etc, and here we are.
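
A concrete, minimal sketch of the shape of this design (the names here are 
invented for illustration, not the actual project code), using a plain dict 
with all values None as the insertion-ordered "ready" set:

    class DependencyGraph:
        def __init__(self):
            self._deps = {}    # node -> dict (ordered set) of unsatisfied dependencies
            self._rdeps = {}   # node -> dict (ordered set) of nodes depending on it
            # The "ready" collection: a dict with None values gives O(1)
            # add, remove, and membership, plus stable insertion order.
            self.ready = {}

        def add(self, node, dependencies=()):
            if node not in self._deps:
                self._deps[node] = {}
                self._rdeps[node] = {}
                self.ready[node] = None          # no dependencies yet
            for dep in dependencies:
                if dep not in self._deps:
                    self.add(dep)                # placeholder node, itself ready
                self._deps[node][dep] = None
                self._rdeps[dep][node] = None
            if self._deps[node]:
                self.ready.pop(node, None)       # ready -> not-ready

        def remove(self, node):
            # Called when a job is satisfied.  Assumes only ready
            # (dependency-free) nodes are removed, as in the workflow above:
            # drop the node and promote any dependent whose last dependency
            # this was.
            for dependent in self._rdeps.pop(node):
                deps = self._deps[dependent]
                deps.pop(node, None)
                if not deps:
                    self.ready[dependent] = None
            del self._deps[node]
            self.ready.pop(node, None)

    # Usage: iterate a snapshot of "ready", since running jobs mutates it.
    #     g = DependencyGraph()
    #     g.add("A", ["B", "C"])
    #     while g.ready:
    #         for job in list(g.ready):
    #             run(job)        # hypothetical job runner
    #             g.remove(job)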



I grant you maybe it's a weird approach.  But after two false starts, 
where I was starting to boil the oceans with ABCs and "node is 
satisfied" APis and separate iterator objects, I had a re-think and hit 
on this super lightweight design.  I'm actually quite pleased with 
it--it's short, it has a minimal API, it's readable, it was easy to get 
right, and it solves my problem.



Happy new year,


//arry/



[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-29 Thread Tim Peters
[Larry]
> It's a lightweight abstract dependency graph.  Its nodes are opaque,
> only required to be hashable.  And it doesn't require that you give it
> all the nodes in strict dependency order.
>
> When you add a node, you can also optionally specify
> dependencies, and those dependencies aren't required
> to be nodes that the graph has previously seen.  So you can add
> a node A that depends on B and C, without showing it B or C
> first.  When that happens it creates placeholder nodes for B
> and C, and naturally these nodes have no dependencies.  You
> can then later inform the graph that node B has dependencies
> on X Y and Z.
>
> (My specific use case is a package manager.  Packages are identified
> with unique strings.  So the nodes I give the graph are simply
> those package names.  It's pretty common for me to get a package
> with dependencies on packages I haven't seen yet.  Designing the
> graph to create placeholders let me skip making two passes over
> the package list, pre-creating the package objects, etc etc.  This
> made the code simpler and presumably faster.)
>
> The graph API maintains an externally-visible "ready" iterable of
> nodes.  And since B can go from ready to not-ready, it can get added
> to "ready" and subsequently removed.
>
> Also, when a node is satisfied, you simply remove it from the graph.
>  If A depends on B and C, and they all represent jobs, and B and C
> have no dependencies, they'll be in the "ready" iterable.  You iterate
> over "ready", and execute those jobs, and assuming they are
> successful you then remove them from the graph.  When A's
> dependencies are all satisfied, it'll get added to the "ready" iterable.
>  And naturally when B and C were removed from the graph, they were
> removed from the "ready" iterable too.

OK!  You're doing a topological sort.  There are natural & simple ways
to do that right now that scale efficiently to very large graphs (time
& space linear in the sum of the number of nodes and the number of
dependencies).  Curiously, the difficulties are mostly driven by the
quality of _error_ messages you want (in case of, e.g., cycles in the
dependency graph).

Lots of those implementations became deterministic "by magic" when
ordered dicts were introduced.  This is why:  a bare bones
implementation needs to associate two pieces of info with each node:
a list of its immediate successors, and an integer count of the number
of its immediate predecessors.  A dict is the natural way to implement
that association.

When all the dependency info has been entered, then:

- First all nodes are emitted whose predecessor count is 0.  Iterating
over the association dict once is the natural way to find them, and
that order is defined now.

- Then, as each emitted node N is marked done:
  for child in N.successors:
  assert child.predcount > 0
  child.predcount -= 1
  if child.predcount == 0:
  emit(child)

That's about all there is to it.  But there's no end to the cruft that
_can_ be added to, e.g., verify that a node is marked done no more
than once, or to compute & display strongly connected components if
not all nodes are emitted, etc.
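
For the record, a bare-bones sketch of that scheme (illustrative only; the
names are made up), feeding it (node, predecessor) pairs:

    from collections import defaultdict

    def toposort(dependencies):
        # dependencies: iterable of (node, predecessor) pairs, meaning
        # `predecessor` must be emitted before `node`.
        successors = defaultdict(list)   # node -> list of immediate successors
        predcount = {}                   # node -> count of immediate predecessors
        for node, pred in dependencies:
            successors[pred].append(node)
            predcount[node] = predcount.get(node, 0) + 1
            predcount.setdefault(pred, 0)
        # Emit every node with no predecessors first; since dicts preserve
        # insertion order, this pass is deterministic "by magic".
        ready = [node for node, count in predcount.items() if count == 0]
        emitted = 0
        for node in ready:               # appending below extends this loop
            yield node
            emitted += 1
            for child in successors[node]:
                predcount[child] -= 1
                if predcount[child] == 0:
                    ready.append(child)
        if emitted < len(predcount):
            raise ValueError("cycle among the nodes that were not emitted")

    # Example: C depends on A and B; D depends on C.
    print(list(toposort([("C", "A"), ("C", "B"), ("D", "C")])))
    # -> ['A', 'B', 'C', 'D']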

Ordered sets could improve this "natural" implementation if the
successor lists were successor sets instead, simply because then -
like lists - their iteration order would be defined, which can in turn
affect the order in which nodes are emitted in the loop just above.
But lists are adequate to the task, are cheaper in both time and
space, and can't _not_ be ordered ;-)


> Thus it's this "ready" collection that I want to a) be iterable, b) be
> stable, and c) support fast removal.  I add and remove nodes from that
> set a lot, which I realized would perturb the iteration order from run
> to run, etc etc etc, and here we are.

The sketch above (which is a bog-common way to implement topsorts)
doesn't build a data structure recording the final order, and, in
fact, never deletes anything.  You might protest that the initial
iteration step (scanning the association dict to find nodes with no
predecessors) is an expense you skip because nodes with predecessors
are deleted from your "ready" structure all along.  But nodes with
predecessors are _touched_ either way, and merely skipping over them
is bound to be cheaper than deleting them.

After that initial step, no search of any kind is needed again.

> I grant you maybe it's a weird approach.  But after two false starts, where I
> was starting to boil the oceans with ABCs and "node is satisfied" APIs and
> separate iterator objects, I had a re-think and hit on this super lightweight
> design.  I'm actually quite pleased with it--it's short, it has a minimal API,
> it's readable, it was easy to get right, and it solves my problem.

Whereas I would have started with a topsort and finished it while I
was sleeping ;-)  Seriously, I've written such a thing many times, but
never reuse the code because it's so easy to redo.

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-29 Thread Wes Turner
Due to version and platform constraints, a SAT solver is necessary for
conda and (now) pip. Libsolv is one of the fastest around.

 https://github.com/QuantStack/mamba is conda implemented with libsolv (and
parallel downloading of *declarative* dependency metadata).

For general purpose graphs with implicit node instantiation from edge
declaration, NetworkX has implementations of many graph algorithms.
https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_sort.html
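
A rough sketch of that (the package names are just placeholders): edges
implicitly create their endpoint nodes, and topological_sort then yields a
dependency-respecting order.

    import networkx as nx

    g = nx.DiGraph()
    # An edge u -> v means u must come before v; adding an edge creates
    # any nodes it mentions, much like the placeholder behavior above.
    g.add_edge("B", "A")   # A depends on B
    g.add_edge("C", "A")   # A depends on C
    g.add_edge("X", "B")   # B depends on X

    print(list(nx.topological_sort(g)))
    # One valid order, e.g. ['C', 'X', 'B', 'A']; the exact order may vary.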


CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from
Python) with a "NetworkX-like API" but doesn't yet have a topological sort
(though it does have BFS)
https://github.com/rapidsai/cugraph

"pip needs a dependency resolver" + End (Fn+Right) links to the latest work
on the new pip code (that must require declarative dependency metadata)
https://github.com/pypa/pip/issues/988

That said, this implementation of topo sort could have a deterministic
output given an OrderedSet:
https://rosettacode.org/wiki/Topological_sort#Python
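
A minimal OrderedSet for that purpose could be built on a dict (sketch only;
not a full drop-in replacement for the set API):

    class OrderedSet:
        def __init__(self, iterable=()):
            self._d = dict.fromkeys(iterable)   # values are all None

        def add(self, item):
            self._d[item] = None

        def discard(self, item):
            self._d.pop(item, None)

        def __contains__(self, item):
            return item in self._d

        def __iter__(self):
            return iter(self._d)     # iteration order == insertion order

        def __len__(self):
            return len(self._d)

        def __sub__(self, other):
            return OrderedSet(x for x in self._d if x not in other)

    # Repeated runs with the same input iterate in the same order.
    s = OrderedSet(["pip", "wheel", "setuptools"])
    s.discard("wheel")
    print(list(s))   # ['pip', 'setuptools']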

A table including Big-O and memory usage for the necessary data structure
and methods would be useful.
