[Python-Dev] Re: Should set objects maintain insertion order too?

pylang Mon, 30 Dec 2019 15:15:16 -0800

A set was requested with preserved insertion order.  Mathematically, such a
structure relates more to an Oset (having order) than a Set.  See
relationships in the image below:


[image: libo.png]

Each of the mentioned structure types has a dedicated role.  Python's sets
work well and correlate with existing math principles.  I would be more
convinced to introduce a new collection to Python rather than alter
existing sets.

-1: preserving insertion-order in sets
 0: adding `OrderedSet` (where `dict.fromkeys()` does not suffice, i.e.
needing set operations)
+1: implementing an analogous abstract collection, e.g. "`HashList`" ->
Oset, just as `Counter` -> Mset

On Sun, Dec 29, 2019 at 9:00 PM Wes Turner <wes.tur...@gmail.com> wrote:

> Due to version and platform constraints, a SAT solver is necessary for
> conda and (now) pip. Libsolv is one of the fastest around.
>
>  https://github.com/QuantStack/mamba is conda implemented with libsolv
> (and parallel downloading of *declarative* dependency metadata).
>
> For general purpose graphs with implicit node instantation from edge
> declaration, NetworkX has implementations of many graph algorithms.
>
> https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_sort.html
>
>
> CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from
> Python) with a "NetworkX-like API" but doesn't yet have a topological sort
> (though it does have BFS)
> https://github.com/rapidsai/cugraph
>
> "pip needs a dependency resolver" + End (Fn+Right) links to the latest
> work on the new pip code (that must require declarative dependency metadata)
> https://github.com/pypa/pip/issues/988
>
> That said, this implementation of topo sort could have a deterministic
> output given an OrderedSet:
> https://rosettacode.org/wiki/Topological_sort#Python
>
> A table including Big-O and memory usage for the necessary data structure
> and methods would be useful.
>
> On Sun, Dec 29, 2019, 5:12 PM Tim Peters <tim.pet...@gmail.com> wrote:
>
>> [Larry]
>> > It's a lightweight abstract dependency graph.  Its nodes are opaque,
>> > only required to be hashable.  And it doesn't require that you give it
>> > all the nodes in strict dependency order.
>> >
>> > When you add a node, you can also optionally specify
>> > dependencies, and those dependencies aren't required
>> > to be nodes that the graph has previously seen.  So you can add
>> > a node A that depends on B and C, without showing it B or C
>> > first.  When that happens it creates placeholder nodes for B
>> > and C, and naturally these nodes have no dependencies.  You
>> > can then later inform the graph that node B has dependencies
>> > on X Y and Z.
>> >
>> > (My specific use case is a package manager.  Packages are identified
>> > with unique strings.  So the nodes I give the give the graph are simply
>> > those package names.  It's pretty common for me to get a package
>> > with dependencies on packages I haven't seen yet.  Designing the
>> > graph to create placeholders let me skip making two passes over
>> > the package list, pre-creating the package objects, etc etc.  This
>> > made the code simpler and presumably faster.)
>> >
>> > The graph API maintains an externally-visible "ready" iterable of
>> > nodes.  And since B can go from ready to not-ready, it can get added
>> > to "ready" and subsequently removed.
>> >
>> > Also, when a node is satisfied, you simply remove it from the graph.
>> >  If A depends on B and C, and they all represent jobs, and B and C
>> > have no dependencies, they'll be in the "ready" iterable.  You iterate
>> > over "ready", and execute those jobs, and assuming they are
>> > successful you then remove them from the graph.  When A's
>> > dependencies are all satisfied, it'll get added to the "ready" iterable.
>> >  And naturally when B and C were removed from the graph, they were
>> > removed from the "ready" iterable too.
>>
>> OK!  You're doing a topological sort.  There are natural & simple ways
>> to do that right now that scale efficiently to very large graphs (time
>> & space linear in the sum of the number of nodes and the number of
>> dependencies).  Curiously, the difficulties are mostly driven by the
>> quality of _error_ messages you want (in case of, e.g., cycles in the
>> dependency graph).
>>
>> Lots of those implementations became deterministic "by magic" when
>> ordered dicts were introduced.  This is why:  a bare bones
>> implementation needs to associate two pieces of info with each node:
>> a list of its immediate successors, and an integer count of the number
>> of its immediate predecessors.  A dict is the natural way to implement
>> that association.
>>
>> When all the dependency info has been entered, then:
>>
>> - First all nodes are emitted whose predecessor count is 0.  Iterating
>> over the association dict once is the natural way to find them, and
>> that order is defined now.
>>
>> - Then, as each emitted node N is marked done:
>>       for child in N.successors:
>>           assert child.predcount > 0
>>           child.predcount -= 1
>>           if child.predcount == 0:
>>               emit(child)
>>
>> That's about all there is to it.  But there's no end to the cruft that
>> _can_ be added to, e.g., verify that a node is marked done no more
>> than once, or to compute & display strongly connected components if
>> not all nodes are emitted, etc.
>>
>> Ordered sets could improve this "natural" implementation if the
>> successor lists were successor sets instead, simply because then -
>> like lists - their iteration order would be defined, which can in turn
>> affect the order in which nodes are emitted in the loop just above.
>> But lists are adequate to the task, are cheaper in both time and
>> space, and can't _not_ be ordered ;-)
>>
>>
>> > Thus it's this "ready" collection that I want to a) be iterable, b) be
>> stable, and
>> > c) support fast removal.  I add and remove nodes from that set a lot,
>> which I
>> > realized would perturb the iteration order from run to run, etc etc
>> etc, and here
>> > we are.
>>
>> The sketch above (which is a bog-common way to implement topsorts)
>> doesn't build a data structure recording the final order, and, in
>> fact, never deletes anything.  You might protest that the initial
>> iteration step (scanning the association dict to find nodes with no
>> predecessors) is an expense you skip because nodes with predecessors
>> are deleted from your "ready" structure all along.  But nodes with
>> predecessors are _touched_ either way, and merely skipping over them
>> is bound to be cheaper than deleting them.
>>
>> After that initial step, no search of any kind is needed again.
>>
>> > I grant you maybe it's a weird approach.  But after two false starts,
>> where I
>> > was starting to boil the oceans with ABCs and "node is satisfied" APis
>> and
>> > separate iterator objects, I had a re-think and hit on this super
>> lightweight
>> > design.  I'm actually quite pleased with it--it's short, it has a
>> minimal API,
>> > it's readable, it was easy to get right, and it solves my problem.
>>
>> Whereas I would have started with a topsort and finished it while I
>> was sleeping ;-)  Seriously, I've written such a thing many times, but
>> never reuse the code because it's so easy to redo from scratch.  Every
>> application has _some_ unique twist, which makes a general-purpose API
>> so bloated it's harder to figure out how to use than to write custom
>> code.
>>
>> In your application, I'm guessing (but don't know) that when a package
>> name is emitted, it's possible you're not able to find the package,
>> and so the process has to stop.  That's why I inserted a "as each
>> emitted node N is marked done" step to my description.  That is, I'm
>> picturing an API that adds a (say)
>>
>>     topsorter.record_succeeded(node)
>>
>> method to say that - yes - the node was processed successfully.  Else,
>> by default, its successors (if any) must not be emitted.
>>
>> But if that isn't needed, the whole topsort order can be materialized
>> into (say) a list in one gulp, or delivered by a generator.
>>
>>
>> > Happy new year,
>>
>> And to you too! :-)
>> _______________________________________________
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-le...@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-dev@python.org/message/HRKQPTPQQNIU3DYMYUYRC7USVRJGMRFC/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/6UKXR63M6NYI3EQRXL7AOH3AWMTSNXUW/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5CSNT37JKYAY52CWRI4PRRCNC5UL7DSK/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

Reply via email to