Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

Several thoughts:

* heapq.merge() could plausibly have been in the itertools module instead of 
the heapq module.  However, since it already has a home, there is no reason to 
move it or to create duplication with an intermodule dependency.  We certainly 
don't have a rule that the itertools module must house everything that accepts 
an iterator as input and returns an iterator as output.

* The iter_merge() recipe is a bit sprawling and IMHO not a useful teaching aid 
(a primary goal of the itertools recipes), so it doesn't have much documentation 
value.  Instead, consider submitting it to the "more-itertools" project, which 
is referenced by the docs.  There it would likely be used directly rather than 
as a recipe.

* Over the next few days, I'll take a closer look at the binary tree approach 
versus a heap approach.  I like how it intrinsically avoids the overhead of the 
4-tuple while maintaining order, but I want to test how well it does with slow 
comparison keys (i.e. whether the total number of comparisons gets worse).  I'm 
not sure whether the tree needs to be rebalanced as some of the input streams 
become exhausted.  I have a vague recollection that Knuth's TAOCP deemed the 
heap approach to be algorithmically superior, but I need to dig out my books to 
verify that.  A rough sketch of the tree idea follows this bullet.
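
For concreteness, here is a minimal sketch of the tree idea as I picture it (a 
hypothetical tree_merge(), not code from any actual patch): a balanced binary 
tree of lazy two-way merges, so each element rises through roughly log2(n) 
comparisons and no wrapper tuples are needed to maintain order:

    def tree_merge(*iterables):
        # Illustrative sketch only: a balanced binary tree of lazy
        # two-way merges.
        n = len(iterables)
        if n == 0:
            return
        if n == 1:
            yield from iterables[0]
            return
        # Merge each half of the inputs recursively, then do a plain
        # two-way merge of the two results.
        a = tree_merge(*iterables[: n // 2])
        b = tree_merge(*iterables[n // 2 :])
        sentinel = object()
        x, y = next(a, sentinel), next(b, sentinel)
        while x is not sentinel and y is not sentinel:
            if y < x:                  # strict '<' keeps the merge stable
                yield y
                y = next(b, sentinel)
            else:
                yield x
                x = next(a, sentinel)
        if x is not sentinel:
            yield x
            yield from a
        if y is not sentinel:
            yield y
            yield from b

On pre-sorted inputs this produces the same output as heapq.merge(); the open 
question is how the comparison counts compare when the input streams have very 
unequal lengths or become exhausted at different times.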

* We could just build out the current heapq.merge() code to include a special 
case for exactly two input iterables.  That would speed up a common case by 
avoiding the overhead of tracking a 4-tuple for each element (a rough sketch of 
the fast path follows this bullet).  If you want to submit a PR for that, I 
would be happy to take a close look and verify the timings.
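
Roughly what I have in mind (a hypothetical sketch; the fast path here punts on 
the key and reverse arguments, which the real code would also need to handle):

    import heapq

    def merge(*iterables, key=None, reverse=False):
        # Sketch only: use a direct two-way merge when there are exactly
        # two plain inputs; otherwise fall back to the heap-based version.
        if len(iterables) == 2 and key is None and not reverse:
            return _merge_two(*iterables)
        return heapq.merge(*iterables, key=key, reverse=reverse)

    def _merge_two(a, b):
        # One comparison per output element, with no heap and no
        # per-element 4-tuple bookkeeping.
        a, b = iter(a), iter(b)
        sentinel = object()
        x, y = next(a, sentinel), next(b, sentinel)
        while x is not sentinel and y is not sentinel:
            if y < x:                  # strict '<' keeps the merge stable
                yield y
                y = next(b, sentinel)
            else:
                yield x
                x = next(a, sentinel)
        if x is not sentinel:
            yield x
            yield from a
        if y is not sentinel:
            yield y
            yield from b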

* It might be interesting to time the pure Python variants against 
"sorted(chain(a, b, c, d, e))".  It isn't quite an apples-to-apples comparison 
because the latter pulls all the data into memory and doesn't take advantage of 
the pre-segregated runs, but it might give a sense of how well merge() could do 
if we decided to go gonzo on optimizations (a quick timing harness follows this 
bullet).  Until now, we've been satisfied with letting the heap structure 
minimize the number of comparisons, and there hasn't been a focus on reducing 
the constant factor overhead.
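
For anyone who wants to try it, a quick harness along these lines would do 
(illustrative only; the run count, run sizes, and repeat count are arbitrary, 
and results will vary by machine):

    import heapq
    import random
    import timeit
    from itertools import chain

    # Five pre-sorted runs of random floats, the shape merge() expects.
    runs = [sorted(random.random() for _ in range(10_000))
            for _ in range(5)]

    t_merge = timeit.timeit(lambda: list(heapq.merge(*runs)), number=100)
    t_sorted = timeit.timeit(lambda: sorted(chain(*runs)), number=100)
    print(f'list(heapq.merge(*runs)): {t_merge:.3f} sec')
    print(f'sorted(chain(*runs)):     {t_sorted:.3f} sec')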
