[Python-ideas] Re: Set operations with Lists

Richard Higginbotham Mon, 23 Sep 2019 14:26:00 -0700
Andrew Barnert wrote:
> On Sep 22, 2019, at 15:28, Tim Peters tim.pet...@gmail.com wrote:
> > That's not by accident - the inspiration for
> > CPython's sort's basic
> > "galloping" approach was taken from this paper, which wasn't about
> > sorting at all:
> > "Adaptive Set Intersections, Unions, and Differences" (2000)
> >    Erik D. Demaine, Alejandro LÃ³pez-Ortiz, J. Ian Munro
> > That's concerned with manipulating large sets represented as sorted
> > sequences.  They noticed too that data was "lumpy" in real life, and
> > that paper was the start of a number of papers trying ways to take
> > advantage of that.
> > So if someone _really_ wants to go gonzo with this, that's a promising
> > approach.  Do note, though, that it doesn't fit naturally with
> > iterators.  "Galloping" relies on O(1) access to data at arbitrary
> > index positions; it's a fancier variant of binary search
> > I wonder if you could get any benefit by using something like a 
> > skiplist-rope
> hybrid, where you have effectively O(log n) random access rather than O(1), 
> and can only
> copy huge chunks with O(log n) large memcopies instead of 1 huge one. Would 
> those costs be
> enough to sink the benefits of galloping?
> It wouldn’t be usable in all cases, but there are many cases where your huge 
> input is
> naturally iterable in this way (a set of files to concatenate, a huge file 
> that you can
> seek in or mmap chunks of, maybe even a semi-lazy rope that you got from some 
> Rust or
> Haskell library via ctypes). If you know the size of each iterator, you can 
> build the
> first level or an approximate skiplist, and build the rest up on the fly to 
> approximate
> binary search well enough (I think) for the purpose.
> Anyway, if I really did want to do these operations on massive datasets, I’d 
> look at
> pandas+HDF5 to see what it can give me before considering implementing 
> anything from
> scratch. But it could be a fun project…
> A few further half-formed thoughts on this:
> For a single huge file treated this way, you’d probably want to look at 
> ripgrep. The
> author started off with the conventional wisdom on when to read vs. mmap vs. 
> windowed-mmap
> and quickly discovered that all of the conventional wisdom was so badly out 
> of date as to
> be useless…
> Could you get any benefit out of parallelizing the gallop seekaheads? 
> Probably not for
> 1-10 million, but we are talking about huge datasets here.
> If you have an Iterator extended with a nextn method that can iterate a large 
> chunk of
> data at a time (which a file can do if you’re treating it as a sequence of 
> bytes or
> characters, but if you’re treating it as a sequence of lines it can’t, at 
> least without a
> second pass…), you could build the same thing—but the cost is that you might 
> easily end up
> using temporary N/2 memory, which defeats most of the benefits of using 
> arbitrary
> iterables.
> C++, Swift, etc. don’t have a notion of a “bisectable Iterator”, where like a
> bidirectional Iterator it still takes O(m) time to jump forward or backward m 
> steps, but
> it only takes O(1) time to jump (approximately?) halfway between two known 
> positions. You
> could easily build that on top of anything random-access, and also on top of 
> anything
> logarithmic like a tree or skiplist, and it seems like there are useful 
> things you can do
> with it. Or maybe almost all such things are worth specifializing differently 
> for
> random-access vs. inherently-logarithmic data structures, or maybe it’s just 
> too hard to
> come up with a proper useful abstraction?
I think you've accurate described the pros and cons of the iterator approach. I 
found the binary search useful for skipping wasteful comparisons. Since I was 
dealing with file names there were definately going to be elements in both and 
small amounts in only one. I didn't mention that the files were named according 
to their unix time. If I had to do it again and make a general algorithm I 
would use a binary search to find the head (/body) and tail, even though my 
library doesn't have such now. Why find the tail? To enable parallelization.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KY5H7IUEHN76ARMK5SLK63GKGFG77ZAM/
Code of Conduct: http://python.org/psf/codeofconduct/
[Python-ideas] Re: Set operations with Lists

Reply via email to