On Sat, Oct 2, 2021 at 7:42 AM Steven D'Aprano <st...@pearwood.info> wrote:
> This half-baked idea is inspired by this thread here: > > > https://mail.python.org/archives/list/python-ideas@python.org/message/5LGWV3YLCNBVSL4QHQKJ7RPNTMWOALQA/ > > which in turn was inspired by this Stackoverflow post: > > > https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-the-rest-over-a-python-list > > Problem: today, more and more people are using Python to work with Big > Data, or at least Biggish Data. For some folks, it is not unusual to > have lists with a gigabyte or more or data. So imagine you have a list > with a hundred million items, and you need a copy of the entire list. > Easy: > > alist = [...] # 100 million items > blist = alist[:] > > And that ought to be pretty efficient too, making the slice is basically > just a copy of a bunch of pointers, plus a bit of overhead. A good > implementation should have lightning fast list slicing, close to the > speed of memcopy in C. Give or take. > No, it would also have to increment the reference count of each item (since blist owns a reference to each). That's what makes this slow. > But now suppose you want a copy of all the items except the item in > position (let's say) 1000. Here's a bad way: > > blist = alist[:] > del blist[1000] > > That's inefficient because the list has to shift almost a hundred > million items over, and I think they have to be moved one at a > time. Which is slowish, even in C. But this doesn't need to update reference counts (except of the one deleted item) and the move is done using C memmove(), which perhaps isn't as fast as memcpy() but still unbeatably fast. > Here's another way: > > blist = alist[:1000] + alist[1001:] > > That has to make a slice (copy 1000 items), a second slice (copy another > many million items), then make a third copy to concatenate them, then > garbage collect the two temporary slices. > Yeah, probably the worst way. And while you might have enough memory for double the original list > (alist and blist) you might not have enough for triple (alist, the two > slices, blist). > > There are lots of other variants we could come up with, but ultimately > we're doing lots of extra work that has to be thrown away, and that > costs time or memory or both. > That's always the case when you use Python though. You use it because it's convenient, not because it's particularly efficient. > How about if lists had an in-place method for doing a fast copy of > pointers from one list to another? We could do this: > > blist = [None]*(len(alist)-1) # Preallocate. > blist.copyfrom(alist, index=0, start=0, end=1000) > blist.copyfrom(alist, index=1000, start=1001) > > No temporary slices are needed. > > Here's a pure-Python (and therefore slow) implementation: > > def copyfrom(dest, source, index=0, start=0, end=None): > if end is None: > end = len(source) > if len(dest) - index < end - start: > raise ValueError('destination too small') > for i in range(start, end): > dest[index+i-start] = source[i] > > In pure Python, copying each item over one by one will save memory but > cost a lot of time. Doing it as a method in C should save both memory > and time. The implementation could optimize the cases where the source > is a list, and fall back on iteration for other sequence types. > > Now *personally* I don't work with a lot of Big Data, so this doesn't > really scratch any of my personal itches, but I thought I'd throw it out > there for others to tell me why it's a dumb idea :-) > At the very least it might lead to a recommendation based on which operation is implemented most efficiently. Though you should just measure it for various N. Are you actually observing that people are doing this with regular lists? Don't people working with Big Data usually use Pandas, which is built on NumPy arrays and custom data structures? -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FD7H5K5UAKBYZOOZUHTAKFIIN3XE2W56/ Code of Conduct: http://python.org/psf/codeofconduct/