[Python-ideas] Re: More efficient list copying

2021-10-05 Thread Gregory P. Smith
On Sat, Oct 2, 2021, 10:20 PM Christopher Barker wrote:

>
> But sure, if we can eliminate inefficiencies in Python standard data
> types, then why not?
>

Because the C implementation becomes hard to maintain.

All of our linear containers could benefit from non-linear implementations
in some scenarios. But these usually have downsides in others (indexing is
no longer trivially O(1), for example) and greatly add to implementation
complexity.

For linear containers, see for example Abseil's absl::cord structure, which
could be applied to any of those:
https://github.com/abseil/abseil-cpp/blob/master/absl/strings/cord.h

It'd be neat to have something like that.

But having such a thing _replace_ the simple list, str, or bytes, or
bytearray implementations itself presents practical challenges.


> I’ve lost track of the exact use case here, but maybe the proposal a while
> back for sequence views would help?
>

A memoryview-like for sequences? Yeah, that is what I'd start with as a
concept. It'll wind up becoming a sequence-cord though. :P
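
Roughly, a toy sketch of what I have in mind (the class name and API here
are purely hypothetical, not an existing or proposed interface):

from collections.abc import Sequence

class SeqView(Sequence):
    """Read-only view of a contiguous run of another sequence; no copying."""

    def __init__(self, seq, start=0, stop=None):
        self._seq = seq
        self._start = start
        self._stop = len(seq) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, i):
        if isinstance(i, slice):
            start, stop, step = i.indices(len(self))
            if step != 1:
                raise ValueError('this toy only supports step 1')
            return SeqView(self._seq, self._start + start, self._start + stop)
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self._seq[self._start + i]

# e.g. iterate over alist[1000:] without materialising the copy:
#     for item in SeqView(alist, 1000):
#         ...

A real version would also have to decide what happens when the underlying
sequence is mutated or resized, which is where the cord-like complexity
starts creeping back in.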

-gps


[Python-ideas] Re: More efficient list copying

2021-10-03 Thread Finn Mason
On Sat, Oct 2, 2021, 11:20 PM Christopher Barker wrote:

[Snip...]
>
> But sure, if we can eliminate inefficiencies in Python standard data
> types, then why not?
>

I agree. If we can eliminate inefficiencies in core Python features, that
would be great.
I don't work with this kind of thing, so I won't make any further judgement.

>


[Python-ideas] Re: More efficient list copying

2021-10-02 Thread Christopher Barker
> See the Stackoverflow post I linked to at the start of my post.
>
>
> https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-the-rest-over-a-python-list


I’m confused— that seems to be a SO post related to another ongoing thread…


> I don't know if a gigabyte of data in a list is Big Data or not,


Well, Wes McKinney of Pandas fame once defined “medium data” as too big
for memory, but not too big for a typical workstation hard disk (when we
still used hard disks). So no, 1GB is not big data. But I'm not sure what to
call “big enough to stress typical workstation memory”.

But anyway:

If you really have that much data, you probably want numpy arrays or
something (even array.arrays are far more memory efficient for basic data
types).
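
A rough illustration of the difference (exact sizes vary by platform and
Python version):

import array
import sys

n = 1_000_000
as_list = list(range(n))               # n pointers + n separate int objects
as_array = array.array('q', range(n))  # n packed 8-byte machine integers

print(sys.getsizeof(as_list))   # ~8 MB for the pointer array alone
print(sys.getsizeof(as_array))  # ~8 MB total, no per-item objects

And getsizeof doesn't even count the int objects the list points at.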

But sure, if we can eliminate inefficiencies in Python standard data
types, then why not?

I’ve lost track of the exact use case here, but maybe the proposal a while
back for sequence views would help?

We often end up making a full copy of a slice simply to iterate over it, or
… if there were a way to work with a subset of a sequence without making an
actual copy, that could help a lot in these cases.
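
For plain iteration, itertools.islice already gets you most of the way
there today. A small sketch:

from itertools import islice

alist = list(range(10_000_000))

# alist[1000:] would copy ~10 million pointers just to throw them away;
# islice walks the same items without building the intermediate list.
total = 0
for item in islice(alist, 1000, None):
    total += item

It doesn't help when you need a real sequence back (len, random access,
further slicing), which is where a sequence view would earn its keep.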

-CHB
-- 
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython


[Python-ideas] Re: More efficient list copying

2021-10-02 Thread Steven D'Aprano
On Sat, Oct 02, 2021 at 07:57:48AM -0700, Guido van Rossum wrote:

> No, it would also have to increment the reference count of each item (since
> blist owns a reference to each). That's what makes this slow.

Ahaha, of course, I forgot about the ref counting.

> > There are lots of other variants we could come up with, but ultimately
> > we're doing lots of extra work that has to be thrown away, and that
> > costs time or memory or both.
> 
> That's always the case when you use Python though. You use it because it's
> convenient, not because it's particularly efficient.

Sure, but we do try to be no more inefficient than we need to be :-)

> Are you actually observing that people are doing this with regular lists?

See the Stackoverflow post I linked to at the start of my post.

https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-the-rest-over-a-python-list

I don't know if a gigabyte of data in a list is Big Data or not, but 
it's still pretty big.


-- 
Steve


[Python-ideas] Re: More efficient list copying

2021-10-02 Thread David Mertz, Ph.D.
On Sat, Oct 2, 2021, 10:58 AM Guido van Rossum  wrote:

> Are you actually observing that people are doing this with regular lists?
> Don't people working with Big Data usually use Pandas, which is built on
> NumPy arrays and custom data structures?
>

Basically, Guido is right. Big data lives in NumPy, or Dask, or semi-big in
Pandas, etc.

A partial exception to this is data that arrives as JSON, and can therefore
have hierarchical structure. For example, a list of 100M dictionaries. That
doesn't really fit the NumPy or Pandas tabular model.

But when I work with that, I don't think I'd ever want a new collection of
"all but a few" copied to a new data structure. Generally I expect to
process it sequentially, with maybe some condition to `continue` on certain
items. This might take the form of an exclusion-index set (process
everything but items {7, 1000, }).
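
Something like this sketch, with made-up data and indices:

records = [{"id": i} for i in range(1_000_000)]  # stand-in for the big list
skip = {7, 1000, 999_999}                        # hypothetical exclusion-index set

kept = 0
for i, record in enumerate(records):
    if i in skip:
        continue      # drop the excluded items on the fly, no second list
    kept += 1         # stand-in for the real per-item processing

print(kept)  # 999997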

I suppose that if Python lists were linked lists, I might instead delete
items from the middle. But they're not, and the saving Steven suggests isn't
an order-of-magnitude or big-O improvement, so I still wouldn't be inclined
to remove a few items from big lists even if it were 2x cheaper (in time,
memory, etc.).

>


[Python-ideas] Re: More efficient list copying

2021-10-02 Thread Guido van Rossum
On Sat, Oct 2, 2021 at 7:42 AM Steven D'Aprano  wrote:

> This half-baked idea is inspired by this thread here:
>
>
> https://mail.python.org/archives/list/python-ideas@python.org/message/5LGWV3YLCNBVSL4QHQKJ7RPNTMWOALQA/
>
> which in turn was inspired by this Stackoverflow post:
>
>
> https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-the-rest-over-a-python-list
>
> Problem: today, more and more people are using Python to work with Big
> Data, or at least Biggish Data. For some folks, it is not unusual to
> have lists with a gigabyte or more of data. So imagine you have a list
> with a hundred million items, and you need a copy of the entire list.
> Easy:
>
> alist = [...]  # 100 million items
> blist = alist[:]
>
> And that ought to be pretty efficient too, since making the slice is
> basically just a copy of a bunch of pointers, plus a bit of overhead. A good
> implementation should have lightning fast list slicing, close to the
> speed of memcpy in C. Give or take.
>

No, it would also have to increment the reference count of each item (since
blist owns a reference to each). That's what makes this slow.


> But now suppose you want a copy of all the items except the item in
> position (let's say) 1000. Here's a bad way:
>
> blist = alist[:]
> del blist[1000]
>
> That's inefficient because the list has to shift almost a hundred
> million items over, and I think they have to be moved one at a
> time. Which is slowish, even in C.


But this doesn't need to update reference counts (except of the one deleted
item) and the move is done using C memmove(), which perhaps isn't as fast
as memcpy() but still unbeatably fast.


> Here's another way:
>
> blist = alist[:1000] + alist[1001:]
>
> That has to make a slice (copy 1000 items), a second slice (copy another
> many million items), then make a third copy to concatenate them, then
> garbage collect the two temporary slices.
>

Yeah, probably the worst way.

> And while you might have enough memory for double the original list
> (alist and blist) you might not have enough for triple (alist, the two
> slices, blist).
>
> There are lots of other variants we could come up with, but ultimately
> we're doing lots of extra work that has to be thrown away, and that
> costs time or memory or both.
>

That's always the case when you use Python though. You use it because it's
convenient, not because it's particularly efficient.


> How about if lists had an in-place method for doing a fast copy of
> pointers from one list to another? We could do this:
>
> blist = [None]*(len(alist)-1)  # Preallocate.
> blist.copyfrom(alist, index=0, start=0, end=1000)
> blist.copyfrom(alist, index=1000, start=1001)
>
> No temporary slices are needed.
>
> Here's a pure-Python (and therefore slow) implementation:
>
> def copyfrom(dest, source, index=0, start=0, end=None):
>     if end is None:
>         end = len(source)
>     if len(dest) - index < end - start:
>         raise ValueError('destination too small')
>     for i in range(start, end):
>         dest[index+i-start] = source[i]
>
> In pure Python, copying each item over one by one will save memory but
> cost a lot of time. Doing it as a method in C should save both memory
> and time. The implementation could optimize the cases where the source
> is a list, and fall back on iteration for other sequence types.
>
> Now *personally* I don't work with a lot of Big Data, so this doesn't
> really scratch any of my personal itches, but I thought I'd throw it out
> there for others to tell me why it's a dumb idea :-)
>

At the very least it might lead to a recommendation based on which
operation is implemented most efficiently. Though you should just measure
it for various N.
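
E.g. a quick-and-dirty harness along these lines (the variants and sizes
are only illustrative):

import timeit

setup = "alist = list(range(10_000_000))"

variants = {
    "copy + del":   "blist = alist[:]; del blist[1000]",
    "concatenate":  "blist = alist[:1000] + alist[1001:]",
    # Closest pure-Python stand-in for the proposed copyfrom();
    # note the right-hand slices are still temporary copies.
    "slice assign": ("blist = [None] * (len(alist) - 1); "
                     "blist[:1000] = alist[:1000]; "
                     "blist[1000:] = alist[1001:]"),
}

for name, stmt in variants.items():
    best = min(timeit.repeat(stmt, setup=setup, number=5, repeat=3)) / 5
    print(f"{name:12s} {best:.3f} s")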

Are you actually observing that people are doing this with regular lists?
Don't people working with Big Data usually use Pandas, which is built on
NumPy arrays and custom data structures?

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
