Re: yield from () Was: Re: weirdness with list()

2021-03-12 Thread Thomas Jollans

On 03/03/2021 01:01, Cameron Simpson wrote:
> On 02Mar2021 15:06, Larry Martell wrote:
>> I discovered something new (to me) yesterday. Was writing a unit test
>> for a generator function and I found that none of the function got
>> executed at all until I iterated on the return value.
>
> Aye. Generators are lazy - they don't run at all until you ask for a
> value.
>
> By contrast, Go's goroutines are busy - they
> commence operation as soon as invoked and run until the first yield
> (channel put, I forget how it is spelled now). This can cause excessive
> CPU utilisation, but it allows for _fast_ production of results, which
> is a primary goal in Go's design.
>
> Cheers,
> Cameron Simpson

I've been learning a bit more JavaScript recently (I know, I know, 
that's no fun) and I think that's the main practical difference between 
JavaScript's async functions, which are scheduled even if nobody awaits 
on them, and Python async functions which are just funky generators and 
therefore scheduled only when somebody awaits their result.
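
A small sketch of the Python half of that comparison, using only asyncio from the standard library. The names work, main and log are made up for illustration. Calling an async function merely builds a coroutine object; the body does not start until something awaits it (or wraps it in a task):

```python
import asyncio

log = []

async def work():
    # Runs only once the coroutine is awaited (or scheduled as a task).
    log.append("work body ran")
    return 42

async def main():
    coro = work()                    # builds a coroutine object; body not started
    log.append("coroutine created")
    return await coro                # the body of work() runs here

result = asyncio.run(main())
print(log)
print(result)
```

The order of the log entries shows that creating the coroutine did not run any of work().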




--
https://mail.python.org/mailman/listinfo/python-list


Friday Finking: following, weirdness with list()

2021-03-11 Thread dn via Python-list
The in-person version of 'Friday Finking' has been set-aside by
COVID-precautions. Here's hoping the questions asked below will
stimulate some thinking, or mild entertainment...


On 02/03/2021 03.10, Grant Edwards wrote:
> On 2021-03-01, Greg Ewing  wrote:
>> On 28/02/21 1:17 pm, Cameron Simpson wrote:
>>> [its length in bytes] is presented via the object's __len__ method,
>>
>>> BUT... It also has a __iter__ value, which like any Box iterates over
>>> the subboxes.
>>
>> You're misusing __len__ here. If an object is iterable and
>> also has a __len__, its __len__ should return the number of
>> items you would get if you iterated over it. Anything else
>> is confusing and can lead to trouble, as you found here.
> 
> That was certainly my reaction. Can you imagine the confusion if len()
> of a list returned the number of bytes required for storage instead
> of the number of elements?


Why?

Isn't one of the 'fun' things about modern* languages the
"over-loading" of operators/operations?

* ie newer than FORTRAN-IV or COBOL (or my grey hair)


Thus we have:

2 + 3  # int( 5 )

and

"2" + "3"  # "23"

...and we are quite comfortable with the dissonant 'sameness' and
'difference'.

If we can "over-load" __add__(), why not __len__()?


That said, it is confusing: what does len() mean? Are we talking about
the number of elements in a collection, or something else?

What do the docs say?

https://docs.python.org/3/library/functions.html#len talks of "the
length (the number of items) of an object". In the OP, what are the
"items" in this object/"subbox"?

https://docs.python.org/3/reference/datamodel.html covers
object.__len__(self) saying "Called to implement the built-in function
len(). Should return the length of the object, an integer >= 0." without
actually determining what "length of the object" may actually mean in
any or every context.


Here's another example/application:

If we were playing with our own custom-class to work with vectors,
should __len__() be coded to report (through len()) the number of
dimensions considered in the vector:

v = Vector( 1, 2, 3, 4 )
len( v )  # 4

...or should "len" stand for the "magnitude" of the vector, ie a
distance of 5.5 (rounded)?
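
One possible answer, sketched under assumptions: Vector here is a hypothetical class, not something from a library, and magnitude() is an invented method name. The sketch keeps len() meaning "number of components", so that it agrees with iteration, and exposes the geometric length separately:

```python
import math

class Vector:
    """Hypothetical vector class, for illustration only."""

    def __init__(self, *components):
        self.components = tuple(components)

    def __len__(self):
        # Number of dimensions, matching what iteration yields.
        return len(self.components)

    def __iter__(self):
        return iter(self.components)

    def magnitude(self):
        # Euclidean length, kept apart from len().
        return math.sqrt(sum(c * c for c in self.components))

v = Vector(1, 2, 3, 4)
print(len(v))
print(round(v.magnitude(), 1))
```

With this split, len(v) == len(list(v)) holds, and nothing that relies on len() as an item count is surprised.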


Horses for courses?

In the case of (Unicode) strings len() reports in characters, yet lists
are sized in numbers of elements, etc. Each according to what we might
call the 'unit' which should be counted.

The implicit 'confusion' (and flexibility) of over-loading precedes (and
to a degree, causes) "imagine the confusion if len() of a list returned
the number of bytes required".

That said, shouldn't we be agreeing with the statement? Should one (sort
of) class/file-structure demand that all other custom-, library-, and
'built-in'-classes report in bytes?

(but is that being proposed/demanded?)


The lengths of files are reported by the computer's
ls-command/file-manager in [M/K]-bytes!

This subject matter is a binary file/container format (MP4). Am working
on a similar container format at the moment, where the length of
sub-components may be reported in bytes (if not delineated by 'markers').

So, there are many reasons why "bytes" is a 'good' measure of length -
in this context.

Is it "misusing __len__" in a class/object designed to manipulate such
files? Hope not!
(or I'm 'in trouble' - again...)
-- 
Regards,
=dn


Re: yield from () Was: Re: weirdness with list()

2021-03-11 Thread Chris Angelico
On Fri, Mar 12, 2021 at 8:20 AM Serhiy Storchaka  wrote:
>
> 01.03.21 23:59, Cameron Simpson wrote:
> > On 28Feb2021 23:47, Alan Gauld  wrote:
> >> On 28/02/2021 00:17, Cameron Simpson wrote:
> >>> BUT... It also has a __iter__ value, which like any Box iterates over
> >>> the subboxes. For MDAT that is implemented like this:
> >>>
> >>>     def __iter__(self):
> >>>         yield from ()
> >>
> >> Sorry, a bit OT but I'm curious. I haven't seen
> >> this before:
> >>
> >> yield from ()
> >>
> >> What is it doing?
> >> What do the () represent in this context?
> >
> > It's an empty tuple. The yield from iterates over the tuple, yielding
> > zero times. There are shorter ways to write that (eg outright omitting
> > the yield), except when you're writing a generator function with only a
> > single yield statement - then you need something like that to make it a
> > generator.
>
> I was wondering which of the following variants is more efficient:
>
> def gen1():
>     yield from ()
>
> def gen2():
>     return
>     yield
>
> def gen3():
>     return iter(())
>
>
> $ python3.9 -m timeit -s 'def g(): yield from ()' 'list(g())'
> 100 loops, best of 5: 266 nsec per loop
> $ python3.9 -m timeit -s 'def g():' -s ' return' -s ' yield' 'list(g())'
> 100 loops, best of 5: 219 nsec per loop
> $ python3.9 -m timeit -s 'def g(): return iter(())' 'list(g())'
> 200 loops, best of 5: 192 nsec per loop
>

They're not identical. The first two are, I believe, equivalent (and
you could add "if False: yield" as another comparison if you care),
but the third one isn't a generator. So if all you need is an
iterator, sure, but gen3 actually isn't doing as much as the other two
are.
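
The distinction can be checked with the standard inspect module. This is a minimal sketch; gen4 is an added name for the "if False: yield" variant mentioned above:

```python
import inspect

def gen1():
    yield from ()

def gen2():
    return
    yield  # unreachable, but its presence makes gen2 a generator function

def gen3():
    return iter(())  # ordinary function that returns a tuple iterator

def gen4():
    if False:
        yield

for func in (gen1, gen2, gen3, gen4):
    # Name, whether it is a generator function, and what iterating it produces.
    print(func.__name__, inspect.isgeneratorfunction(func), list(func()))
```

All four produce nothing when iterated; only gen3 is not a generator function, so its result has no send(), throw() or close().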

ChrisA


Re: yield from () Was: Re: weirdness with list()

2021-03-11 Thread Serhiy Storchaka
01.03.21 23:59, Cameron Simpson wrote:
> On 28Feb2021 23:47, Alan Gauld  wrote:
>> On 28/02/2021 00:17, Cameron Simpson wrote:
>>> BUT... It also has a __iter__ value, which like any Box iterates over
>>> the subboxes. For MDAT that is implemented like this:
>>>
>>>     def __iter__(self):
>>>         yield from ()
>>
>> Sorry, a bit OT but I'm curious. I haven't seen
>> this before:
>>
>> yield from ()
>>
>> What is it doing?
>> What do the () represent in this context?
> 
> It's an empty tuple. The yield from iterates over the tuple, yielding 
> zero times. There are shorter ways to write that (eg outright omitting 
> the yield), except when you're writing a generator function with only a 
> single yield statement - then you need something like that to make it a 
> generator.

I was wondering which of the following variants is more efficient:

def gen1():
    yield from ()

def gen2():
    return
    yield

def gen3():
    return iter(())


$ python3.9 -m timeit -s 'def g(): yield from ()' 'list(g())'
100 loops, best of 5: 266 nsec per loop
$ python3.9 -m timeit -s 'def g():' -s ' return' -s ' yield' 'list(g())'
100 loops, best of 5: 219 nsec per loop
$ python3.9 -m timeit -s 'def g(): return iter(())' 'list(g())'
200 loops, best of 5: 192 nsec per loop



Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Cameron Simpson
On 02Mar2021 15:06, Larry Martell  wrote:
>I discovered something new (to me) yesterday. Was writing a unit test
>for a generator function and I found that none of the function got
>executed at all until I iterated on the return value.

Aye. Generators are lazy - they don't run at all until you ask for a 
value.

By contrast, Go's goroutines are busy - they
commence operation as soon as invoked and run until the first yield
(channel put, I forget how it is spelled now). This can cause excessive
CPU utilisation, but it allows for _fast_ production of results, which
is a primary goal in Go's design.
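
The laziness on the Python side can be reproduced in a few lines. This is only an illustrative sketch; numbers and calls are invented names:

```python
calls = []

def numbers():
    calls.append("body started")   # not executed when numbers() is called
    yield 1
    calls.append("resumed after first yield")
    yield 2

gen = numbers()        # only creates the generator object
before = list(calls)   # nothing has run yet
first = next(gen)      # runs the body up to the first yield
after = list(calls)

print(before, first, after)
```

This is why a breakpoint on the first line of a generator function is not hit by the call itself, only by the first next() or loop iteration.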

Cheers,
Cameron Simpson 


Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Larry Martell
On Tue, Mar 2, 2021 at 2:16 PM Chris Angelico  wrote:
>
> On Tue, Mar 2, 2021 at 5:51 AM Alan Gauld via Python-list
>  wrote:
> >
> > On 28/02/2021 00:17, Cameron Simpson wrote:
> >
> > > BUT... It also has a __iter__ value, which like any Box iterates over
> > > the subboxes. For MDAT that is implemented like this:
> > >
> > >     def __iter__(self):
> > >         yield from ()
> >
> > Sorry, a bit OT but I'm curious. I haven't seen
> > this before:
> >
> > yield from ()
> >
> > What is it doing?
> > What do the () represent in this context?
> >
>
> It's yielding all the elements in an empty tuple. Which is none of
> them, meaning that - for this simple example - iterating over the
> object will produce zero results.

I discovered something new (to me) yesterday. Was writing a unit test
for a generator function and I found that none of the function got
executed at all until I iterated on the return value. It was blowing
my mind as I was debugging the test and had a BP set in the first line
of the function but it was not hit when I called the function.


Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Marco Sulla
On Mon, 1 Mar 2021 at 19:51, Alan Gauld via Python-list
 wrote:
> Sorry, a bit OT but I'm curious. I haven't seen
> this before:
>
> yield from ()
>
> What is it doing?
> What do the () represent in this context?

It's the empty tuple.


Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Chris Angelico
On Tue, Mar 2, 2021 at 5:51 AM Alan Gauld via Python-list
 wrote:
>
> On 28/02/2021 00:17, Cameron Simpson wrote:
>
> > BUT... It also has a __iter__ value, which like any Box iterates over
> > the subboxes. For MDAT that is implemented like this:
> >
> >     def __iter__(self):
> >         yield from ()
>
> Sorry, a bit OT but I'm curious. I haven't seen
> this before:
>
> yield from ()
>
> What is it doing?
> What do the () represent in this context?
>

It's yielding all the elements in an empty tuple. Which is none of
them, meaning that - for this simple example - iterating over the
object will produce zero results.

ChrisA


Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Alan Gauld via Python-list
On 28/02/2021 23:47, Alan Gauld via Python-list wrote:
> On 28/02/2021 00:17, Cameron Simpson wrote:
> 
>> BUT... It also has a __iter__ value, which like any Box iterates over 
>> the subboxes. For MDAT that is implemented like this:
>>
>>     def __iter__(self):
>>         yield from ()
> 
> Sorry, a bit OT but I'm curious. I haven't seen
> this before:
> 
> yield from ()
> 
> What is it doing?
> What do the () represent in this context?
> 

Thanks for the replies.
I should have known better but I was thinking some
cleverness with callables and completely forgot
the empty tuple syntax. Oops!

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Chris Angelico
On Wed, Mar 3, 2021 at 8:21 AM Dieter Maurer  wrote:
>
> Alan Gauld wrote at 2021-2-28 23:47 +:
> >yield from ()
>
> "yield from iterator" is similar to "for i in iterator: yield i" (with
> special handling when data/exceptions are injected into the generator).
>
> Thus, "yield from ()" does essentially nothing with the side effect
> that the containing function is treated as generator function.
>

Another way to write the same thing is:

if False: yield None

This, too, will do nothing - in fact, it will be optimized out
completely in current versions of CPython - but, again, will force the
function to be a generator.

ChrisA


yield from () Was: Re: weirdness with list()

2021-03-02 Thread Dieter Maurer
Alan Gauld wrote at 2021-2-28 23:47 +:
>yield from ()

"yield from iterator" is similar to "for i in iterator: yield i" (with
special handling when data/exceptions are injected into the generator).

Thus, "yield from ()" does essentially nothing, with the side effect
that the containing function is treated as a generator function.
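
The "special handling" can be seen when a value is sent into the generator. In this sketch (inner, delegating and looping are invented names) the "yield from" version forwards the sent value to the inner generator, while the plain for loop does not:

```python
def inner():
    received = yield "ready"
    yield f"got {received}"

def delegating():
    # send() values are passed through to inner().
    yield from inner()

def looping():
    # The for loop only ever calls next() on inner(), so sent values are lost.
    for item in inner():
        yield item

g = delegating()
next(g)                          # "ready"
via_yield_from = g.send("hello")

h = looping()
next(h)                          # "ready"
via_for_loop = h.send("hello")

print(via_yield_from)
print(via_for_loop)
```

For plain iteration, as in the empty-tuple case, the two forms behave the same.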



--
Dieter


Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Peter Otten

On 01/03/2021 00:47, Alan Gauld via Python-list wrote:
> On 28/02/2021 00:17, Cameron Simpson wrote:
>
>> BUT... It also has a __iter__ value, which like any Box iterates over
>> the subboxes. For MDAT that is implemented like this:
>>
>>     def __iter__(self):
>>         yield from ()
>
> Sorry, a bit OT but I'm curious. I haven't seen
> this before:
>
> yield from ()
>
> What is it doing?
> What do the () represent in this context?

The 0-tuple ;)

    yield from x

is syntactic sugar for

    for item in x:
        yield item

Instead of () you can use any empty iterable.
If x is empty nothing is ever yielded.
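
A quick way to try this out, as a sketch with invented helper names delegate and loop:

```python
def delegate(iterable):
    yield from iterable

def loop(iterable):
    for item in iterable:
        yield item

# Several different empty iterables: each one yields nothing.
empties = [(), [], "", {}, range(0), iter(())]
results = [list(delegate(e)) for e in empties]

# For plain iteration the two spellings produce the same items.
same = list(delegate("abc")) == list(loop("abc"))

print(results)
print(same)
```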



Re: weirdness with list()

2021-03-02 Thread Cameron Simpson
On 01Mar2021 14:10, Grant Edwards  wrote:
>That was certainly my reaction. Can you imagine the confusion if len()
>of a list returned the number of bytes required for storage instead
>of the number of elements?

Yeah, well the ancestry of these classes is a binary 
deserialise/serialise base class, so __len__ _is_ the natural thing - 
the length of the object when serialised.

The conflation came when making a recursive hierarchical system to parse 
ISO14496 files (MOV, MP4). These have variable sized binary records 
which can themselves enclose other records, often an array of other 
records.

That led me down the path of making an __iter__ (not previously 
present), without considering the __len__ interaction.

I've split these things apart now, and will probably go the full step of 
not providing __iter__ at all, instead requiring things to reach for the 
.boxes attribute or a generic .subboxes() method, since not all these 
things have .boxes (depends on the record type).

The design question is answered, and I consider myself at least somewhat 
spanked. However, the primary question was about sidestepping list()'s 
preallocation feature. That is also answered.

Cheers,
Cameron Simpson 


Re: yield from () Was: Re: weirdness with list()

2021-03-02 Thread Cameron Simpson
On 28Feb2021 23:47, Alan Gauld  wrote:
>On 28/02/2021 00:17, Cameron Simpson wrote:
>> BUT... It also has a __iter__ value, which like any Box iterates over
>> the subboxes. For MDAT that is implemented like this:
>>
>>     def __iter__(self):
>>         yield from ()
>
>Sorry, a bit OT but I'm curious. I haven't seen
>this before:
>
>yield from ()
>
>What is it doing?
>What do the () represent in this context?

It's an empty tuple. The yield from iterates over the tuple, yielding 
zero times. There are shorter ways to write that (eg outright omitting 
the yield), except when you're writing a generator function with only a 
single yield statement - then you need something like that to make it a 
generator.

Cheers,
Cameron Simpson 


yield from () Was: Re: weirdness with list()

2021-03-01 Thread Alan Gauld via Python-list
On 28/02/2021 00:17, Cameron Simpson wrote:

> BUT... It also has a __iter__ value, which like any Box iterates over 
> the subboxes. For MDAT that is implemented like this:
> 
>     def __iter__(self):
>         yield from ()

Sorry, a bit OT but I'm curious. I haven't seen
this before:

yield from ()

What is it doing?
What do the () represent in this context?

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: weirdness with list()

2021-03-01 Thread Grant Edwards
On 2021-03-01, Greg Ewing  wrote:
> On 28/02/21 1:17 pm, Cameron Simpson wrote:
>> [its length in bytes] is presented via the object's __len__ method,
>
>> BUT... It also has a __iter__ value, which like any Box iterates over
>> the subboxes.
>
> You're misusing __len__ here. If an object is iterable and
> also has a __len__, its __len__ should return the number of
> items you would get if you iterated over it. Anything else
> is confusing and can lead to trouble, as you found here.

That was certainly my reaction. Can you imagine the confusion if len()
of a list returned the number of bytes required for storage instead
of the number of elements?

>> But is there a cleaner way to do this?
>
> Yes. Give up on using __len__ to get the length in bytes,
> and provide another way to do that.





Re: weirdness with list()

2021-03-01 Thread Cameron Simpson
On 01Mar2021 00:06, MRAB  wrote:
>I'm not seeing a huge problem here:
>
>Python 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928 
>64 bit (AMD64)] on win32
>Type "help", "copyright", "credits" or "license" for more information.
>>>> import time
>>>> class A:
>...     def __len__(self):
>...         return 1024**3
>...     def __iter__(self):
>...         yield from ()
>...
>>>> a = A()
>>>> len(a)
>1073741824
>>>> s = time.time()
>>>> list(a)
>[]
>>>> print(time.time() - s)
>0.16294455528259277

3.9.1 on MacOS: 14.529589891433716
3.9.2 on MacOS: instant again

Interesting. - Cameron Simpson 


Re: weirdness with list()

2021-02-28 Thread Greg Ewing

On 28/02/21 1:17 pm, Cameron Simpson wrote:
> [its length in bytes] is presented via the object's __len__ method,

> BUT... It also has a __iter__ value, which like any Box iterates over
> the subboxes.

You're misusing __len__ here. If an object is iterable and
also has a __len__, its __len__ should return the number of
items you would get if you iterated over it. Anything else
is confusing and can lead to trouble, as you found here.

> But is there a cleaner way to do this?

Yes. Give up on using __len__ to get the length in bytes,
and provide another way to do that.
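
A minimal sketch of that separation, under assumptions: this Box class and its byte_length attribute are invented for illustration and are not the code under discussion. The byte count lives in its own attribute, and __len__ agrees with iteration:

```python
class Box:
    """Hypothetical container record, for illustration only."""

    def __init__(self, byte_length, boxes=()):
        self.byte_length = byte_length   # size of the serialised record
        self.boxes = list(boxes)         # immediate subboxes

    def __len__(self):
        # Number of items iteration would produce.
        return len(self.boxes)

    def __iter__(self):
        return iter(self.boxes)

mdat = Box(byte_length=4 * 1024**3)      # huge payload, no subboxes
moov = Box(byte_length=2048, boxes=[Box(1024), Box(512)])

print(len(mdat), list(mdat), mdat.byte_length)
print(len(moov), len(list(moov)))
```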

--
Greg



Re: weirdness with list()

2021-02-28 Thread MRAB

On 2021-02-28 23:28, Peter Otten wrote:
> On 28/02/2021 23:33, Marco Sulla wrote:
>> On Sun, 28 Feb 2021 at 01:19, Cameron Simpson wrote:
>>> My object represents an MDAT box in an MP4 file: it is the ludicrously
>>> large data box containing the raw audiovideo data; for a TV episode it
>>> is often about 2GB and a movie is often 4GB to 6GB.
>>> [...]
>>> That length is presented via the object's __len__ method
>>> [...]
>>>
>>> I noticed that it was stalling, and investigation revealed it was
>>> stalling at this line:
>>>
>>>     subboxes = list(self)
>>>
>>> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
>>> a very large __len__ value.
>>>
>>> BUT... It also has a __iter__ value, which like any Box iterates over
>>> the subboxes. For MDAT that is implemented like this:
>>>
>>>     def __iter__(self):
>>>         yield from ()
>>>
>>> What I was expecting was pretty much instant construction of an empty
>>> list. What I was getting was a very time consuming (10 seconds or more)
>>> construction of an empty list.
>>
>> I can't reproduce. Am I missing something?
>>
>> marco@buzz:~$ python3
>> Python 3.6.9 (default, Jan 26 2021, 15:33:00)
>> [GCC 8.4.0] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> class A:
>> ...     def __len__(self):
>> ...         return 1024**3
>> ...     def __iter__(self):
>> ...         yield from ()
>> ...
>> >>> a = A()
>> >>> len(a)
>> 1073741824
>> >>> list(a)
>> []
>> >>>
>>
>> It takes milliseconds to run list(a)
>
> Looks like you need at least Python 3.8 to see this. Quoting
> https://docs.python.org/3/whatsnew/3.8.html:
>
> """
> The list constructor does not overallocate the internal item buffer if
> the input iterable has a known length (the input implements __len__).
> This makes the created list 12% smaller on average. (Contributed by
> Raymond Hettinger and Pablo Galindo in bpo-33234.)
> """

I'm not seeing a huge problem here:

Python 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> class A:
...     def __len__(self):
...         return 1024**3
...     def __iter__(self):
...         yield from ()
...
>>> a = A()
>>> len(a)
1073741824
>>> s = time.time()
>>> list(a)
[]
>>> print(time.time() - s)
0.16294455528259277


Re: weirdness with list()

2021-02-28 Thread Cameron Simpson
On 01Mar2021 00:28, Peter Otten <__pete...@web.de> wrote:
>On 28/02/2021 23:33, Marco Sulla wrote:
>>I can't reproduce. Am I missing something?
>>
>>marco@buzz:~$ python3
>>Python 3.6.9 (default, Jan 26 2021, 15:33:00)
>>[GCC 8.4.0] on linux
>>Type "help", "copyright", "credits" or "license" for more information.
>>>>> class A:
>>...     def __len__(self):
>>...         return 1024**3
>>...     def __iter__(self):
>>...         yield from ()
>>...
>>>>> a = A()
>>>>> len(a)
>>1073741824
>>>>> list(a)
>>[]
>>>>>
>>
>>It takes milliseconds to run list(a)
>
>Looks like you need at least Python 3.8 to see this. Quoting
>https://docs.python.org/3/whatsnew/3.8.html:
>"""
>The list constructor does not overallocate the internal item buffer if
>the input iterable has a known length (the input implements __len__).
>This makes the created list 12% smaller on average. (Contributed by
>Raymond Hettinger and Pablo Galindo in bpo-33234.)
>"""

That may also explain why I hadn't noticed this before, eg last year.

I do kind of wish __length_hint__ overrode __len__ rather than the other 
way around, if it's doing what I think it's doing.

Cheers,
Cameron Simpson 


Re: weirdness with list()

2021-02-28 Thread Cameron Simpson
On 28Feb2021 10:51, Peter Otten <__pete...@web.de> wrote:
>On 28/02/2021 01:17, Cameron Simpson wrote:
>>I noticed that it was stalling, and investigation revealed it was
>>stalling at this line:
>>
>> subboxes = list(self)
>>
>>when doing the MDAT box. That box (a) has no subboxes at all and (b) has
>>a very large __len__ value.
[...]
>
>list(iter(self))
>
>should work, too. It may be faster than the explicit loop, but also
>defeats the list allocation optimization.

Yes, very neat. I went with [subbox for subbox in self] last night, but 
the above is better.
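
Why list(iter(self)) sidesteps the problem can be sketched as follows. The Box class here is an invented stand-in with a len_calls counter added for the demonstration. iter(box) returns a generator object, and a generator has no __len__, so the box's own __len__ cannot be consulted while the list is built:

```python
class Box:
    """Invented stand-in: __len__ reports a byte count, iteration is empty."""

    def __init__(self):
        self.len_calls = 0

    def __len__(self):
        self.len_calls += 1
        return 1024        # stand-in for a (normally huge) byte count

    def __iter__(self):
        yield from ()      # no subboxes

box = Box()
subboxes = list(iter(box))  # list() only ever sees the generator object

print(subboxes, box.len_calls)
```

Whether plain list(box) calls __len__, and what it does with the answer, depends on the interpreter version, as the timings reported in this thread suggest.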

[...]
>>Still, thoughts? I'm interested in any approaches that would have let me
>>make list() fast while keeping __len__==binary_length.
>>
>>I'm accepting that __len__ != len(__iter__) is a bad idea now, though.
>
>Indeed. I see how that train wreck happened -- but the weirdness is not
>the list behavior.

I agree. The only weirdness is that list(empty-iterable) took a very 
long time. Weirdness in the eye of the beholder I guess.

>Maybe you can capture the intended behavior of your class with two
>classes, a MyIterable without length that can be converted into MyList
>as needed.

Hmm. Maybe.

What I've done so far is:

The aforementioned [subbox for subbox in self] which I'll replace with 
your nicer one today.

Given my BinaryMixin a transcribed_length method which measures the 
length of the binary transcription. For small things that's actually 
fairly cheap, and totally general. By default it is aliased to __len__, 
which still seems a natural thing - the length of the binary object is 
the number of bytes required to serialise it.

The alias lets me override transcribed_length() for bulky things like 
MDAT where (a) transcription _is_ expensive and (b) the source data may 
not be present anyway ("skip" mode), but the measurement of the data 
from the parse is recorded.

And I can disassociate __len__ from transcribed_length() if need be in 
subclasses. I've not done that, given the iter() shuffle above.
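
A rough sketch of the arrangement described above, under stated assumptions. Only the names BinaryMixin and transcribed_length come from the message; transcribe(), SmallRecord, SkippedBulk and the "HDR:" wire format are invented for illustration. Here __len__ delegates at call time (rather than being a class-level alias) so that a subclass override of transcribed_length() is picked up:

```python
class BinaryMixin:
    """Hypothetical reconstruction, not the actual code being discussed."""

    def transcribe(self) -> bytes:
        raise NotImplementedError

    def transcribed_length(self) -> int:
        # General but potentially expensive: serialise, then measure.
        return len(self.transcribe())

    def __len__(self) -> int:
        # Delegate at call time so subclass overrides take effect.
        return self.transcribed_length()


class SmallRecord(BinaryMixin):
    def __init__(self, payload: bytes):
        self.payload = payload

    def transcribe(self) -> bytes:
        return b"HDR:" + self.payload   # invented wire format


class SkippedBulk(BinaryMixin):
    """Bulky record parsed in "skip" mode: only its length was recorded."""

    def __init__(self, recorded_length: int):
        self.recorded_length = recorded_length

    def transcribed_length(self) -> int:
        return self.recorded_length     # no transcription needed


small = SmallRecord(b"abcd")
bulk = SkippedBulk(1024**3)
print(len(small), len(bulk))
```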

Cheers,
Cameron Simpson 


Re: weirdness with list()

2021-02-28 Thread Peter Otten

On 28/02/2021 23:33, Marco Sulla wrote:
> On Sun, 28 Feb 2021 at 01:19, Cameron Simpson wrote:
>> My object represents an MDAT box in an MP4 file: it is the ludicrously
>> large data box containing the raw audiovideo data; for a TV episode it
>> is often about 2GB and a movie is often 4GB to 6GB.
>> [...]
>> That length is presented via the object's __len__ method
>> [...]
>>
>> I noticed that it was stalling, and investigation revealed it was
>> stalling at this line:
>>
>>     subboxes = list(self)
>>
>> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
>> a very large __len__ value.
>>
>> BUT... It also has a __iter__ value, which like any Box iterates over
>> the subboxes. For MDAT that is implemented like this:
>>
>>     def __iter__(self):
>>         yield from ()
>>
>> What I was expecting was pretty much instant construction of an empty
>> list. What I was getting was a very time consuming (10 seconds or more)
>> construction of an empty list.
>
> I can't reproduce. Am I missing something?
>
> marco@buzz:~$ python3
> Python 3.6.9 (default, Jan 26 2021, 15:33:00)
> [GCC 8.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> class A:
> ...     def __len__(self):
> ...         return 1024**3
> ...     def __iter__(self):
> ...         yield from ()
> ...
> >>> a = A()
> >>> len(a)
> 1073741824
> >>> list(a)
> []
> >>>
>
> It takes milliseconds to run list(a)

Looks like you need at least Python 3.8 to see this. Quoting
https://docs.python.org/3/whatsnew/3.8.html:

"""
The list constructor does not overallocate the internal item buffer if
the input iterable has a known length (the input implements __len__).
This makes the created list 12% smaller on average. (Contributed by
Raymond Hettinger and Pablo Galindo in bpo-33234.)
"""





Re: weirdness with list()

2021-02-28 Thread Marco Sulla
On Sun, 28 Feb 2021 at 01:19, Cameron Simpson  wrote:
> My object represents an MDAT box in an MP4 file: it is the ludicrously
> large data box containing the raw audiovideo data; for a TV episode it
> is often about 2GB and a movie is often 4GB to 6GB.
> [...]
> That length is presented via the object's __len__ method
> [...]
>
> I noticed that it was stalling, and investigation revealed it was
> stalling at this line:
>
> subboxes = list(self)
>
> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
> a very large __len__ value.
>
> BUT... It also has a __iter__ value, which like any Box iterates over
> the subboxes. For MDAT that is implemented like this:
>
>     def __iter__(self):
>         yield from ()
>
> What I was expecting was pretty much instant construction of an empty
> list. What I was getting was a very time consuming (10 seconds or more)
> construction of an empty list.

I can't reproduce. Am I missing something?

marco@buzz:~$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class A:
...     def __len__(self):
...         return 1024**3
...     def __iter__(self):
...         yield from ()
...
>>> a = A()
>>> len(a)
1073741824
>>> list(a)
[]
>>>

It takes milliseconds to run list(a)


Re: weirdness with list()

2021-02-28 Thread Peter Otten

On 28/02/2021 01:17, Cameron Simpson wrote:
> I just ran into a surprising (to me) issue with list() on an iterable
> object.
>
> My object represents an MDAT box in an MP4 file: it is the ludicrously
> large data box containing the raw audiovideo data; for a TV episode it
> is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons,
> I do not always want to load that into memory, or even read the data
> part at all when scanning an MP4 file, for example to recite its
> metadata.
>
> So my parser has a "skip" mode where it seeks straight past the data,
> but makes a note of its length in bytes. All good.
>
> That length is presented via the object's __len__ method, because I want
> to know that length later and this is a subclass of a suite of things
> which return their length in bytes this way.
>
> So, to my problem:
>
> I've got a walk method which traverses the hierarchy of boxes in the MP4
> file. Until some minutes ago, it looked like this:
>
>   def walk(self):
>     subboxes = list(self)
>     yield self, subboxes
>     for subbox in subboxes:
>       if isinstance(subbox, Box):
>         yield from subbox.walk()
>
> somewhat like os.walk does for a file tree.
>
> I noticed that it was stalling, and investigation revealed it was
> stalling at this line:
>
>     subboxes = list(self)
>
> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
> a very large __len__ value.
>
> BUT... It also has a __iter__ value, which like any Box iterates over
> the subboxes. For MDAT that is implemented like this:
>
>     def __iter__(self):
>         yield from ()
>
> What I was expecting was pretty much instant construction of an empty
> list. What I was getting was a very time consuming (10 seconds or more)
> construction of an empty list.
>
> I believe that this is because list() tries to preallocate storage. I
> _infer_ from the docs that this is done maybe using
> operator.length_hint, which in turn consults "the actual length of the
> object" (meaning __len__ for me?), then __length_hint__, then defaults
> to 0.
>
> I've changed my walk function like so:
>
>   def walk(self):
>     subboxes = []
>     for subbox in self:
>       subboxes.append(subbox)
>     ##subboxes = list(self)

list(iter(self))

should work, too. It may be faster than the explicit loop, but also
defeats the list allocation optimization.

> and commented out the former list(self) incantation. This is very fast,
> because it makes an empty list and then appends nothing to it. And for
> your typical movie file this is fine, because there are never _very_
> many immediate subboxes anyway.
>
> But is there a cleaner way to do this?
>
> I'd like to go back to my former list(self) incantation, and modify the
> MDAT box class to arrange something efficient. Setting __length_hint__
> didn't help: returning NotImplemented or 0 had no effect, because
> presumably __len__ was consulted first.
>
> Any suggestions? My current approach feels rather hacky.
>
> I'm already leaning towards making __len__ return the number of subboxes
> to match the iterator, especially as on reflection not all my subclasses
> are consistent about __len__ meaning the length of their binary form;
> I'm probably going to have to fix that - some subclasses are actually
> namedtuples where __len__ would be the field count. Ugh.
>
> Still, thoughts? I'm interested in any approaches that would have let me
> make list() fast while keeping __len__==binary_length.
>
> I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Indeed. I see how that train wreck happened -- but the weirdness is not
the list behavior.

Maybe you can capture the intended behavior of your class with two
classes, a MyIterable without length that can be converted into MyList
as needed.




weirdness with list()

2021-02-27 Thread Cameron Simpson
I just ran into a surprising (to me) issue with list() on an iterable 
object.

My object represents an MDAT box in an MP4 file: it is the ludicrously 
large data box containing the raw audiovideo data; for a TV episode it 
is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons, 
I do not always want to load that into memory, or even read the data 
part at all when scanning an MP4 file, for example to recite its 
metadata.

So my parser has a "skip" mode where it seeks straight past the data, 
but makes a note of its length in bytes. All good.

That length is presented via the object's __len__ method, because I want 
to know that length later and this is a subclass of a suite of things 
which return their length in bytes this way.

So, to my problem:

I've got a walk method which traverses the hierarchy of boxes in the MP4 
file. Until some minutes ago, it looked like this:

  def walk(self):
    subboxes = list(self)
    yield self, subboxes
    for subbox in subboxes:
      if isinstance(subbox, Box):
        yield from subbox.walk()

somewhat like os.walk does for a file tree.
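
For readers who want to run the traversal, here is a self-contained sketch. The Box class below is a simplified stand-in invented for this example (a name plus a list of subboxes), and the box names are merely illustrative; the walk method follows the shape shown above, using list(iter(self)) so that no __len__ is involved:

```python
class Box:
    """Simplified stand-in for a parsed container record."""

    def __init__(self, name, boxes=()):
        self.name = name
        self.boxes = list(boxes)

    def __iter__(self):
        return iter(self.boxes)

    def walk(self):
        # Pre-order traversal: yield each box with its immediate subboxes.
        subboxes = list(iter(self))
        yield self, subboxes
        for subbox in subboxes:
            if isinstance(subbox, Box):
                yield from subbox.walk()

tree = Box("moov", [Box("trak", [Box("mdia")]), Box("udta")])
visited = [box.name for box, _ in tree.walk()]
print(visited)
```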

I noticed that it was stalling, and investigation revealed it was 
stalling at this line:

subboxes = list(self)

when doing the MDAT box. That box (a) has no subboxes at all and (b) has 
a very large __len__ value.

BUT... It also has a __iter__ value, which like any Box iterates over 
the subboxes. For MDAT that is implemented like this:

    def __iter__(self):
        yield from ()

What I was expecting was pretty much instant construction of an empty 
list. What I was getting was a very time consuming (10 seconds or more) 
construction of an empty list.

I believe that this is because list() tries to preallocate storage. I 
_infer_ from the docs that this is done maybe using 
operator.length_hint, which in turn consults "the actual length of the 
object" (meaning __len__ for me?), then __length_hint__, then defaults 
to 0.
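
The lookup order of operator.length_hint itself can be checked directly. This sketch says nothing about what list() does internally; the four classes are invented for the demonstration:

```python
import operator

class OnlyHint:
    def __length_hint__(self):
        return 5

class LenAndHint:
    def __len__(self):
        return 3
    def __length_hint__(self):
        return 99          # ignored: __len__ takes precedence

class DeclinesHint:
    def __length_hint__(self):
        return NotImplemented

class Neither:
    pass

print(operator.length_hint(OnlyHint()))         # uses __length_hint__
print(operator.length_hint(LenAndHint()))       # uses __len__
print(operator.length_hint(DeclinesHint()))     # falls back to the default, 0
print(operator.length_hint(DeclinesHint(), 8))  # falls back to the given default
print(operator.length_hint(Neither()))          # default again
```

This matches the observation that a __length_hint__ cannot override an existing __len__.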

I've changed my walk function like so:

  def walk(self):
    subboxes = []
    for subbox in self:
      subboxes.append(subbox)
    ##subboxes = list(self)

and commented out the former list(self) incantation. This is very fast, 
because it makes an empty list and then appends nothing to it. And for 
your typical movie file this is fine, because there are never _very_ 
many immediate subboxes anyway.

But is there a cleaner way to do this?

I'd like to go back to my former list(self) incantation, and modify the 
MDAT box class to arrange something efficient. Setting __length_hint__ 
didn't help: returning NotImplemented or 0 had no effect, because 
presumably __len__ was consulted first.

Any suggestions? My current approach feels rather hacky.

I'm already leaning towards making __len__ return the number of subboxes 
to match the iterator, especially as on reflection not all my subclasses 
are consistent about __len__ meaning the length of their binary form; 
I'm probably going to have to fix that - some subclasses are actually 
namedtuples where __len__ would be the field count. Ugh.

Still, thoughts? I'm interested in any approaches that would have let me 
make list() fast while keeping __len__==binary_length.

I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Cheers,
Cameron Simpson 