Re: yield from () Was: Re: weirdness with list()
On 03/03/2021 01:01, Cameron Simpson wrote:
> On 02Mar2021 15:06, Larry Martell wrote:
>> I discovered something new (to me) yesterday. Was writing a unit test
>> for a generator function and I found that none of the function got
>> executed at all until I iterated on the return value.
>
> Aye. Generators are lazy - they don't run at all until you ask for a
> value.
>
> By contrast, Go's goroutines are busy - they commence operation as
> soon as invoked and run until the first yield (channel put, I forget
> how it is spelled now). This can cause excessive CPU utilisation, but
> it allows for _fast_ production of results. Which is a primary goal in
> Go's design.
>
> Cheers,
> Cameron Simpson

I've been learning a bit more JavaScript recently (I know, I know, that's no fun) and I think that's the main practical difference between JavaScript's async functions, which are scheduled even if nobody awaits them, and Python async functions, which are just funky generators and therefore scheduled only when somebody awaits their result.

-- https://mail.python.org/mailman/listinfo/python-list
Friday Finking: following, weirdness with list()
The in-person version of 'Friday Finking' has been set aside by COVID precautions. Here's hoping the questions asked below will stimulate some thinking, or mild entertainment...

On 02/03/2021 03.10, Grant Edwards wrote:
> On 2021-03-01, Greg Ewing wrote:
>> On 28/02/21 1:17 pm, Cameron Simpson wrote:
>>> [its length in bytes] is presented via the object's __len__ method,
>>
>>> BUT... It also has a __iter__ value, which like any Box iterates over
>>> the subboxes.
>>
>> You're misusing __len__ here. If an object is iterable and
>> also has a __len__, its __len__ should return the number of
>> items you would get if you iterated over it. Anything else
>> is confusing and can lead to trouble, as you found here.
>
> That was certainly my reaction. Can you imagine the confusion if len()
> of a list returned the number of bytes required for storage instead
> of the number of elements?

Why? Isn't one of the 'fun' things about modern* languages the "over-loading" of operators/operations?

* ie newer than FORTRAN-IV or COBOL (or my grey hair)

Thus we have:

    2 + 3      # int( 5 )

and

    "2" + "3"  # "23"

...and we are quite comfortable with the dissonant 'sameness' and 'difference'. If we can "over-load" __add__(), why not __len__()?

That said, it is confusing: what does len() mean? Are we talking about the number of elements in a collection, or something else? What do the docs say?

https://docs.python.org/3/library/functions.html#len talks of "the length (the number of items) of an object". In the OP, what are the "items" in this object/"subbox"?

https://docs.python.org/3/reference/datamodel.html covers object.__len__(self) saying "Called to implement the built-in function len(). Should return the length of the object, an integer >= 0." without actually determining what "length of the object" may actually mean in any or every context.

Here's another example/application: if we were playing with our own custom class to work with vectors, should __len__() be coded to report (through len()) the number of dimensions considered in the vector:

    v = Vector( 1, 2, 3, 4 )
    len( v )   # 4

...or should "len" stand for the "magnitude" of the vector, ie a distance of 5.5 (rounded)? Horses for courses?

In the case of (Unicode) strings len() reports in characters, yet lists are sized in numbers of elements, etc. Each according to what we might call the 'unit' which should be counted.

The implicit 'confusion' (and flexibility) of over-loading precedes (and to a degree, causes) "imagine the confusion if len() of a list returned the number of bytes required". That said, shouldn't we be agreeing with the statement? Should one (sort of) class/file-structure demand that all other custom-, library-, and 'built-in'-classes report in bytes? (but is that being proposed/demanded?)

The lengths of files are reported by the computer's ls-command/file-manager in [M/K]-bytes! This subject matter is a binary file/container format (MP4). Am working on a similar container format at the moment, where the length of sub-components may be reported in bytes (if not delineated by 'markers'). So, there are many reasons why "bytes" is a 'good' measure of length - in this context.

Is it "misusing __len__" in a class/object designed to manipulate such files? Hope not! (or I'm 'in trouble' - again...)

--
Regards,
=dn
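One way to keep both readings of "length" available is sketched below: __len__ reports the item count that iteration would produce (the meaning the len() docs describe), and the magnitude gets a method of its own. The Vector class and its magnitude() method are hypothetical, written only for this illustration. Note also that __len__ must return a non-negative integer, so a magnitude of about 5.5 could not be reported through len() in any case.

```python
import math


class Vector:
    """Hypothetical vector class, only to illustrate the two readings of 'length'."""

    def __init__(self, *components):
        self.components = tuple(components)

    def __iter__(self):
        return iter(self.components)

    def __len__(self):
        # Number of items that iteration yields: the number of dimensions.
        return len(self.components)

    def magnitude(self):
        # Euclidean norm, exposed under its own name instead of through len().
        return math.sqrt(sum(c * c for c in self.components))


v = Vector(1, 2, 3, 4)
dimensions = len(v)               # item count
norm = round(v.magnitude(), 1)    # sqrt(1 + 4 + 9 + 16), rounded to one place
```

With this split, len(v) always agrees with len(list(v)), which is the property Greg Ewing's advice asks for.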
Re: yield from () Was: Re: weirdness with list()
On Fri, Mar 12, 2021 at 8:20 AM Serhiy Storchaka wrote:
>
> 01.03.21 23:59, Cameron Simpson wrote:
> > On 28Feb2021 23:47, Alan Gauld wrote:
> >> On 28/02/2021 00:17, Cameron Simpson wrote:
> >>> BUT... It also has a __iter__ value, which like any Box iterates over
> >>> the subboxes. For MDAT that is implemented like this:
> >>>
> >>>     def __iter__(self):
> >>>         yield from ()
> >>
> >> Sorry, a bit OT but I'm curious. I haven't seen
> >> this before:
> >>
> >>     yield from ()
> >>
> >> What is it doing?
> >> What do the () represent in this context?
> >
> > It's an empty tuple. The yield from iterates over the tuple, yielding
> > zero times. There are shorter ways to write that (eg outright omitting
> > the yield), except when you're writing a generator function with only a
> > single yield statement - then you need something like that to make it a
> > generator.
>
> I was wondering which of the following variants is more efficient:
>
>     def gen1():
>         yield from ()
>
>     def gen2():
>         return
>         yield
>
>     def gen3():
>         return iter(())
>
>
> $ python3.9 -m timeit -s 'def g(): yield from ()' 'list(g())'
> 100 loops, best of 5: 266 nsec per loop
> $ python3.9 -m timeit -s 'def g():' -s ' return' -s ' yield' 'list(g())'
> 100 loops, best of 5: 219 nsec per loop
> $ python3.9 -m timeit -s 'def g(): return iter(())' 'list(g())'
> 200 loops, best of 5: 192 nsec per loop
>

They're not identical. The first two are, I believe, equivalent (and you could add "if False: yield" as another comparison if you care), but the third one isn't a generator. So if all you need is an iterator, sure, but gen3 actually isn't doing as much as the other two are.

ChrisA
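The distinction drawn here can be checked with the standard inspect module. The sketch below reuses the gen1/gen2/gen3 names from the message and adds a gen4 for the "if False: yield" variant; gen4 and the result variable names are made up for this illustration.

```python
import inspect


def gen1():
    yield from ()


def gen2():
    return
    yield  # unreachable, but its presence makes gen2 a generator function


def gen3():
    return iter(())  # ordinary function that returns a tuple iterator


def gen4():
    if False:
        yield  # never executed, but still marks gen4 as a generator function


funcs = (gen1, gen2, gen3, gen4)

# Is the function itself a generator function?
is_generator_function = [inspect.isgeneratorfunction(f) for f in funcs]

# Is the object returned by calling it a generator (with send/throw/close)?
returns_generator = [inspect.isgenerator(f()) for f in funcs]
has_send = [hasattr(f(), "send") for f in funcs]

# All four produce no items when iterated.
items = [list(f()) for f in funcs]
```

All four are interchangeable if the caller only iterates. They differ only for callers that rely on generator features such as send(), throw() or close().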
Re: yield from () Was: Re: weirdness with list()
01.03.21 23:59, Cameron Simpson wrote:
> On 28Feb2021 23:47, Alan Gauld wrote:
>> On 28/02/2021 00:17, Cameron Simpson wrote:
>>> BUT... It also has a __iter__ value, which like any Box iterates over
>>> the subboxes. For MDAT that is implemented like this:
>>>
>>>     def __iter__(self):
>>>         yield from ()
>>
>> Sorry, a bit OT but I'm curious. I haven't seen
>> this before:
>>
>>     yield from ()
>>
>> What is it doing?
>> What do the () represent in this context?
>
> It's an empty tuple. The yield from iterates over the tuple, yielding
> zero times. There are shorter ways to write that (eg outright omitting
> the yield), except when you're writing a generator function with only a
> single yield statement - then you need something like that to make it a
> generator.

I was wondering which of the following variants is more efficient:

    def gen1():
        yield from ()

    def gen2():
        return
        yield

    def gen3():
        return iter(())

$ python3.9 -m timeit -s 'def g(): yield from ()' 'list(g())'
100 loops, best of 5: 266 nsec per loop
$ python3.9 -m timeit -s 'def g():' -s ' return' -s ' yield' 'list(g())'
100 loops, best of 5: 219 nsec per loop
$ python3.9 -m timeit -s 'def g(): return iter(())' 'list(g())'
200 loops, best of 5: 192 nsec per loop
Re: yield from () Was: Re: weirdness with list()
On 02Mar2021 15:06, Larry Martell wrote:
>I discovered something new (to me) yesterday. Was writing a unit test
>for a generator function and I found that none of the function got
>executed at all until I iterated on the return value.

Aye. Generators are lazy - they don't run at all until you ask for a value.

By contrast, Go's goroutines are busy - they commence operation as soon as invoked and run until the first yield (channel put, I forget how it is spelled now). This can cause excessive CPU utilisation, but it allows for _fast_ production of results. Which is a primary goal in Go's design.

Cheers,
Cameron Simpson
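The laziness described above can be seen in a few lines. This is a minimal sketch: the generator function numbers() and the calls list are invented for the illustration, and recording into a list stands in for the breakpoint Larry mentions.

```python
calls = []


def numbers():
    # Nothing in this body runs until the generator is first advanced.
    calls.append("started")
    yield 1
    calls.append("resumed")
    yield 2


gen = numbers()                  # creates the generator object; body not entered
calls_after_call = list(calls)   # snapshot taken before any iteration

first = next(gen)                # runs the body up to the first yield
calls_after_next = list(calls)   # snapshot after asking for one value
```

A unit test that only calls numbers() therefore exercises none of the body. Wrapping the call in list() or a for loop is what actually runs it.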
Re: yield from () Was: Re: weirdness with list()
On Tue, Mar 2, 2021 at 2:16 PM Chris Angelico wrote:
>
> On Tue, Mar 2, 2021 at 5:51 AM Alan Gauld via Python-list wrote:
> >
> > On 28/02/2021 00:17, Cameron Simpson wrote:
> >
> > > BUT... It also has a __iter__ value, which like any Box iterates over
> > > the subboxes. For MDAT that is implemented like this:
> > >
> > >     def __iter__(self):
> > >         yield from ()
> >
> > Sorry, a bit OT but I'm curious. I haven't seen
> > this before:
> >
> >     yield from ()
> >
> > What is it doing?
> > What do the () represent in this context?
> >
>
> It's yielding all the elements in an empty tuple. Which is none of
> them, meaning that - for this simple example - iterating over the
> object will produce zero results.

I discovered something new (to me) yesterday. I was writing a unit test for a generator function and I found that none of the function got executed at all until I iterated on the return value. It was blowing my mind as I was debugging the test and had a breakpoint set on the first line of the function, but it was not hit when I called the function.
Re: yield from () Was: Re: weirdness with list()
On Mon, 1 Mar 2021 at 19:51, Alan Gauld via Python-list wrote: > Sorry, a bit OT but I'm curious. I haven't seen > this before: > > yield from () > > What is it doing? > What do the () represent in this context? It's the empty tuple. -- https://mail.python.org/mailman/listinfo/python-list
Re: yield from () Was: Re: weirdness with list()
On Tue, Mar 2, 2021 at 5:51 AM Alan Gauld via Python-list wrote: > > On 28/02/2021 00:17, Cameron Simpson wrote: > > > BUT... It also has a __iter__ value, which like any Box iterates over > > the subboxes. For MDAT that is implemented like this: > > > > def __iter__(self): > > yield from () > > Sorry, a bit OT but I'm curious. I haven't seen > this before: > > yield from () > > What is it doing? > What do the () represent in this context? > It's yielding all the elements in an empty tuple. Which is none of them, meaning that - for this simple example - iterating over the object will produce zero results. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: yield from () Was: Re: weirdness with list()
On 28/02/2021 23:47, Alan Gauld via Python-list wrote: > On 28/02/2021 00:17, Cameron Simpson wrote: > >> BUT... It also has a __iter__ value, which like any Box iterates over >> the subboxes. For MDAT that is implemented like this: >> >> def __iter__(self): >> yield from () > > Sorry, a bit OT but I'm curious. I haven't seen > this before: > > yield from () > > What is it doing? > What do the () represent in this context? > Thanks for the replies. I should have known better but I was thinking some cleverness with callables and completely forgot the empty tuple syntax. Oops! -- Alan G Author of the Learn to Program web site http://www.alan-g.me.uk/ http://www.amazon.com/author/alan_gauld Follow my photo-blog on Flickr at: http://www.flickr.com/photos/alangauldphotos -- https://mail.python.org/mailman/listinfo/python-list
Re: yield from () Was: Re: weirdness with list()
On Wed, Mar 3, 2021 at 8:21 AM Dieter Maurer wrote: > > Alan Gauld wrote at 2021-2-28 23:47 +: > >yield from () > > "yield from iterator" is similar to "for i in iterator: yield i" (with > special handling when data/exceptions are injected into the generator). > > Thus, "yield from ()" does essentially nothing with the side effect > that the containing function is treated as generator function. > Another way to write the same thing is: if False: yield None This, too, will do nothing - in fact, it will be optimized out completely in current versions of CPython - but, again, will force the function to be a generator. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
yield from () Was: Re: weirdness with list()
Alan Gauld wrote at 2021-2-28 23:47 +: >yield from () "yield from iterator" is similar to "for i in iterator: yield i" (with special handling when data/exceptions are injected into the generator). Thus, "yield from ()" does essentially nothing with the side effect that the containing function is treated as generator function. -- Dieter -- https://mail.python.org/mailman/listinfo/python-list
Re: yield from () Was: Re: weirdness with list()
On 01/03/2021 00:47, Alan Gauld via Python-list wrote: On 28/02/2021 00:17, Cameron Simpson wrote: BUT... It also has a __iter__ value, which like any Box iterates over the subboxes. For MDAT that is implemented like this: def __iter__(self): yield from () Sorry, a bit OT but I'm curious. I haven't seen this before: yield from () What is it doing? What do the () represent in this context? The 0-tuple ;) yield from x is syntactic sugar for for item in x: yield item Instead of () you can use any empty iterable. If x is empty nothing is ever yielded. -- https://mail.python.org/mailman/listinfo/python-list
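The "syntactic sugar" reading can be checked directly for plain iteration. In the sketch below the function names delegating() and looping() are made up for the illustration. As Dieter Maurer notes elsewhere in the thread, the two forms differ once values or exceptions are sent into the generator, which this example does not exercise.

```python
def delegating(x):
    yield from x


def looping(x):
    for item in x:
        yield item


# Same items, in the same order, for a non-empty iterable.
same_for_nonempty = list(delegating("abc")) == list(looping("abc"))

# Any empty iterable works in place of (): nothing is ever yielded.
empties = ((), [], "", {}, set(), range(0))
results_for_empties = [list(delegating(e)) for e in empties]
```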
Re: weirdness with list()
On 01Mar2021 14:10, Grant Edwards wrote:
>That was certainly my reaction. Can you imagine the confusion if len()
>of a list returned the number of bytes required for storage instead
>of the number of elements?

Yeah, well the ancestry of these classes is a binary deserialise/serialise base class, so __len__ _is_ the natural thing - the length of the object when serialised.

The conflation came when making a recursive hierarchical system to parse ISO14496 files (MOV, MP4). These have variable sized binary records which can themselves enclose other records, often an array of other records. That led me down the path of making an __iter__ (not previously present), without considering the __len__ interaction.

I've split these things apart now, and will probably go the full step of not providing __iter__ at all, instead requiring things to reach for the .boxes attribute or a generic .subboxes() method, since not all these things have .boxes (depends on the record type).

The design question is answered, and I consider myself at least somewhat spanked.

However, the primary question was about sidestepping list()'s preallocation feature. That is also answered.

Cheers,
Cameron Simpson
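A rough sketch of the split described above, under stated assumptions: the names transcribed_length, boxes and subboxes come from the thread, while this Box class, its constructor, the kind attribute and the sample sizes are hypothetical and are not the author's actual code. The point is only that with neither __len__ nor __iter__ defined, list() has no byte count to mistake for an item count.

```python
class Box:
    """Hypothetical stand-in for a parsed MP4 box."""

    def __init__(self, kind, byte_length, boxes=()):
        self.kind = kind
        self._byte_length = byte_length
        self.boxes = list(boxes)

    def transcribed_length(self):
        # Size of the serialised form in bytes, kept apart from len().
        return self._byte_length

    def subboxes(self):
        # Explicit accessor for child boxes instead of __iter__.
        return iter(self.boxes)

    def walk(self):
        subboxes = list(self.subboxes())
        yield self, subboxes
        for subbox in subboxes:
            yield from subbox.walk()


# Sample tree with invented sizes; the mdat box is large but has no children.
mdat = Box("mdat", 4 * 1024**3)
moov = Box("moov", 1024, [Box("mvhd", 108), Box("trak", 900)])
root = Box("file", 4 * 1024**3 + 1024 + 32, [Box("ftyp", 32), moov, mdat])

visited = [box.kind for box, _ in root.walk()]
```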
Re: yield from () Was: Re: weirdness with list()
On 28Feb2021 23:47, Alan Gauld wrote: >On 28/02/2021 00:17, Cameron Simpson wrote: >> BUT... It also has a __iter__ value, which like any Box iterates over >> the subboxes. For MDAT that is implemented like this: >> >> def __iter__(self): >> yield from () > >Sorry, a bit OT but I'm curious. I haven't seen >this before: > >yield from () > >What is it doing? >What do the () represent in this context? It's an empty tuple. The yield from iterates over the tuple, yielding zero times. There are shorter ways to write that (eg outright omitting the yield), except when you're writing a generator function with only a single yield statement - then you need something like that to make it a generator. Cheers, Cameron Simpson -- https://mail.python.org/mailman/listinfo/python-list
yield from () Was: Re: weirdness with list()
On 28/02/2021 00:17, Cameron Simpson wrote: > BUT... It also has a __iter__ value, which like any Box iterates over > the subboxes. For MDAT that is implemented like this: > > def __iter__(self): > yield from () Sorry, a bit OT but I'm curious. I haven't seen this before: yield from () What is it doing? What do the () represent in this context? -- Alan G Author of the Learn to Program web site http://www.alan-g.me.uk/ http://www.amazon.com/author/alan_gauld Follow my photo-blog on Flickr at: http://www.flickr.com/photos/alangauldphotos -- https://mail.python.org/mailman/listinfo/python-list
Re: weirdness with list()
On 2021-03-01, Greg Ewing wrote:
> On 28/02/21 1:17 pm, Cameron Simpson wrote:
>> [its length in bytes] is presented via the object's __len__ method,
>
>> BUT... It also has a __iter__ value, which like any Box iterates over
>> the subboxes.
>
> You're misusing __len__ here. If an object is iterable and
> also has a __len__, its __len__ should return the number of
> items you would get if you iterated over it. Anything else
> is confusing and can lead to trouble, as you found here.

That was certainly my reaction. Can you imagine the confusion if len() of a list returned the number of bytes required for storage instead of the number of elements?

>> But is there a cleaner way to do this?
>
> Yes. Give up on using __len__ to get the length in bytes,
> and provide another way to do that.
Re: weirdness with list()
On 01Mar2021 00:06, MRAB wrote:
>I'm not seeing a huge problem here:
>
>Python 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928
>64 bit (AMD64)] on win32
>Type "help", "copyright", "credits" or "license" for more information.
>>>> import time
>>>> class A:
>...     def __len__(self):
>...         return 1024**3
>...     def __iter__(self):
>...         yield from ()
>...
>>>> a = A()
>>>> len(a)
>1073741824
>>>> s = time.time()
>>>> list(a)
>[]
>>>> print(time.time() - s)
>0.16294455528259277

3.9.1 on MacOS: 14.529589891433716
3.9.2 on MacOS: instant again

Interesting.
- Cameron Simpson
Re: weirdness with list()
On 28/02/21 1:17 pm, Cameron Simpson wrote: [its length in bytes] is presented via the object's __len__ method, BUT... It also has a __iter__ value, which like any Box iterates over the subboxes. You're misusing __len__ here. If an object is iterable and also has a __len__, its __len__ should return the number of items you would get if you iterated over it. Anything else is confusing and can lead to trouble, as you found here. But is there a cleaner way to do this? Yes. Give up on using __len__ to get the length in bytes, and provide another way to do that. -- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: weirdness with list()
On 2021-02-28 23:28, Peter Otten wrote:
> On 28/02/2021 23:33, Marco Sulla wrote:
>> On Sun, 28 Feb 2021 at 01:19, Cameron Simpson wrote:
>>
>>> My object represents an MDAT box in an MP4 file: it is the ludicrously
>>> large data box containing the raw audiovideo data; for a TV episode it
>>> is often about 2GB and a movie is often 4GB to 6GB.
>>> [...]
>>> That length is presented via the object's __len__ method
>>> [...]
>>>
>>> I noticed that it was stalling, and investigation revealed it was
>>> stalling at this line:
>>>
>>>     subboxes = list(self)
>>>
>>> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
>>> a very large __len__ value.
>>>
>>> BUT... It also has a __iter__ value, which like any Box iterates over
>>> the subboxes. For MDAT that is implemented like this:
>>>
>>>     def __iter__(self):
>>>         yield from ()
>>>
>>> What I was expecting was pretty much instant construction of an empty
>>> list. What I was getting was a very time consuming (10 seconds or more)
>>> construction of an empty list.
>>
>> I can't reproduce. Am I missing something?
>>
>> marco@buzz:~$ python3
>> Python 3.6.9 (default, Jan 26 2021, 15:33:00)
>> [GCC 8.4.0] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> class A:
>> ...     def __len__(self):
>> ...         return 1024**3
>> ...     def __iter__(self):
>> ...         yield from ()
>> ...
>> >>> a = A()
>> >>> len(a)
>> 1073741824
>> >>> list(a)
>> []
>>
>> It takes milliseconds to run list(a)
>
> Looks like you need at least Python 3.8 to see this. Quoting
> https://docs.python.org/3/whatsnew/3.8.html:
>
> """
> The list constructor does not overallocate the internal item buffer if
> the input iterable has a known length (the input implements __len__).
> This makes the created list 12% smaller on average. (Contributed by
> Raymond Hettinger and Pablo Galindo in bpo-33234.)
> """

I'm not seeing a huge problem here:

Python 3.9.2 (tags/v3.9.2:1a79785, Feb 19 2021, 13:44:55) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> class A:
...     def __len__(self):
...         return 1024**3
...     def __iter__(self):
...         yield from ()
...
>>> a = A()
>>> len(a)
1073741824
>>> s = time.time()
>>> list(a)
[]
>>> print(time.time() - s)
0.16294455528259277
Re: weirdness with list()
On 01Mar2021 00:28, Peter Otten <__pete...@web.de> wrote:
>On 28/02/2021 23:33, Marco Sulla wrote:
>>I can't reproduce. Am I missing something?
>>
>>marco@buzz:~$ python3
>>Python 3.6.9 (default, Jan 26 2021, 15:33:00)
>>[GCC 8.4.0] on linux
>>Type "help", "copyright", "credits" or "license" for more information.
>>>>> class A:
>>...     def __len__(self):
>>...         return 1024**3
>>...     def __iter__(self):
>>...         yield from ()
>>...
>>>>> a = A()
>>>>> len(a)
>>1073741824
>>>>> list(a)
>>[]
>>
>>It takes milliseconds to run list(a)
>
>Looks like you need at least Python 3.8 to see this. Quoting
>https://docs.python.org/3/whatsnew/3.8.html:
>"""
>The list constructor does not overallocate the internal item buffer if
>the input iterable has a known length (the input implements __len__).
>This makes the created list 12% smaller on average. (Contributed by
>Raymond Hettinger and Pablo Galindo in bpo-33234.)
>"""

That may also explain why I hadn't noticed this before, eg last year.

I do kind of wish __length_hint__ overrode __len__ rather than the other way around, if it's doing what I think it's doing.

Cheers,
Cameron Simpson
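The precedence being wished away here can be observed through operator.length_hint(), which is documented to try the real length first, then __length_hint__, then a default. The three small classes below are invented for the illustration. Whether list() goes through exactly this function internally is the thread's inference and is not shown by this sketch.

```python
import operator


class Sized:
    # Has both: __len__ wins and __length_hint__ is never consulted.
    def __len__(self):
        return 1000

    def __length_hint__(self):
        return 0


class HintOnly:
    # No __len__, so the hint is used.
    def __length_hint__(self):
        return 7


class Neither:
    # Neither method: the caller's default applies.
    pass


hints = (
    operator.length_hint(Sized()),
    operator.length_hint(HintOnly()),
    operator.length_hint(Neither()),
    operator.length_hint(Neither(), 3),
)
```

This matches the observation in the original post that returning 0 from __length_hint__ made no difference while __len__ was still defined.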
Re: weirdness with list()
On 28Feb2021 10:51, Peter Otten <__pete...@web.de> wrote:
>On 28/02/2021 01:17, Cameron Simpson wrote:
>>I noticed that it was stalling, and investigation revealed it was
>>stalling at this line:
>>
>>    subboxes = list(self)
>>
>>when doing the MDAT box. That box (a) has no subboxes at all and (b) has
>>a very large __len__ value.
[...]
>
>list(iter(self))
>
>should work, too. It may be faster than the explicit loop, but also
>defeats the list allocation optimization.

Yes, very neat. I went with [subbox for subbox in self] last night, but the above is better.

[...]
>>Still, thoughts? I'm interested in any approaches that would have let me
>>make list() fast while keeping __len__==binary_length.
>>
>>I'm accepting that __len__ != len(__iter__) is a bad idea now, though.
>
>Indeed. I see how that train wreck happened -- but the weirdness is not
>the list behavior.

I agree. The only weirdness is that list(empty-iterable) took a very long time. Weirdness in the eye of the beholder I guess.

>Maybe you can capture the intended behavior of your class with two
>classes, a MyIterable without length that can be converted into MyList
>as needed.

Hmm. Maybe. What I've done so far is:

The aforementioned [subbox for subbox in self], which I'll replace with your nicer one today.

Given my BinaryMixin a transcribed_length method which measures the length of the binary transcription. For small things that's actually fairly cheap, and totally general. By default it is aliased to __len__, which still seems a natural thing - the length of the binary object is the number of bytes required to serialise it.

The alias lets me override transcribed_length() for bulky things like MDAT where (a) transcription _is_ expensive and (b) the source data may not be present anyway ("skip" mode), but the measurement of the data from the parse is recorded.

And I can disassociate __len__ from transcribed_length() if need be in subclasses. I've not done that, given the iter() shuffle above.

Cheers,
Cameron Simpson
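Why the iter() shuffle helps can be sketched as follows. The Box class here is a cut-down, hypothetical stand-in with the same two methods as the reproduction cases in the thread. The sketch deliberately never calls list(box) directly, since on the affected Python versions that is the slow, memory-hungry path.

```python
class Box:
    """Hypothetical stand-in: a huge __len__ but nothing to iterate over."""

    def __len__(self):
        return 1024**3  # byte length, not item count

    def __iter__(self):
        yield from ()


box = Box()

# iter(box) returns a plain generator, which carries no length information,
# so list() has nothing to size its buffer from.
it = iter(box)
iterator_has_len = hasattr(it, "__len__")
iterator_has_hint = hasattr(it, "__length_hint__")
via_iter = list(it)

# The comprehension mentioned in the message also only iterates.
via_comprehension = [subbox for subbox in box]
```

Both forms give up the preallocation that list() can do for honestly sized inputs, which is the trade-off Peter Otten points out.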
Re: weirdness with list()
On 28/02/2021 23:33, Marco Sulla wrote:
> On Sun, 28 Feb 2021 at 01:19, Cameron Simpson wrote:
>
>> My object represents an MDAT box in an MP4 file: it is the ludicrously
>> large data box containing the raw audiovideo data; for a TV episode it
>> is often about 2GB and a movie is often 4GB to 6GB.
>> [...]
>> That length is presented via the object's __len__ method
>> [...]
>>
>> I noticed that it was stalling, and investigation revealed it was
>> stalling at this line:
>>
>>     subboxes = list(self)
>>
>> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
>> a very large __len__ value.
>>
>> BUT... It also has a __iter__ value, which like any Box iterates over
>> the subboxes. For MDAT that is implemented like this:
>>
>>     def __iter__(self):
>>         yield from ()
>>
>> What I was expecting was pretty much instant construction of an empty
>> list. What I was getting was a very time consuming (10 seconds or more)
>> construction of an empty list.
>
> I can't reproduce. Am I missing something?
>
> marco@buzz:~$ python3
> Python 3.6.9 (default, Jan 26 2021, 15:33:00)
> [GCC 8.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> class A:
> ...     def __len__(self):
> ...         return 1024**3
> ...     def __iter__(self):
> ...         yield from ()
> ...
> >>> a = A()
> >>> len(a)
> 1073741824
> >>> list(a)
> []
>
> It takes milliseconds to run list(a)

Looks like you need at least Python 3.8 to see this. Quoting https://docs.python.org/3/whatsnew/3.8.html:

"""
The list constructor does not overallocate the internal item buffer if the input iterable has a known length (the input implements __len__). This makes the created list 12% smaller on average. (Contributed by Raymond Hettinger and Pablo Galindo in bpo-33234.)
"""
Re: weirdness with list()
On Sun, 28 Feb 2021 at 01:19, Cameron Simpson wrote:
> My object represents an MDAT box in an MP4 file: it is the ludicrously
> large data box containing the raw audiovideo data; for a TV episode it
> is often about 2GB and a movie is often 4GB to 6GB.
> [...]
> That length is presented via the object's __len__ method
> [...]
>
> I noticed that it was stalling, and investigation revealed it was
> stalling at this line:
>
>     subboxes = list(self)
>
> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
> a very large __len__ value.
>
> BUT... It also has a __iter__ value, which like any Box iterates over
> the subboxes. For MDAT that is implemented like this:
>
>     def __iter__(self):
>         yield from ()
>
> What I was expecting was pretty much instant construction of an empty
> list. What I was getting was a very time consuming (10 seconds or more)
> construction of an empty list.

I can't reproduce. Am I missing something?

marco@buzz:~$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class A:
...     def __len__(self):
...         return 1024**3
...     def __iter__(self):
...         yield from ()
...
>>> a = A()
>>> len(a)
1073741824
>>> list(a)
[]
>>>

It takes milliseconds to run list(a)
Re: weirdness with list()
On 28/02/2021 01:17, Cameron Simpson wrote:
> I just ran into a surprising (to me) issue with list() on an iterable
> object.
>
> My object represents an MDAT box in an MP4 file: it is the ludicrously
> large data box containing the raw audiovideo data; for a TV episode it
> is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons,
> I do not always want to load that into memory, or even read the data
> part at all when scanning an MP4 file, for example to recite its
> metadata.
>
> So my parser has a "skip" mode where it seeks straight past the data,
> but makes a note of its length in bytes. All good.
>
> That length is presented via the object's __len__ method, because I want
> to know that length later and this is a subclass of a suite of things
> which return their length in bytes this way.
>
> So, to my problem:
>
> I've got a walk method which traverses the hierarchy of boxes in the MP4
> file. Until some minutes ago, it looked like this:
>
>     def walk(self):
>         subboxes = list(self)
>         yield self, subboxes
>         for subbox in subboxes:
>             if isinstance(subbox, Box):
>                 yield from subbox.walk()
>
> somewhat like os.walk does for a file tree.
>
> I noticed that it was stalling, and investigation revealed it was
> stalling at this line:
>
>     subboxes = list(self)
>
> when doing the MDAT box. That box (a) has no subboxes at all and (b) has
> a very large __len__ value.
>
> BUT... It also has a __iter__ value, which like any Box iterates over
> the subboxes. For MDAT that is implemented like this:
>
>     def __iter__(self):
>         yield from ()
>
> What I was expecting was pretty much instant construction of an empty
> list. What I was getting was a very time consuming (10 seconds or more)
> construction of an empty list.
>
> I believe that this is because list() tries to preallocate storage. I
> _infer_ from the docs that this is done maybe using
> operator.length_hint, which in turn consults "the actual length of the
> object" (meaning __len__ for me?), then __length_hint__, then defaults
> to 0.
>
> I've changed my walk function like so:
>
>     def walk(self):
>         subboxes = []
>         for subbox in self:
>             subboxes.append(subbox)
>         ##subboxes = list(self)

list(iter(self))

should work, too. It may be faster than the explicit loop, but also defeats the list allocation optimization.

> and commented out the former list(self) incantation. This is very fast,
> because it makes an empty list and then appends nothing to it. And for
> your typical movie file this is fine, because there are never _very_
> many immediate subboxes anyway.
>
> But is there a cleaner way to do this?
>
> I'd like to go back to my former list(self) incantation, and modify the
> MDAT box class to arrange something efficient. Setting __length_hint__
> didn't help: returning NotImplemented or 0 had no effect, because
> presumably __len__ was consulted first.
>
> Any suggestions? My current approach feels rather hacky.
>
> I'm already leaning towards making __len__ return the number of subboxes
> to match the iterator, especially as on reflection not all my subclasses
> are consistent about __len__ meaning the length of their binary form;
> I'm probably going to have to fix that - some subclasses are actually
> namedtuples where __len__ would be the field count. Ugh.
>
> Still, thoughts? I'm interested in any approaches that would have let me
> make list() fast while keeping __len__==binary_length.
>
> I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Indeed. I see how that train wreck happened -- but the weirdness is not the list behavior.

Maybe you can capture the intended behavior of your class with two classes, a MyIterable without length that can be converted into MyList as needed.
weirdness with list()
I just ran into a surprising (to me) issue with list() on an iterable object.

My object represents an MDAT box in an MP4 file: it is the ludicrously large data box containing the raw audiovideo data; for a TV episode it is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons, I do not always want to load that into memory, or even read the data part at all when scanning an MP4 file, for example to recite its metadata.

So my parser has a "skip" mode where it seeks straight past the data, but makes a note of its length in bytes. All good.

That length is presented via the object's __len__ method, because I want to know that length later and this is a subclass of a suite of things which return their length in bytes this way.

So, to my problem:

I've got a walk method which traverses the hierarchy of boxes in the MP4 file. Until some minutes ago, it looked like this:

    def walk(self):
        subboxes = list(self)
        yield self, subboxes
        for subbox in subboxes:
            if isinstance(subbox, Box):
                yield from subbox.walk()

somewhat like os.walk does for a file tree.

I noticed that it was stalling, and investigation revealed it was stalling at this line:

    subboxes = list(self)

when doing the MDAT box. That box (a) has no subboxes at all and (b) has a very large __len__ value.

BUT... It also has a __iter__ value, which like any Box iterates over the subboxes. For MDAT that is implemented like this:

    def __iter__(self):
        yield from ()

What I was expecting was pretty much instant construction of an empty list. What I was getting was a very time consuming (10 seconds or more) construction of an empty list.

I believe that this is because list() tries to preallocate storage. I _infer_ from the docs that this is done maybe using operator.length_hint, which in turn consults "the actual length of the object" (meaning __len__ for me?), then __length_hint__, then defaults to 0.

I've changed my walk function like so:

    def walk(self):
        subboxes = []
        for subbox in self:
            subboxes.append(subbox)
        ##subboxes = list(self)

and commented out the former list(self) incantation. This is very fast, because it makes an empty list and then appends nothing to it. And for your typical movie file this is fine, because there are never _very_ many immediate subboxes anyway.

But is there a cleaner way to do this?

I'd like to go back to my former list(self) incantation, and modify the MDAT box class to arrange something efficient. Setting __length_hint__ didn't help: returning NotImplemented or 0 had no effect, because presumably __len__ was consulted first.

Any suggestions? My current approach feels rather hacky.

I'm already leaning towards making __len__ return the number of subboxes to match the iterator, especially as on reflection not all my subclasses are consistent about __len__ meaning the length of their binary form; I'm probably going to have to fix that - some subclasses are actually namedtuples where __len__ would be the field count. Ugh.

Still, thoughts? I'm interested in any approaches that would have let me make list() fast while keeping __len__==binary_length.

I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Cheers,
Cameron Simpson