On May 14, 2020, at 20:17, Stephen J. Turnbull 
<turnbull.stephen...@u.tsukuba.ac.jp> wrote:
> 
> Andrew Barnert writes:
> 
>> Students often want to know why this doesn’t work:
>>   with open("file") as f:
>>       for line in f:
>>           do_stuff(line)
>>       for line in f:
>>           do_other_stuff(line)
> 
> Sure.  *Some* students do.  I've never gotten that question from mine,
> though I do occasionally see
> 
>   with open("file") as f:
>       for line in f:        # ;-)
>           do_stuff(line)
>   with open("file") as f:
>       for line in f:
>           do_other_stuff(line)
> 
> I don't know, maybe they asked the student next to them. :-)

Or they got it off StackOverflow or Python-list or Quora or wherever. Those 
resources really do occasionally work as intended, providing answers to people 
who search without them having to ask a duplicate question. :)
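A minimal way to see the surprise without touching the filesystem (io.StringIO stands in for an opened file here, since it iterates the same way):

```python
import io

# A StringIO behaves like an opened text file: it is its own iterator.
f = io.StringIO("a\nb\nc\n")
first = [line for line in f]   # this loop consumes the iterator
second = [line for line in f]  # the iterator is already exhausted
assert first == ["a\n", "b\n", "c\n"]
assert second == []
```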

>> The answer is that files are iterators, while lists are… well,
>> there is no word.
> 
> As Chris B said, sure there are words:  File objects are *already*
> iterators, while lists are *not*.  My question is, "why isn't that
> instructive?"

Well, it’s not _completely_ uninstructive; it’s just not _sufficiently_ 
instructive.

Language is more useful when the concepts it names carve up the world in the 
same way you usually think about it.

Yes, it’s true that we can talk about “iterables that are not iterators”. But 
that doesn’t mean there’s no need for a word. We don’t technically need the 
word “liquid” because we could always talk about “condensed matter that is not 
solid” (or “fluids that are not gas”); we don’t need the word “bird” because we 
could always talk about “diapsids that are not reptiles”; etc. Theoretically, 
English could express all the same propositions and questions and so on that it 
does today without those words. But practically, it would be harder to 
communicate with. And that’s why we have the words “bird” and “liquid”. And the 
reason we don’t have a word for all diapsids except birds and turtles is that 
we don’t need to communicate about that category. 

Natural languages get there naturally; jargon sometimes needs help.

>> We shouldn’t define everything up front, just the most important
>> things. But this is one of the most important things. People need
>> to understand this distinction very early on to use Python,
> 
> No, they don't.  They neither understand, nor (to a large extent) do
> they *need* to.

> ISTM that all we need to say is that
> 
> 1.  An *iterator* is a Python object whose only necessary function is
>   to return an object when next is applied to it.  Its purpose is to
>   keep track of "next" for *for*.  (It might do other useful things
>   for the user, eg, file objects.)
> 
> 2.  The *for* statement and the *next* builtin require an iterator
>   object to work.  Since for *always* needs an iterator object, it
>   automatically converts the "in" object to an iterator implicitly.
>   (Technical note: for the convenience of implementors of 'for',
>   when iter is applied to an iterator, it always returns the
>   iterator itself.)

I think this is more complicated than people need to know, or usually learn. 
People use for loops almost from the start, but many people get by with never 
calling next. All you need is the concept “thing that can be used in a for 
loop”, which we call “iterable”. Once you know that, everything else in Python 
that loops is the same as a for loop—the inputs to zip and enumerate are 
iterables, because they get looped over.
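To make that concrete, here’s the single concept at work in every looping construct (nothing hypothetical here, just the builtins):

```python
# "Iterable" = thing that can be used in a for loop. The same things
# that work in for work everywhere Python loops for you.
names = ["spam", "eggs"]
for name in names:                    # a plain for loop
    pass
numbered = list(enumerate(names))     # enumerate loops over its input
paired = list(zip(names, range(2)))   # zip loops over both inputs
assert numbered == [(0, "spam"), (1, "eggs")]
assert paired == [("spam", 0), ("eggs", 1)]
```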

“Iterable” is the fundamental concept. Yeah, it sucks that it has such a clumsy 
word, but at least it has a word.

You don’t need the concept “iterator” here, much less need to know that looping 
uses iterables by calling iter() to get an iterator and then calling next() 
until StopIteration, until you get to the point of needing to read or write 
some code that iterates manually.

Of course you will need to learn the concept “iterator” pretty soon anyway, but 
only because Python actually gives you iterators all over the place. In a 
language (like Swift) where zip and enumerate were views, files weren’t 
iterable at all, etc., you wouldn’t need the concept “iterator” until very 
late, but in Python it shows up early. But you still don’t need to learn about 
next(); that’s as much a technical detail as the fact that they return self 
from iter(). You want to know whether they can be used in for loops—and they 
can, because (unlike in Swift) iterators are iterable, and you already 
understand that.
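That last point is easy to demonstrate: iterators pass the only test that matters at this level, “can it go in a for loop”:

```python
it = iter([1, 2, 3])
assert iter(it) is it   # an iterator's iter() is itself (the technical detail)
total = 0
for x in it:            # which means iterators work in for loops too
    total += x
assert total == 6
```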

> 3.  When a "generic" iterator "runs out", it's exhausted, it's truly
>   done.  It is no longer useful, and there's nothing you can do but
>   throw it away.  Generic iterators do not have a reset method.
>   Specialized iterators may provide one, but most do not.

Yes, this is the next thing you need to know about iterators.

But you also need to know that many iterables don’t get consumed in this way. 
Lists, ranges, dicts, etc. do _not_ run out when you use them in a for loop. 
There’s a wide range of things you use every day that can be looped over 
repeatedly. And they all act the same way—each time you loop over them, you get 
all of their contents, from start to finish.
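The contrast is easy to show side by side:

```python
r = range(3)
assert list(r) == [0, 1, 2]
assert list(r) == [0, 1, 2]   # a range gives you everything, every time

it = iter(r)                  # but an iterator over it does not
assert list(it) == [0, 1, 2]
assert list(it) == []         # exhausted after one pass
```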

That isn’t part of the Iterable protocol, or the concept underneath it. It 
can’t be, because it isn’t true for some common iterables (iterators, for a start).

People try to guess at what that concept is, and that’s where they run into 
problems. Because:

> 5.  Most Python objects are not iterators, but many can be converted.
>   However, some Python objects are constructed as iterators because
>   they want to be "lazy".  Examples are files (so that a huge file
>   can be processed line by line without reading the whole thing into
>   memory) and "generators" which yield a new item each time they are
>   called.
> 
> But AFAIK we *do* say that, and it doesn't get through.

I think many people do get this, and that’s exactly what leads to confusion. 
They think that “lazy” and “iterator” (or “consumed when you loop over it”) go 
together. But they don’t.

If you learned that “some Python objects are constructed as iterators because 
they want to be lazy”, and you know ranges are lazy, you’re liable to think 
that ranges are consumed when you loop over them, and if you know the term 
“iterator”, you’ll apply it to ranges (as so many people do—even people 
writing blog posts and StackOverflow answers).
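You can check the guess directly, which is exactly what the blog posts don’t do:

```python
from collections.abc import Iterable, Iterator

r = range(10)
assert isinstance(r, Iterable)
assert not isinstance(r, Iterator)   # lazy, yes; iterator, no
assert list(r) == list(r)            # so it is not consumed by looping

s = iter("ab")                       # a real iterator, for contrast
assert isinstance(s, Iterator)
```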

And if you think of files as _not_ lazy—because, after all, the lines do exist 
in advance on disk—then you expect them to be reusable in for loops, just like 
lists and dicts. (If you think about socket.makefile() or open('/dev/random') 
that would probably disabuse you of the notion, but how many novices are using 
those files?) You could explain this by further refining the concept of “lazy” 
to explain that files are lazy in the sense of processing, or heap usage, or 
something, not just ontological existence or whatever. But that’s pretty 
complicated. And it’s ultimately misleading, because it still gives people the 
wrong answer for ranges.

>> I can teach a child why a glass will break permanently when you hit
>> it while a lake won’t by using the words “solid” and “liquid”.
> 
> Terrible example, since a glass is just a geologically slow liquid. ;-)

No, a glass is a solid. It doesn’t flow (except in the very loose sense that 
all solids do).

And even if that factoid weren’t false, it would be a fact about physicists’ 
jargon, not about the everyday words. If I ask you to bring a fruit salad to 
the potluck and you show up with tomatoes, peas, peanuts, wheat grains, and 
eggplants but no strawberries, nobody is going to be impressed.

> Back to the discussion: the child can touch both, and does so
> frequently (assuming you don't feed them from the dog's bowl and also
> bathe them regularly).  They've seen glasses break, most likely, and
> splashed water.

And someone learning Python does get to touch both things here. They get lists, 
dicts, and ranges, and they get files, zips, and enumerate. Both categories 
come up pretty early in learning Python, just like both solids and liquids come 
up pretty early in learning to be human.

> Iterators have one overriding purpose: to be fed to *for* statements,
> be exhausted, and then discarded.  This is so important that it's done
> implicitly and in every single *for* statement.  We have the necessary
> word, "iterator," but students don't have the necessary experience of
> "touching" the iterator that *for* actually iterates over instead of
> the list that is explicit in the *for* statement.  That iterator is
> created implicitly and becomes garbage as soon as the *for* statement ends.
> And there's no way for the student to touch it, it doesn't have a
> name!

No, it’s iterables whose purpose is being fed to a for statement. Yes, 
iterators are what for statements use under the covers to deal with iterables, 
but you don’t need to learn that until well after you’ve learned that iterators 
are what you get from open and zip.

> If you want to fix nomenclature, don't call them "files," don't call
> them "file objects," call them "file iterators".  Then students have
> an everyday iterator they can touch.  I'll guarantee that causes other
> problems, though, and gets a ton of resistance.  Even from me. :-)

You don’t have to call them “file iterators”, you just have to have the word 
“iterator” lying around to teach them when they ask why they can’t loop over a 
file twice. Which we do.

In the same way, you don’t need to call lists “list iterables”, you just need 
to have the word “iterable” lying around to teach them when they ask what other 
kinds of things can go in a for loop. (As either you or Christopher said, it’s 
not a great word, but that’s another problem.)

And you don’t need to call lists “list collections”, you just need to have the 
word “collection” lying around to teach them when they ask why ranges and lists 
and dicts let you loop over their values over and over. And that’s the word we 
don’t have. Which is why people keep trying to use the word “sequence” when it 
isn’t appropriate (calling a dict a sequence is very misleading—and 
range/xrange had the same problem before 3.2), or talk about “laziness” when 
it’s the wrong concept (ranges are lazy), etc. And it’s why I used the word 
“collection” even though it’s also incorrect, and had to follow up later in 
this paragraph to clarify, because not all of these things are sized containers 
(and maybe even vice-versa?), but that’s what “collection” means in Python. 
Because we have a concept and we don’t have a word for it.

>> Yes, and defining terminology for the one distinction that almost
>> always is relevant helps distinguish that distinction from the
>> other ones that rarely come up. Most people (especially novices)
>> don’t often need to think about the distinction between iterables
>> that are sized and also containers vs. those that are not both
>> sized and containers, so the word for that doesn’t buy us much. But
>> the distinction between iterators and things-like-list-and-so-on
>> comes up earlier, and a lot more often, so a word for that would
>> buy us a lot more.
> 
> We have that word and distinction.  A file object *is* an iterator.  A
> list is *not* an iterator.  *for* works *with* iterators internally,
> and *on* iterables through the magic of __iter__.

“Not an iterator” is not a word. Of course you _can_ talk about things that 
don’t have names by being circuitous, but it’s harder. In theory, you could 
build a language out of any set of categories that carve up the world, and 
build all of the rest by composition. We don’t need the word “bird” when we 
could say “diapsids that aren’t reptiles”, or “liquid” when we could say 
“condensed matter that isn’t solid” or “fluid that isn’t gas or plasma”. Such 
a language would technically be able to discuss all the same things as 
English—but it would make communication much harder. And thinking clearly, 
too—human brains work better when the categories picked out by language are a 
rough match for the categories they need to think about than when they aren’t.

And in practice, people do need to think about “things that can be looped over 
repeatedly and give you their values over and over”, and having to say 
“iterables that are not iterators” may be technically sufficient, but 
practically it makes communication
and thought harder. It means we have to be more verbose and less to the point, 
and people make silly mistakes like the one in the parent thread, and people 
make more serious mistakes like teaching others that ranges are iterators, and 
then having to speak circuitously makes it harder to explain their mistakes to 
them.
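The circuitous phrase is at least concrete enough to spell out as a predicate. This is a rough sketch, and the name is_reiterable is mine, not Python’s; it’s a heuristic, since a pathological __iter__ could still consume state:

```python
from collections.abc import Iterable, Iterator

def is_reiterable(obj):
    """The category we have no word for: iterable, but not an iterator.
    Hypothetical helper; a rough check, not an airtight guarantee."""
    return isinstance(obj, Iterable) and not isinstance(obj, Iterator)

assert is_reiterable([1, 2])
assert is_reiterable(range(3))
assert is_reiterable({"a": 1})
assert not is_reiterable(iter([1, 2]))
assert not is_reiterable(zip("ab", "cd"))
```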

>>> But you *don't* use seek(0) on files (which are not iterators, and in
>>> fact don't actually exist inside of Python, only names for them do).
>>> You use them on opened *file objects* which are iterators.
>> A file object is a file, in the same way that a list object is a
>> list and an int object is an int.
> 
> No, it's not the same: your level of abstraction is so high that
> you've lost sight of the iterable/iterator distinction.  All of the
> latter objects own their own data in a way that a file object does
> not.  All of the latter objects are different from their iterators
> (where such iterators exist), while the file object is not.

That really is the wrong distinction, both at the novice level and at the 
Python-ideas level. You’re talking about laziness again. And while (nearly) all 
iterators are lazy, not all lazy things are iterators.

In what sense does a range own its data? It doesn’t store it anywhere; it 
creates it on demand by doing arithmetic on the things it actually does store. 
If you’re really careful you can sort of explain that one, but then in what 
sense does a dict_keys or a memoryview or an mmap “own” its data that a file 
doesn’t? And yet, they all work like lists.
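Ranges make that point vividly, because they support indexing and membership by pure arithmetic, with no stored data at all:

```python
r = range(0, 10**18, 7)        # stores only start, stop, and step
assert r[10**9] == 7 * 10**9   # indexing is arithmetic, not lookup
assert 21 in r                 # membership is arithmetic too
assert 22 not in r
```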

>> The fact that we use “file” ambiguously for a bunch of related but
>> contradictory abstractions (a stream that you can read or write, a
>> directory entry, the thing an inode points to, a document that an
>> app is working on, …) makes it a bit more confusing, but
>> unfortunately that ambiguity is forced on people before they even
>> get to their first attempt at programming, so it’s probably too
>> late for Python to help (or hurt).
> 
> Agreed.  I would be much happier if we could discuss an example that
> is *not* iterating over files but *does* come up every day on
> StackOverflow.  Maybe zips would work but I'm not sure the motivation
> comes together the way it does for files (why do zips want to be lazy?
> what are the compelling examples for zip of "restarting the iteration
> where you left off" with a new *for* statement?)

I think zips want to be lazy for exactly the same reason dict_items want to be 
lazy. People had real-life code that was wasting too much time or space 
building a list that was usually only going to be used for a single pass 
through a loop, so Python fixed that by making them lazy.

But notice that one of them is an iterator and the other isn’t. So the 
distinction between the two isn’t about laziness.

So why are zips lazy iterators instead of lazy views? I think it comes down to 
historical reasons and implementation simplicity. Designing a view for zip 
would be harder than for dict.items (see Swift for evidence) because its inputs 
are so much more general. A lot of tricky questions come up about both the API 
design and the implementation, that all have obvious answers for dict_items but 
not for zip. Meanwhile, zip was invented as itertools.izip, and itertools is… 
well, it’s right there in the name. And it was invented before Python had lots 
of other views to inspire it. So, it’s no surprise that it was an iterator. And 
even when 3.0 came along, it was a lot easier to say “let’s move izip, ifilter, 
and imap out of itertools and replace the old list-producing functions” than to 
design something entirely new. In the absence of a really compelling need for 
something entirely new, the easier option should have won out, and it did.

>> Lists, sets, ranges, dict_keys, etc. are not iterators. You can
>> write `for x in xs:` over and over and get the values over and
>> over. Because each time, you get a new iterator over their values.
> 
> You and I know that, because we know what an iterator is, and we know
> it's there because it has to be: *for* doesn't iterate anything but an
> iterator. But (except via a bytecode-level debugger) nobody has ever
> seen that iterator.  You can use iter to get a similar iterator, of
> course, but it's not the same object that any for statement ever
> used.  (Unless you explicitly created it with iter, but then you can
> re-run the for statement on it the way you do with a list.)

This is exactly why I wouldn’t explain it to a novice in terms of “for doesn’t 
iterate anything but an iterator”. Sure, you and I know that it does something 
nearly equivalent to calling iter() and then calls next() on the result until 
it receives a StopIteration, but that’s not why lists can be used in for loops; 
that’s just how Python does it. And in fact, if CPython had special-case 
opcodes for looping over old-style sequences or PySequence_Fast C sequences 
without ever creating the iterator, it wouldn’t change the visible behavior. In 
fact, under the covers, some C functions (like, IIRC, tuple.__new__) that 
accept any iterable do exactly that. It doesn’t change their observable 
behavior, so nobody needs to know.
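For the Python-ideas audience, “nearly equivalent” can be spelled out; this is the usual desugaring sketch, not CPython’s actual implementation:

```python
def for_each(iterable, body):
    """Roughly what `for x in iterable: body(x)` does under the covers."""
    it = iter(iterable)      # may return the object itself, if it's an iterator
    while True:
        try:
            x = next(it)
        except StopIteration:
            break            # the for loop ends here, swallowing the exception
        body(x)

out = []
for_each([1, 2, 3], out.append)
assert out == [1, 2, 3]
```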

Of course when talking to you, or to Python-ideas, I can count on the fact that 
you know that iterators return self from iter(), and that “like a for loop” 
means “as if calling iter() and then calling next() repeatedly until an 
exception and swallowing the exception if it’s StopIteration”, but I don’t 
expect everyone who uses Python to know all of that.

>> Files, maps, zips, generators, etc. are not like that. They’re
>> iterators. If you write `for x in xs:` twice, you get nothing the
>> second time, because each time you’re using the same iterator, and
>> you’ve already used it up. Because iter(xs) is xs when it’s a file
>> or generator etc.
> 
> Genexps are iterators, but generators (in the sense of the product of
> a def that contains "yield") are not even iterable.  Those are
> iterator factories

The word “generator” is ambiguous. The type with the name “generator” that’s 
publicly available as “types.GeneratorType” and testable with 
inspect.isgenerator and that has the attributes like gi_frame that the docs say 
all generators have, those are generator iterators. And the things testable 
with .__code__.co_flags & CO_GENERATOR, those are generator functions. They’re 
both called “generator” so often that you have to be careful to say “generator 
iterator” or “generator function” when it’s not clear from context which one 
you mean, but I think it’s pretty clear from the context “generators are 
iterators” and “if you write for x in xs:” and so on which one I meant.
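The two senses are easy to tell apart with inspect:

```python
import inspect

def gen():           # a generator *function*
    yield 1

g = gen()            # calling it produces a generator *iterator*
assert inspect.isgeneratorfunction(gen)
assert inspect.isgenerator(g)
assert iter(g) is g          # it behaves like any other iterator...
assert list(g) == [1]
assert list(g) == []         # ...including being consumed like one
```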

>> The only representation of files in Python is file objects—the
>> thing you get back from open (or socket.makefile or io.StringIO or
>> whatever else)—and those are iterators.
> 
> The thought occurred to me, "What if that was a bad decision?  Maybe
> in principle files shouldn't be iterators, but rather iterables with a
> real __iter__ that creates the iterator."  I realized that I'd already
> answered my own question in part: I find it easy to imagine cases
> where I'd want to get some lines of input from a file as a
> higher-level unit, then stop and do some processing.  The killer app
> for me is mbox files.  Another plausible case is reading top-level
> Lisp expressions from a file (although that doesn't necessarily divide
> neatly into lines.)  I also found it surprisingly complicated to think
> about the consequences to the type of making that change.

I think there’s an easier way to see why this was a good decision: because 
files have positions. (Or, if you prefer, because files are streams, which 
implies that they have positions.) We don’t have a read_at(pos, size) method, 
we have a read(size) method that reads from where you left off. Seeking does 
exist, but it’s secondary—and it works by changing where the file thinks it 
left off.

Once you think of files as things that know where they are, it makes more sense 
to wrap an iterator, rather than a reusable iterable, around them.
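You can watch the position at work (again using io.StringIO as a stand-in for an opened file):

```python
import io

f = io.StringIO("one\ntwo\nthree\n")
assert f.readline() == "one\n"     # reading advances the position
assert f.tell() > 0
rest = list(f)                     # iteration picks up where reading left off
assert rest == ["two\n", "three\n"]
f.seek(0)                          # seeking changes where the file thinks it is
assert f.readline() == "one\n"     # so reads (and iteration) start over
```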

You could argue that having a position was a bad idea in the first place, that 
Python shouldn’t have done it just because C stdio does it (and Unix kernels 
make it easy). Sure, that would mean we couldn’t use sockets and pipes as files 
and it would be weird to deal with special Unix files like /dev/random, but 
none of those things are exactly fundamental to novices. And we could even have 
two abstractions—a “stream” is what we call a file today, a “file” is a 
higher-level thing that you can randomly access, iterate repeatably, or ask for 
a stream from, and novices would only have to learn files rather than streams 
(until they have to do something like dealing with an mbox).

But nearly every other language and platform in use today does the same thing 
as Python (and C and UNIX). If you know FILE*, NSFileHandle, or whatever the 
thing is called in bash, PHP, Ruby, C#, Go, Lisp, etc., a Python file is the 
exact same thing. And vice versa. And if you need to deal with native Win32 
file handles via pywin32, they work pretty much the same way as the files you 
already know; you just have to know how to change the spelling of some of the 
functions. And so on. That’s worth a lot. 

(Plus, two abstractions is always more to learn than one.)

> Going back to the documentation theme, maybe one way to approach
> explaining iterators is to start with the use case of files as
> (non-seekable) streams, show how 'for iteration' can be "restarted"
> where you left off in the file, and teach that "this is the canonical
> behavior of iterators; lists etc are *iterable* because 'for'
> automatically converts them to iterators "behind the scenes".

I still think this is getting it backward. Iterating lists is more fundamental 
than iterating files. Possibly even iterating ranges is. And you don’t have to 
understand that it works by converting them to iterators to understand it. 

And even if you do understand that, it doesn’t really solve the problem, 
because “convert to an iterator behind the scenes” doesn’t really tell you that 
you can do that repeatedly and get independent results. Most other cases where 
Python converts something behind the scenes, like adding 2 to a float or a 
Fraction, this doesn’t matter. Nobody cares whether each time you add 2 you get 
the same 2.0 or a different one, or whether each time you write the same string 
to a text file you get the same UTF-8 bytes or a new one. Iterators probably 
aren’t the _only_ exception to that, but I’m pretty sure they’re the first one 
many people run into.
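The exceptional part is precisely that repeated conversions are independent, which takes four lines to show:

```python
xs = [1, 2, 3]
a, b = iter(xs), iter(xs)
assert a is not b      # each conversion makes a fresh, independent iterator
assert next(a) == 1
assert next(b) == 1    # b is unaffected by consuming a
```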

On the other hand, this would certainly get the notion of “files are streams” 
across to novices (as opposed to people coming from other languages) faster and 
more easily than we do today, which might help a lot of them. It might even 
turn out to solve the “why can’t I loop over this file twice” question for a 
lot of people in a different way, and that different way might be something you 
could build on to explain the difference between zip and range. “Like a stream” 
is much more accurate than “because it wants to be lazy”, and maybe easier to 
understand as well.


_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U7D5HBEFBOWJQXBCOELSXUJJL5BL5JXE/
Code of Conduct: http://python.org/psf/codeofconduct/
