Guido van Rossum wrote:
> On 8/29/06, Talin <[EMAIL PROTECTED]> wrote:
>> I've been thinking more about the iostack proposal. Right now, a
>> typical file handle consists of 3 "layers" - one representing the
>> backing store (file, memory, network, etc.), one for adding
>> buffering, and one representing the program-level API for reading
>> strings, bytes, decoded text, etc.
>>
>> I wonder if it wouldn't be better to cut that down to two.
>> Specifically, I would like to suggest eliminating the buffering
>> layer.
>>
>> My reasoning is fairly straightforward: Most file system handles,
>> network handles and other operating system handles already support
>> buffering, and they do a far better job of it than we can. The
>> handles that don't support buffering are memory streams - which
>> don't need buffering anyway.
>>
>> Of course, it would make sense for Python to provide its own
>> buffering implementation if we were going to always use the
>> lowest-level i/o API provided by the operating system, but I can't
>> see why we would want to do that. The OS knows how to allocate an
>> optimal buffer, using information such as the block size of the
>> filesystem, whereas trying to achieve this same level of
>> functionality in the Python standard library would be needlessly
>> complex IMHO.
>
> I'm not sure I follow.
>
> We *definitely* don't want to use stdio -- it's not part of the OS
> anyway, and has some annoying quirks like not giving you any insight
> into how it is using the buffer, not letting you change the buffer
> size on the fly, and crashing when you switch between read and write
> calls.
>
> So given that, how would you implement readline()? Reading one byte
> at a time until you've got the \n is definitely way too slow given
> the constant overhead of system calls.
>
> Regarding optimal buffer size, I've never seen a program for which 8K
> wasn't optimal. Larger buffers simply don't pay off.
Well, as far as readline goes: in order to split the text into lines,
you have to decode the text first anyway, which is a layer 3 operation.
You can't just read bytes until you get a \n, because the file you are
reading might be encoded in UCS-2 or something. For example, in
big-endian UCS-2 a newline is encoded as 0x00 0x0A, whereas in
little-endian UCS-2 it is 0x0A 0x00. Merely stopping at the first 0x0A
byte is incorrect: in the little-endian case you've read only half of
the newline character, and in either case a 0x0A byte can occur inside
some other character entirely. (A short demonstration of this follows
at the end of this message.)

You're correct that reading by line does require a buffer if you want
to do it efficiently. However, in a world of character encodings, the
readline buffer has to be implemented at a higher level in the IO
stack -- at the level that understands text encodings. There may be a
different set of buffers at the lower level to minimize the number of
disk I/O operations, but they can't really be the same buffer; either
that, or the text encoding layer will need fairly incestuous knowledge
of what's going on at the lower layers so that it can peek inside their
buffers. (A rough sketch of such a text-layer readline also follows
below.)

It seems to me that no matter how you slice it, you can't have an
abstract "buffering" layer that is independent of both the layer
beneath and the layer above. Both the text decoding layer and the disk
I/O layer need fairly intimate knowledge of their own buffers if you
want maximum efficiency. (I'm not opposed to a custom implementation of
buffering in the level 1 file object itself, although I suspect that in
most cases you'd be better off using what the OS or its standard
libraries provide.)

As far as stdio not giving you hints about how it is using its buffer:
I'm not sure what you mean. What kind of information would a custom
buffer implementation give you that stdio would not? If it's early
detection of \n that you are thinking of, I've already shown that that
won't work unless you assume an 8-bit encoding.

-- Talin
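
P.S. To make the encoding point concrete, here is a quick throwaway
demonstration. It uses Python 3 spellings for clarity; the utf-16
codecs are real standard library codecs, and nothing else is assumed:

    text = "spam\neggs"

    le = text.encode("utf-16-le")   # newline encoded as 0A 00

    # A byte-level readline that stops at the first 0x0A byte cuts a
    # little-endian newline in half: the trailing 0x00 stays in the
    # stream, and the next read starts in the middle of a character.
    print(le[:le.index(b"\n") + 1])   # b's\x00p\x00a\x00m\x00\n'

    # Worse, 0x0A can occur inside a character that isn't a newline at
    # all: U+010A (a capital C with a dot above) is 01 0A in big-endian
    # UTF-16.
    print("\u010a".encode("utf-16-be"))   # b'\x01\n'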
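
P.P.S. And here is a rough sketch of what I mean by putting the
readline buffer at the text layer. The TextLayer class and its shape
are made up purely for illustration -- this is not a proposal for the
actual iostack API -- but codecs.getincrementaldecoder is the real
standard library hook for stateful decoding:

    import codecs
    import io

    class TextLayer:
        """Illustrative layer-3 object: it owns a buffer of *decoded*
        text, while the raw byte stream underneath may keep its own,
        separate buffer to minimize system calls."""

        def __init__(self, raw, encoding, chunk_size=8192):
            self.raw = raw    # any object with .read(n) returning bytes
            self.decoder = codecs.getincrementaldecoder(encoding)()
            self.chunk_size = chunk_size
            self.buf = ""     # decoded characters, not bytes

        def readline(self):
            while True:
                i = self.buf.find("\n")
                if i >= 0:
                    line, self.buf = self.buf[:i + 1], self.buf[i + 1:]
                    return line
                chunk = self.raw.read(self.chunk_size)
                if not chunk:
                    # EOF: flush any partial character from the decoder.
                    line, self.buf = self.buf + self.decoder.decode(b"", True), ""
                    return line
                # The incremental decoder holds on to a multi-byte
                # sequence split across chunk boundaries, so a stray
                # 0x0A byte is never mistaken for a newline.
                self.buf += self.decoder.decode(chunk)

    f = TextLayer(io.BytesIO("spam\neggs\n".encode("utf-16-le")),
                  "utf-16-le")
    print(repr(f.readline()))   # 'spam\n'
    print(repr(f.readline()))   # 'eggs\n'
    print(repr(f.readline()))   # ''  (EOF)

Note that the line-splitting happens on decoded text, which is the
whole point: the search for \n and the readline buffer live at the
layer that understands the encoding, while self.raw is free to do
whatever byte-level buffering it likes underneath.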
