Partly as an educational exercise, and partly for its practical benefit, I'm trying to pick up a programming project from where I left off in 2001. It implemented in slightly generalised form the "buffer pair" scheme for lexical analysis described on pp. 88--92 of Aho et al., /Compilers: Principles, Techniques and Tools/ (1986). (I'm afraid I don't have a page reference for the 2007 second edition. Presumably it's also in Knuth somewhere.)
Documentation for one of the C++ header files describes it thus (but I never quite got the hang of C++, so some of the language- specific details may be very poorly conceived): "An <ipfile> object incorporates a handle to a file, opened in read-only mode, and a buffer containing (by default) raw binary data from that file. The constructor also has an option to open a file in text mode. The buffer may, optionally, consist of several segments, linked to one another in cyclic sequence. The number of segments is a constant class member, nblocks (1 <= nblocks <= 32,767). A second constant class member, block (1 <= block <= 32,767) gives the size of each of the segments in bytes. The purpose of creating a buffer in cyclically linked segments is to allow reference to the history of reading the file, even though it is being read sequentially. The bare class <ipfile> does not do this itself, but is designed so that classes derived from it may incorporate one or more pointers to parts of the buffer that have already been read (assuming these parts have not yet been overwritten). If there were only one segment, the length of available history would periodically be reduced to zero, when the buffer is re- freshed. In general, the available history occupies at least a fraction (nblocks - 1)/nblocks of a full buffer." Aho et al. describe the scheme thus (p. 90): "Two pointers to the input buffer are maintained. The string of characters between the two pointers is the current lexeme. Initially, both pointers point to the first character of the next lexeme to be found. One, called the forward pointer, scans ahead until a match for a pattern is found. Once the next lexeme is determined, the forward pointer is set to the character at its right end. After the lexeme is processed, both pointers are set to the character immediately past the lexeme." [There follows a description of the use of "sentinels" to test efficiently for pointers moving past the end of input to date.] I seem to remember (but my memory is still very hazy) that there was some annoying difficulty in coding the raw binary input file reading operation in C++ in an implementation-independent way; and I'm reluctant to go back and perhaps get bogged down again in whatever way I got bogged down before; so I would prefer to use Python for the whole thing, if possible (either using some existing library, or else by recoding it all myself in Python). Does some Python library already provide some functionality like this? (It's enough to do it with nblocks = 2, as in Aho et al.) If not, is this a reasonable thing to try to program in Python? (At the same time as learning the language, and partly as a fairly demanding exercise intended to help me to learn it.) Or should I just get my hands dirty with some C++ compiler or other, and get my original code working on my present machine (possibly in ANSI C instead of C++), and call it from Python? -- Angus Rodgers -- http://mail.python.org/mailman/listinfo/python-list