If I get multi-line strings added today, I'll have a complete tokenizer. The interior looks a lot like C minus the semicolons. (Though I did figure out that tokens that didn't come from a real number have no need for a doubleValue field. In C++ or Java all the Tokens had a doubleValue, because one of them needed it.)
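A minimal sketch of the doubleValue point, not the actual tokenizer: the Token class, the make_token helper and the "REAL" and "NAME" kind labels are all hypothetical names made up for illustration. It only shows that in Python an attribute can be attached to just the instances that need it.

```python
class Token:
    """Hypothetical token type: every token has a kind and its source text."""

    def __init__(self, kind, text):
        self.kind = kind
        self.text = text


def make_token(kind, text):
    """Build a token; only tokens made from a real number get doubleValue."""
    tok = Token(kind, text)
    if kind == "REAL":  # assumed label for real-number tokens
        tok.doubleValue = float(text)
    return tok


real = make_token("REAL", "3.5")
name = make_token("NAME", "x")
```

Here real carries a doubleValue and name does not, where a C++ or Java class would declare the field once for every Token.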
Theory: there are some problems which, by their nature, end up looking a lot like C, regardless of language. I wrote a Perl-to-HTML pretty-printer. It lets you embed HTML in comments and creates a table of contents hyperlinked to the individual functions. Output and source code are at http://www.MartinRinehart.com in the Articles section. I started with a Perl-style solution: a regex for the chars left of "#" and for the chars to the right. One line of Perl. Nice, except that it won't work when applied to itself: it splits the line at the "#" inside the regex, of course. I kludged around that, but then met "#" embedded in strings, "#" embedded in regexes passed to functions, ... I ended up marching down the input line, one character at a time, like a C program.
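A rough Python sketch of that kind of character-at-a-time march, not the pretty-printer's actual code: split_comment is a hypothetical name, and it assumes only single- and double-quoted strings with backslash escapes. It does not handle "#" inside regex literals, which the real problem also required.

```python
def split_comment(line):
    """Split a line into (code, comment) at the first '#' outside a string."""
    quote = None  # the quote character we are currently inside, or None
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # closing quote: back to plain code
        elif ch in ("'", '"'):
            quote = ch  # opening quote: '#' is ordinary text until it closes
        elif ch == "#":
            return line[:i], line[i:]
        i += 1
    return line, ""  # no comment on this line


code, comment = split_comment('print "a # b"; # real comment')
```

The loop is a small state machine with one piece of state (which quote we are inside), which is about what the C version would look like too.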