On 16/09/2010 16:37, Georg Brandl wrote:
That reminds me of the undocumented re.Scanner -- which is meant to do
exactly this.  Wouldn't it be about time to document or remove it?


There was a long discussion about this on the bug tracker (the suggestion to document it was rejected at the time).

http://bugs.python.org/issue5337

Michael Foord

Georg

Am 16.09.2010 14:02, schrieb raymond.hettinger:
Author: raymond.hettinger
Date: Thu Sep 16 14:02:17 2010
New Revision: 84847

Log:
Add tokenizer example to regex docs.

Modified:
    python/branches/py3k/Doc/library/re.rst

Modified: python/branches/py3k/Doc/library/re.rst
==============================================================================
--- python/branches/py3k/Doc/library/re.rst     (original)
+++ python/branches/py3k/Doc/library/re.rst     Thu Sep 16 14:02:17 2010
@@ -1282,3 +1282,66 @@
     <_sre.SRE_Match object at ...>
     >>>  re.match("\\\\", r"\\")
     <_sre.SRE_Match object at ...>
+
+
+Writing a Tokenizer
+^^^^^^^^^^^^^^^^^^^
+
+A `tokenizer or scanner<http://en.wikipedia.org/wiki/Lexical_analysis>`_
+analyzes a string to categorize groups of characters.  This is a useful first
+step in writing a compiler or interpreter.
+
+The text categories are specified with regular expressions.  The technique is
+to combine those into a single master regular expression and to loop over
+successive matches::
+
+    Token = collections.namedtuple('Token', 'typ value line column')
+
+    def tokenize(s):
+        tok_spec = [
+            ('NUMBER', r'\d+(.\d+)?'),  # Integer or decimal number
+            ('ASSIGN', r':='),          # Assignment operator
+            ('END', ';'),               # Statement terminator
+            ('ID', r'[A-Za-z]+'),       # Identifiers
+            ('OP', r'[+*\/\-]'),        # Arithmetic operators
+            ('NEWLINE', r'\n'),         # Line endings
+            ('SKIP', r'[ \t]'),         # Skip over spaces and tabs
+        ]
+        tok_re = '|'.join('(?P<%s>%s)' % pair for pair in tok_spec)
+        gettok = re.compile(tok_re).match
+        line = 1
+        pos = line_start = 0
+        mo = gettok(s)
+        while mo is not None:
+            typ = mo.lastgroup
+            if typ == 'NEWLINE':
+                line_start = pos
+                line += 1
+            elif typ != 'SKIP':
+                yield Token(typ, mo.group(typ), line, mo.start()-line_start)
+            pos = mo.end()
+            mo = gettok(s, pos)
+        if pos != len(s):
+            raise RuntimeError('Unexpected character %r on line %d' %(s[pos], 
line))
+
+>>>  statements = '''\
+        total := total + price * quantity;
+        tax := price * 0.05;
+    '''
+>>>  for token in tokenize(statements):
+    ...     print(token)
+    ...
+    Token(typ='ID', value='total', line=1, column=8)
+    Token(typ='ASSIGN', value=':=', line=1, column=14)
+    Token(typ='ID', value='total', line=1, column=17)
+    Token(typ='OP', value='+', line=1, column=23)
+    Token(typ='ID', value='price', line=1, column=25)
+    Token(typ='OP', value='*', line=1, column=31)
+    Token(typ='ID', value='quantity', line=1, column=33)
+    Token(typ='END', value=';', line=1, column=41)
+    Token(typ='ID', value='tax', line=2, column=9)
+    Token(typ='ASSIGN', value=':=', line=2, column=13)
+    Token(typ='ID', value='price', line=2, column=16)
+    Token(typ='OP', value='*', line=2, column=22)
+    Token(typ='NUMBER', value='0.05', line=2, column=24)
+    Token(typ='END', value=';', line=2, column=28)



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf of 
your employer, to release me from all obligations and waivers arising from any 
and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, 
clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and 
acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your 
employer, its partners, licensors, agents and assigns, in perpetuity, without 
prejudice to my ongoing rights and privileges. You further represent that you 
have the authority to release me from any BOGUS AGREEMENTS on behalf of your 
employer.


_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to