On 2013-04-04, Roy Smith <r...@panix.com> wrote:
> re.X is a pretty cool tool for making huge regexes readable.
> But, it turns out that python's auto-continuation and string
> literal concatenation rules are enough to let you get much the
> same effect. Here's a regex we use to parse haproxy log files.
> This would be utter line noise all run together. This way, it's
> almost readable :-)
>
> pattern = re.compile(r'haproxy\[(?P<pid>\d+)]: '
>                      r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
>                      r'(?P<client_port>\d{1,5}) '
>                      r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})] '
>                      r'(?P<frontend_name>\S+) '
>                      r'(?P<backend_name>\S+)/'
>                      r'(?P<server_name>\S+) '
>                      r'(?P<Tq>(-1|\d+))/'
>                      r'(?P<Tw>(-1|\d+))/'
>                      r'(?P<Tc>(-1|\d+))/'
>                      r'(?P<Tr>(-1|\d+))/'
>                      r'(?P<Tt>\+?\d+) '
>                      r'(?P<status_code>\d{3}) '
>                      r'(?P<bytes_read>\d+) '
>                      r'(?P<captured_request_cookie>\S+) '
>                      r'(?P<captured_response_cookie>\S+) '
>                      r'(?P<termination_state>[\w-]{4}) '
>                      r'(?P<actconn>\d+)/'
>                      r'(?P<feconn>\d+)/'
>                      r'(?P<beconn>\d+)/'
>                      r'(?P<srv_conn>\d+)/'
>                      r'(?P<retries>\d+) '
>                      r'(?P<srv_queue>\d+)/'
>                      r'(?P<backend_queue>\d+) '
>                      r'(\{(?P<request_id>.*?)\} )?'
>                      r'(\{(?P<captured_request_headers>.*?)\} )?'
>                      r'(\{(?P<captured_response_headers>.*?)\} )?'
>                      r'"(?P<http_request>.+)"'
>                      )
>
> And, for those of you who go running in the other direction every time
> regex is suggested as a solution, I challenge you to come up with easier
> to read (or write) code for parsing a line like this (probably
> hopelessly mangled by the time you read it):
>
> 2013-04-03T00:00:00+00:00 localhost haproxy[5199]: 10.159.19.244:57291
> [02/Apr/2013:23:59:59.811] app-nodes next-song-nodes/web8.songza.com
> 0/0/3/214/219 200 593 sessionid=NWiX5KGOdvg6dSaA
> sessionid=NWiX5KGOdvg6dSaA ---- 249/249/149/14/0 0/0
> {4C0ABFA9-515B6DEF-933229} "POST
> /api/1/station/892337/song/16024201/notify-play HTTP/1.0"
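
For comparison, here is a minimal sketch of the two styles side by side. It uses only the first three fields of the quoted pattern and a shortened copy of the sample line; the names line, concat and verbose are made up for the illustration and are not from the original post. One difference worth noticing is that re.X ignores unescaped whitespace, so the literal spaces in the log line have to be spelled out (here as [ ]), whereas the concatenated version can keep them as plain spaces.

```python
import re

# Shortened copy of the sample log line from the quoted post.
line = ('2013-04-03T00:00:00+00:00 localhost haproxy[5199]: '
        '10.159.19.244:57291 [02/Apr/2013:23:59:59.811] app-nodes '
        'next-song-nodes/web8.songza.com 0/0/3/214/219 200 593')

# Style 1: adjacent raw string literals inside the parentheses are
# joined into one pattern string at compile time.
concat = re.compile(r'haproxy\[(?P<pid>\d+)]: '
                    r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
                    r'(?P<client_port>\d{1,5}) ')

# Style 2: re.X (re.VERBOSE). Whitespace outside a character class is
# ignored and '#' starts a comment, so each literal space is written [ ].
verbose = re.compile(r'''
    haproxy\[(?P<pid>\d+)]:[ ]               # process id
    (?P<client_ip>(\d{1,3}\.){3}\d{1,3}):    # dotted-quad client address
    (?P<client_port>\d{1,5})[ ]              # client port, then a space
    ''', re.X)

# search() rather than match(): the timestamp and host name come first.
m1 = concat.search(line)
m2 = verbose.search(line)

# groupdict() returns only the named groups, keyed by name.
print(m1.groupdict())
print(m2.groupdict())
```

Both patterns should produce the same groupdict for this line; the unnamed inner group in client_ip does not appear in it.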

The big win from the above seems to me the groupdict result. The
parsing is also very simple, with virtually no nesting. It's a
good application of re.

It seems easy enough to do with str methods, but would it be an
improvement?

I ran out of time before the prototype was finished, but here's a
sketch.

import re
import datetime
import pprint

s = ('2013-04-03T00:00:00+00:00 localhost haproxy[5199]: 10.159.19.244:57291'
     ' [02/Apr/2013:23:59:59.811] app-nodes next-song-nodes/web8.songza.com'
     ' 0/0/3/214/219 200 593 sessionid=NWiX5KGOdvg6dSaA'
     ' sessionid=NWiX5KGOdvg6dSaA ---- 249/249/149/14/0 0/0'
     ' {4C0ABFA9-515B6DEF-933229}'
     ' "POST /api/1/station/892337/song/16024201/notify-play HTTP/1.0"')

def get_haproxy(s):
    prefix = 'haproxy['
    if s.startswith(prefix):
        return int(s[len(prefix):s.index(']')])
    return False

def get_client_info(s):
    ip, colon, port = s.partition(':')
    if colon != ':':
        return False
    else:
        return ip, int(port)

def get_accept_date(s):
    try:
        return datetime.datetime.strptime(s, '[%d/%b/%Y:%H:%M:%S.%f]')
    except ValueError:
        return False

def get_backend(s):
    name, slash, server = s.partition('/')
    if slash != '/':
        return False
    else:
        return name, server

def get_track_info(s):
    try:
        return s.split('/')
    except TypeError:
        return False

matchers = [
    (None, None),
    (None, 'localhost'),
    ('haproxy', get_haproxy),
    (('client_ip', 'client_port'), get_client_info),
    ('accept_date', get_accept_date),
    ('frontend_name', lambda s: s),
    (('backend_name', 'server_name'), get_backend),
    (('Tq', 'Tw', 'Tc', 'Tr', 'Tt'), get_track_info),
]

result = {}
for i, s in enumerate(s.split()):
    if i < len(matchers):  # I'm not finished writing matchers yet.
        key, matcher = matchers[i]
        if matcher is None:
            pass
        else:
            if isinstance(matcher, str):
                value = matcher == s
            else:
                value = matcher(s)
            if value is False:
                raise ValueError('Parse error {}: {} "{}"'.format(
                    key, matcher, s))
            if isinstance(key, tuple):
                result.update(zip(*[key, value]))
            elif key is not None:
                result[key] = value
pprint.pprint(result)

The engine would need to be improved in implementation and made
more flexible once it's working and tested. I think the error
handling is a good feature and the ability to customize parsing
and return custom types is cool.

--
Neil Cerutti
--
http://mail.python.org/mailman/listinfo/python-list