Hi,

I've written some (primitive) code to parse some Apache logfiles and establish whether Apache has appended a session cookie to the end of each line. We're finding that some browsers don't send the cookie, and in that case Apache doesn't log a "-" placeholder; it just omits the field.
It's working fine, except for one edge case:

Couldn't match 192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] "GET http://sekrit.com/node/175523 HTTP/1.1" 200 - "http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:31:15 +0100] "GET http://sekrit.com/node/175521 HTTP/1.1" 200 - "http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:07 +0100] "GET http://sekrit.com/node/175520 HTTP/1.1" 200 - "http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:33 +0100] "GET http://sekrit.com/node/175522 HTTP/1.1" 200 - "http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:33:01 +0100] "GET http://sekrit.com/node/175527 HTTP/1.1" 200 - "http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] "GET http://sekrit.com/search/results/ HTTP/1.0" 200 - "http://sekrit.com/search/results/"guideline%20grids"&page=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:02:15 +0100] "GET http://sekrit.com/search/results/ HTTP/1.0" 200 - "http://sekrit.com/search/results/"guideline%20grids"&page=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

If there are " characters embedded inside a quoted field (in these lines, the referrer), my regex breaks.
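To make the failure concrete, here is a minimal reproduction. It assumes the quoted-field subpattern that my full pattern uses for the request, referrer and user-agent fields, and applies it to one of the referrer values from the unmatched lines. The names QUOTED, referrer, matched and leftover are only for illustration.

```python
import re

# The quoted-field subpattern from the full pattern: a double-quoted string
# whose content may contain backslash escapes but no bare '"'.
QUOTED = r'"([^"\\]*(?:\\.[^"\\]*)*)"'

# Referrer field copied from one of the unmatched log lines.
referrer = '"http://sekrit.com/search/results/"3%2B2%20course""'

m = re.match(QUOTED, referrer)
matched = m.group(0)           # what the subpattern takes to be the whole field
leftover = referrer[m.end():]  # what is left where the pattern expects a space

print(matched)   # "http://sekrit.com/search/results/"
print(leftover)  # 3%2B2%20course""
```

The subpattern stops at the first embedded quote, so the full pattern then expects a space but finds the 3, and the whole match fails.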
Here's the code:

#!/usr/bin/env python
import re

pattern = r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) (?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) (?P<Timestamp>(\[[^\]]+\])) (?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) (?P<Status>(\S*)) (?P<Size>(\S*)) (?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) (?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)( )?(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
regex = re.compile(pattern)

lines = 0
no_cookies = 0
unmatched = 0

for line in open('/home/stephen/scratch/test-data.txt'):
    lines += 1
    line = line.strip()
    match = regex.match(line)
    if match:
        data = match.groupdict()
        if data['SiteIntelligenceCookie'] == '':
            no_cookies += 1
    else:
        print "Couldn't match ", line
        unmatched += 1

print "I analysed %s lines." % (lines,)
print "There were %s lines with missing Site Intelligence cookies." % (no_cookies,)
print "I was unable to process %s lines." % (unmatched,)

How can I make the regex a bit more resilient so it doesn't break when " " is embedded?

--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
--
http://mail.python.org/mailman/listinfo/python-list
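One direction that might work, offered as a sketch rather than a definitive fix: treat a " as the end of a field only when it is followed by a space or by the end of the line, and accept any other " as part of the field's content. This is a heuristic and an assumption about the data; it would still mis-split a field that contains a " immediately followed by a space. The group names are the ones from the pattern above, while QUOTED, IP, PATTERN, parse and the cookie value "abc123" are hypothetical names and data used only for illustration.

```python
import re

# Assumption: a '"' closes a field only if a space or the end of the line
# follows it; any other '"' is treated as ordinary content.
QUOTED = r'"(?:[^"\\]|\\.|"(?![ ]|$))*"'
IP = r'[0-9]{1,3}(?:\.[0-9]{1,3}){3}'

# Same fields, in the same order, as the original pattern.
PATTERN = '^' + ' '.join([
    r'(?P<ForwardedFor>-|' + IP + r'(?:, ' + IP + r')*)',
    r'(?P<RemoteLogname>\S*)',
    r'(?P<RemoteUser>\S*)',
    r'(?P<Timestamp>\[[^\]]+\])',
    r'(?P<FirstLineOfRequest>(?:' + QUOTED + r')?)',
    r'(?P<Status>\S*)',
    r'(?P<Size>\S*)',
    r'(?P<Referrer>(?:' + QUOTED + r')?)',
    r'(?P<UserAgent>(?:' + QUOTED + r')?)',
]) + r' ?(?P<SiteIntelligenceCookie>(?:' + QUOTED + r')?)'

REGEX = re.compile(PATTERN)


def parse(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = REGEX.match(line.strip())
    return match.groupdict() if match else None


# One of the lines the original pattern could not match.
sample = ('192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] '
          '"GET http://sekrit.com/node/175523 HTTP/1.1" 200 - '
          '"http://sekrit.com/search/results/"3%2B2%20course"" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"')

fields = parse(sample)
with_cookie = parse(sample + ' "abc123"')  # hypothetical cookie value
```

As in the original code, SiteIntelligenceCookie comes back as an empty string when the field is missing, so the existing == '' check would keep working.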