Re: Parsing haproxy log files (python)

2011-03-20 Thread Willy Tarreau
Hi Holger,

On Sat, Mar 19, 2011 at 05:32:08PM +0100, Holger Just wrote:
> Hi Roy,
> 
> On 2011-03-18 22:21, Roy Smith wrote:
> > Before I reinvent the wheel, has anybody already written code to parse
> > haproxy log messages with Python?
> 
> I have, although it's not _that_ fast. My approach requires about 1
> minute per 100 MB of gzipped logs (with a roughly 10:1 compression).
> 
> If your use case matches the features of halog, you should definitely
> try that instead. It's written by Willy himself and is able to easily
> max out your streaming file I/O (meaning it is orders of magnitude
> faster than you could ever do it in Python itself).

In fact, I'd like halog to be more commonly usable as a low-level
"pre-parser", which means it would take care of extracting the useful
information from the logs so that higher-level scripts can process
pre-digested information.

Of course it will never be able to do everything, but if some scripts
don't need all the lines of a log file, we should ensure that halog
provides enough means to filter those lines out. For instance, right
now you can already use halog to ensure that only valid parsable lines
are returned. Most likely a number of other filtering options need to
be added; we need to figure out which ones.

Regards,
Willy




Re: Parsing haproxy log files (python)

2011-03-19 Thread Holger Just
Hi Roy,

On 2011-03-18 22:21, Roy Smith wrote:
> Before I reinvent the wheel, has anybody already written code to parse
> haproxy log messages with Python?

I have, although it's not _that_ fast. My approach requires about 1
minute per 100 MB of gzipped logs (with a roughly 10:1 compression).

If your use case matches the features of halog, you should definitely
try that instead. It's written by Willy himself and is able to easily
max out your streaming file I/O (meaning it is orders of magnitude
faster than you could ever do it in Python itself).

That said, the gist of my analyzing implementation follows. It is
targeted at the verbose HTTP log format of HAProxy and Python 2.4. The
terminology is the one used in the configuration manual of HAProxy.
Refer to it for a description of the various fields.

--Holger

--

#!/usr/bin/env python
# encoding: utf-8

import os
import re
import subprocess as sub

# Does the syslog server escape quotes?
template_escape = True

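# Named groups use the field names from the HAProxy configuration manual.
# backend_server_name captures the combined "backend_name/server_name" token.
haproxy_re = (r'haproxy\[(?P<pid>\d+)\]: '
  r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):(?P<client_port>\d{1,5}) '
  r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})\] '
  r'(?P<frontend_name>\S+) (?P<backend_server_name>\S+) '
  r'(?P<Tq>(-1|\d+))/(?P<Tw>(-1|\d+))/(?P<Tc>(-1|\d+))/(?P<Tr>(-1|\d+))/'
  r'(?P<Tt>\+?\d+) '
  r'(?P<status_code>\d{3}) (?P<bytes_read>\d+) '
  r'(?P<captured_request_cookie>\S+) (?P<captured_response_cookie>\S+) '
  r'(?P<termination_state>[\w-]{4}) (?P<actconn>\d+)/(?P<feconn>\d+)/'
  r'(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+) '
  r'(?P<srv_queue>\d+)/(?P<backend_queue>\d+) '
  r'(\{(?P<captured_request_headers>.*?)\} )?'
  r'(\{(?P<captured_response_headers>.*?)\} )?')

if template_escape:
  haproxy_re += r'\\"(?P<http_request>.+)\\"'
else:
  haproxy_re += r'"(?P<http_request>.+)"'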

haproxy_re = re.compile(haproxy_re)

def scan(logfile_path):
  (root, ext) = os.path.splitext(logfile_path)
  process = None
  if ext == ".gz":
    # Use a shellout for unzipping. This is about 2-5 times faster
    # than doing it in python.
    process = sub.Popen(["/bin/gunzip", "--stdout", logfile_path],
                        stdout=sub.PIPE, bufsize=1)
    fd = process.stdout
  else:
    fd = open(logfile_path, "r")

  line_no = 0
  for line in fd:
    line_no += 1
    try:
      match = haproxy_re.search(line)
      if not match:
        # A non-request, e.g. an error or an info message of HAProxy
        # We just ignore it and continue with the next line
        continue

      fields = match.groupdict()
      # Captured headers are logged as one "|"-separated string per block
      if fields["captured_request_headers"]:
        fields["captured_request_headers"] = \
          fields["captured_request_headers"].split("|")
      if fields["captured_response_headers"]:
        fields["captured_response_headers"] = \
          fields["captured_response_headers"].split("|")

      # Now you have the matched parts in the fields dict
      # And you can do whatever you like with it :)

    except:
      # Report the offending line, then re-raise the original error
      print("An error occurred in line %s. Last line was:" % line_no)
      print(line)
      raise

  # finalize the file reading
  if process:
    process.communicate()
  else:
    fd.close()
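
For illustration only, here is a minimal sketch of what one could do with
the fields dicts once they have been collected from the loop above. The
summarize helper is hypothetical, and the "status_code" and "Tt" (total
time in milliseconds) keys are assumptions modelled on the field names in
the HAProxy configuration manual. It relies on collections.Counter, so it
needs Python 2.7 or later rather than 2.4.

```python
from collections import Counter


def summarize(records):
    """Tally status codes and average the total time over parsed records.

    `records` is any iterable of dicts shaped like match.groupdict(),
    assumed to carry the keys "status_code" and "Tt" as strings.
    """
    status_counts = Counter()
    total_ms = 0
    count = 0
    for fields in records:
        status_counts[fields["status_code"]] += 1
        # Tt may carry a leading "+"; int() accepts that sign.
        total_ms += int(fields["Tt"])
        count += 1
    average_ms = float(total_ms) / count if count else 0.0
    return status_counts, average_ms


# Hand-written sample records, not taken from a real log
sample = [
    {"status_code": "200", "Tt": "109"},
    {"status_code": "200", "Tt": "+91"},
    {"status_code": "503", "Tt": "40"},
]
counts, average_ms = summarize(sample)
```

Keeping the aggregation separate from the parsing loop means the same
helper would work whether the records come from the regex above or from
some other pre-parser.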



Parsing haproxy log files (python)

2011-03-18 Thread Roy Smith
Before I reinvent the wheel, has anybody already written code to parse
haproxy log messages with Python?