2009/9/29 Scooter <slbent...@gmail.com>:
> I'm attempting to reformat an Apache log file that was written with a
> custom output format. I'm attempting to get it to W3C format using a
> Python script. The problem I'm having is the field-to-field matching.
> In my Python code I'm using split with spaces as my delimiter, but it
> fails when it reaches the user agent because that field itself
> contains spaces. The user agent is enclosed in double quotes, though.
> So is there a way to split on a certain delimiter but not split
> within quoted words?
>
> i.e. a line might look like
>
> 2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
> Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC
> 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200
> 1923 1360 31715 -

Try shlex:

>>> import shlex
>>> s = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
...      'Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; '
...      '.NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200')
>>> shlex.split(s)
['2009-09-29', '12:00:00', '-', 'GET', '/', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)', 'http://somehost.com', '200']

-- 
mvh Björn
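
One caveat that may be worth checking against real log data: shlex follows shell quoting rules, so a stray apostrophe outside the double quotes (for example in a request path) makes shlex.split raise ValueError. A rough sketch of an alternative, assuming fields are separated by exactly one space and only double quotes are used for quoting, is the csv module with a space delimiter. The helper names split_shlex and split_csv and the shortened sample line are made up for illustration; they are not from the original post.

```python
import csv
import shlex

# Sample line modelled on the one in the question (user agent shortened).
LINE = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
        'Windows NT 6.0)" http://somehost.com 200 1923 1360 31715 -')


def split_shlex(line):
    """Split using shell-style rules, as suggested in the reply."""
    return shlex.split(line)


def split_csv(line):
    """Split on single spaces, treating only double quotes as quoting.

    Assumption: fields are separated by exactly one space; runs of
    spaces would produce empty fields with this approach.
    """
    return next(csv.reader([line], delimiter=' ', quotechar='"'))


if __name__ == '__main__':
    # Both approaches keep the quoted user agent as one field here.
    print(split_shlex(LINE) == split_csv(LINE))

    # An unbalanced apostrophe is fine for csv but an error for shlex.
    tricky = "GET /it's 200"
    print(split_csv(tricky))
    try:
        split_shlex(tricky)
    except ValueError as exc:
        print('shlex failed:', exc)
```

Which of the two fits better depends on what the custom log format actually escapes, so it is probably worth running both over a sample of the real file before settling on one.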