Re: [Tutor] Logfile Manipulation

2009-11-09 Thread ALAN GAULD
> > what the default python sorting algorithm is on a list, but AFAIK
> > you'd be looking at a constant O(log 10)
>
> I'm not a mathematician - what does this mean, in layperson's terms?

O(log10) is a way of expressing the efficiency of an algorithm. Its execution time is proportional (in the
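
For reference, Python's built-in sort is Timsort, which is O(n log n) rather than constant; a rough, machine-dependent way to see that growth (the sizes and repeat count below are arbitrary, not from the thread):

    # Rough illustration (not from the thread): list.sort()/sorted() use Timsort,
    # which is O(n log n) on random data, so doubling n should take slightly
    # more than twice as long.  Exact numbers depend on the machine.
    import random
    import timeit

    for n in (10**5, 2 * 10**5, 4 * 10**5):
        data = [random.random() for _ in range(n)]
        secs = timeit.timeit(lambda: sorted(data), number=5)
        print("n=%7d  %.3f s" % (n, secs))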

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Stephen Nelson-Smith
On Mon, Nov 9, 2009 at 3:15 PM, Wayne Werner wrote:
> On Mon, Nov 9, 2009 at 7:46 AM, Stephen Nelson-Smith wrote:
>>
>> And the problem I have with the below is that I've discovered that the
>> input logfiles aren't strictly ordered - ie there is variance by a
>> second or so in some of the ent

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread ALAN GAULD
> I can sort the biggest logfile (800M) using unix sort in about 1.5
> mins on my workstation. That's not really fast enough, with
> potentially 12 other files

You won't beat sort with Python. You have to be realistic, these are very big files! Python should be faster overall but for speci
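
If sort stays in the picture, one way to combine it with Python is to stream its output through a pipe; a minimal sketch (the filename is a placeholder, and a real run would need key options matching the log's timestamp field):

    # Sketch only: let GNU sort handle the huge file and read the sorted result
    # back line by line.  "access.log" is a placeholder; a real invocation would
    # need -t/-k options so the sort key is the timestamp field, not the raw line.
    import subprocess

    proc = subprocess.Popen(["sort", "-s", "access.log"], stdout=subprocess.PIPE)
    for line in proc.stdout:
        pass                  # hand each sorted line to the rest of the pipeline
    proc.wait()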

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Wayne Werner
On Mon, Nov 9, 2009 at 7:46 AM, Stephen Nelson-Smith wrote:
> And the problem I have with the below is that I've discovered that the
> input logfiles aren't strictly ordered - ie there is variance by a
> second or so in some of the entries.
>
Within a given set of 10 lines, is the first line and

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Stephen Nelson-Smith
And the problem I have with the below is that I've discovered that the input logfiles aren't strictly ordered - ie there is variance by a second or so in some of the entries. I can sort the biggest logfile (800M) using unix sort in about 1.5 mins on my workstation. That's not really fast enough,
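
One way to cope with that without re-sorting whole files is to push entries through a small lookahead buffer, which repairs the ordering as long as nothing is displaced further than the buffer reaches. A sketch; the buffer size is a guess rather than anything from the thread:

    # Sketch: repair an "almost sorted" stream with a bounded lookahead heap.
    # Works as long as no entry is out of place by more than `lookahead`
    # positions; 1000 is an arbitrary guess.
    import heapq

    def resorted(entries, lookahead=1000):
        """entries is an iterable of (timestamp, line) pairs, nearly in order."""
        heap = []
        for entry in entries:
            heapq.heappush(heap, entry)
            if len(heap) > lookahead:
                yield heapq.heappop(heap)
        while heap:                  # flush whatever is left at end of input
            yield heapq.heappop(heap)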

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Stephen Nelson-Smith
Hi,

> If you create iterators from the files that yield (timestamp, entry)
> pairs, you can merge the iterators using one of these recipes:
> http://code.activestate.com/recipes/491285/
> http://code.activestate.com/recipes/535160/

Could you show me how I might do that? So far I'm at the stage o
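
For what it's worth, heapq.merge (new in Python 2.6) does what those recipes do; a minimal sketch, assuming a parse() helper that turns a raw line into a (timestamp, line) pair:

    # Sketch of merging pre-sorted per-file streams; heapq.merge (Python 2.6+)
    # covers what the two linked recipes implement.  parse() is an assumed
    # helper that returns a (timestamp, original_line) tuple.
    import heapq

    def entries(path, parse):
        with open(path) as f:
            for line in f:
                yield parse(line)

    def merged_lines(paths, parse):
        streams = [entries(path, parse) for path in paths]
        for timestamp, line in heapq.merge(*streams):
            yield line

Because each per-file stream is already (nearly) sorted, the merge only ever holds one pending entry per file in memory, so the 6G files never need to be loaded whole.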

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Gerard Flanagan
Stephen Nelson-Smith wrote:
> Hi,
> Any advice or experiences?

go here and download the pdf!
http://www.dabeaz.com/generators-uk/

Someone posted this the other day, and I went and read through it and played around a bit and it's exactly what you're looking for - plus it has one vs. slid

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Stephen Nelson-Smith
Hi,

>> Any advice or experiences?
>>
>
> go here and download the pdf!
> http://www.dabeaz.com/generators-uk/
> Someone posted this the other day, and I went and read through it and played
> around a bit and it's exactly what you're looking for - plus it has one vs.
> slide of python vs. awk.
>
I
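
The approach in that PDF chains small generators into a pipeline; a rough sketch in that spirit (the glob pattern and filter regex below are made up for illustration, not taken from the thread):

    # Pipeline sketch in the spirit of the Beazley slides: open all matching
    # logs (compressed or not), chain their lines, and filter lazily.
    import glob
    import gzip
    import re

    def open_logs(pattern):
        for name in sorted(glob.glob(pattern)):
            yield gzip.open(name) if name.endswith(".gz") else open(name)

    def cat(files):
        for f in files:
            for line in f:
                yield line

    log_lines = cat(open_logs("access*.log*"))
    wanted = re.compile(r"GET /service\.php")          # example filter only
    hits = (line for line in log_lines if wanted.search(line))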

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Wayne Werner
On Sun, Nov 8, 2009 at 11:41 PM, Stephen Nelson-Smith wrote:
> I've got a large amount of data in the form of 3 apache and 3 varnish
> logfiles from 3 different machines. They are rotated at 0400. The
> logfiles are pretty big - maybe 6G per server, uncompressed.
>
> I've got to produce a combin

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Kent Johnson
On Mon, Nov 9, 2009 at 4:36 AM, Stephen Nelson-Smith wrote:
I want to extract 24 hrs of data based on timestamps like this:
[04/Nov/2009:04:02:10 +0000]
>>>
>>> OK It looks like you could use a regex to extract the first
>>> thing you find between square brackets. Then convert that to a

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Martin A. Brown
Hello,

 : An apache logfile entry looks like this:
 :
 : 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
 : /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
 : HTTP/1.1" 200 50 "-" "-"
 :
 : I want to extract 24 hrs of data based t

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Stephen Nelson-Smith
Sorry - forgot to include the list.

On Mon, Nov 9, 2009 at 9:33 AM, Stephen Nelson-Smith wrote:
> On Mon, Nov 9, 2009 at 9:10 AM, ALAN GAULD wrote:
>>
>>> An apache logfile entry looks like this:
>>>
>>> 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
>>> /service.php?s=nav&arg[]=&arg[]=home

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread ALAN GAULD
> An apache logfile entry looks like this:
>
> 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"
>
> I want to extract 24 hrs of data based on timestamps like this:
>
> [04/Nov/2009:04:02:10 +0000]

OK It lo
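
A sketch of that suggestion, assuming the standard timestamp layout shown above (the timezone offset is captured but not applied):

    # Sketch: pull the [dd/Mon/yyyy:hh:mm:ss +zzzz] field out of a line and
    # turn it into something comparable.  Assumes the layout shown above.
    import re
    import time

    STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) ([+-]\d{4})\]")

    def timestamp(line):
        m = STAMP.search(line)
        if m is None:
            return None
        return time.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")

time.strptime returns a struct_time, which compares and sorts chronologically, so the same value can serve as the key for merging and for the day-window check.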

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Stephen Nelson-Smith
On Mon, Nov 9, 2009 at 8:47 AM, Alan Gauld wrote:
> I'm not familiar with Apache log files so I'll let somebody else answer,
> but I suspect you can either use string.split() or a re.findall(). You might
> even be able to use csv. Or if they are in XML you could use ElementTree.
> It all depends

Re: [Tutor] Logfile Manipulation

2009-11-09 Thread Alan Gauld
"Stephen Nelson-Smith" wrote * How does Python compare in performance to shell, awk etc in a big pipeline? The shell script kills the CPU Python should be significantly faster than the typical shell script and it should consume less resources, although it will probably still use a fair bit o

[Tutor] Logfile Manipulation

2009-11-08 Thread Stephen Nelson-Smith
I've got a large amount of data in the form of 3 apache and 3 varnish logfiles from 3 different machines. They are rotated at 0400. The logfiles are pretty big - maybe 6G per server, uncompressed.

I've got to produce a combined logfile for 0000-2359 for a given day, with a bit of filtering (remo
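
Once each entry carries a parsed timestamp (as suggested further up the thread), the 0000-2359 window is just a comparison against the day's boundaries; a minimal sketch, using the example date from the log line quoted earlier:

    # Sketch of the one-calendar-day filter; the date is the example from the
    # thread, and the (datetime, line) pairs are assumed to come from whatever
    # parsing step precedes this.
    from datetime import datetime

    DAY = datetime(2009, 11, 4)                       # example target day
    START, END = DAY, DAY.replace(hour=23, minute=59, second=59)

    def within_day(pairs):
        """pairs is an iterable of (datetime, line); yields only the target day."""
        for when, line in pairs:
            if START <= when <= END:
                yield line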