Jay Paulson (CE CEN) wrote:
> Hello everyone!  I've been given the responsibility of coding an Apache 
> access_log parser.  My task is to return the number of hits for certain 
> file extensions on certain dates from specific IP addresses.
> 
> As of now I'm only going back 7 days in the log looking for this information 
> and I'm only looking for 5 file types (.doc, .pdf, .html, .php, and .flv).  
> I'm using the fgets() function so I can read the file line by line and do the 
> matches that I need to do and increment the counters as needed.  Right now I 
> have 3 loops looking for everything, which seems to me not to be the best way 
> of doing this.  I've also found that a line may contain the file extension 
> I want, but only as the referer (source) of a request for another file.  
> (See the example below.)
> 
> Log file example:
> I want the first line but not the second.  The second line is a request for 
> a .css file that was loaded by the .html page, so I don't want it.  I do 
> want the first line, which requests only the .html file.
> 
> 10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /home.html HTTP/1.1" 200 8220 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
> 10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /styles/redesign.css HTTP/1.1" 200 2381 "http://wfmu.wfm.pvt/home.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
> 
> At any rate, here's some of my pseudocode/code for what I'm trying to 
> accomplish.  I know there has to be a better way for this and I'm looking for 
> suggestions!
<snip>

Save yourself a ton of work.  Dump the raw logs into a db, and you can 
do all the queries on the db.  Something like this...

CREATE TABLE `rawLogs` (
   -- IPv4 address as an integer (presumably via INET_ATON), so it must be unsigned
   `ipAddress` int(10) unsigned NOT NULL default '0',
   `rfcIdentity` varchar(32) NOT NULL default '',
   `apacheUser` varchar(32) NOT NULL default '',
   -- request time as a Unix timestamp
   `date` int(15) NOT NULL default '0',
   `request` longtext NOT NULL,
   `statusCode` varchar(32) NOT NULL default '',
   `sizeBytes` int(11) NOT NULL default '0',
   `referer` longtext NOT NULL,
   `userAgent` longtext NOT NULL,
   KEY `ipAddress` (`ipAddress`),
   FULLTEXT KEY `search` (`request`,`referer`,`userAgent`)
) ENGINE=MyISAM;  -- TYPE= is the older spelling; newer MySQL versions require ENGINE=
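
Once rows are in a table like this, the per-extension, per-IP, per-date counts become a single query.  Here is a minimal sketch of that idea in Python.  It assumes SQLite as a stand-in for MySQL so it runs without a server, uses only a subset of the rawLogs columns, stores the IP as text for readability, and uses made-up sample rows.  The helper name hits_for is hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here (an assumption, so the sketch runs anywhere).
# Column names follow the rawLogs table above; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rawLogs (ipAddress TEXT, date INTEGER, request TEXT, referer TEXT)"
)
rows = [
    ("10.25.40.64", 1136122398, "GET /home.html HTTP/1.1", "-"),
    ("10.25.40.64", 1136122398, "GET /styles/redesign.css HTTP/1.1",
     "http://wfmu.wfm.pvt/home.html"),
    ("10.25.40.64", 1136122400, "GET /docs/report.pdf HTTP/1.1", "-"),
]
conn.executemany("INSERT INTO rawLogs VALUES (?, ?, ?, ?)", rows)


def hits_for(conn, extension, ip, since):
    """Count requests from one IP for one extension since a Unix timestamp."""
    # Only the request column is matched, so a referer that mentions
    # home.html (the .css row) is not counted as an .html hit.
    # Limitation: a request with a query string (/page.php?x=1) is not matched.
    pattern = "%" + extension + " HTTP/%"
    cur = conn.execute(
        "SELECT COUNT(*) FROM rawLogs "
        "WHERE ipAddress = ? AND date >= ? AND request LIKE ?",
        (ip, since, pattern),
    )
    return cur.fetchone()[0]


for ext in (".doc", ".pdf", ".html", ".php", ".flv"):
    print(ext, hits_for(conn, ext, "10.25.40.64", 0))
```

Restricting the match to the request column is what avoids the referer problem described in the original question; the "last 7 days" limit would be passed as the since argument.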

A few questions about this train of thought.  I can see the advantages of 
putting the raw log file into a database, but I would still need to parse the 
file to get the information for each column.  I'm also not quite sure what 
some of your fields are for.  What is 'rfcIdentity'?  Why would I need an 
'apacheUser'?  Anyway, I'm not sure how I would easily extract this 
information for the massive number of inserts I would have to do on a 
10 MB log file.

jay
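
For what it's worth, the parsing step asked about above can be fairly small.  In Apache's combined log format, the second field is the remote logname reported by identd (RFC 1413) and the third is the user name from HTTP authentication; both are "-" in the sample lines, which is presumably what 'rfcIdentity' and 'apacheUser' refer to.  Below is a minimal sketch in Python, assuming the log uses the combined format.  The regex group names simply reuse the rawLogs column names, and the helpers parse_line, requested_extension, and count_hits are hypothetical.  The same pattern should carry over to preg_match in PHP.

```python
import re
from collections import Counter
from datetime import datetime

# Combined format: host ident user [time] "request" status bytes
# "referer" "user-agent". Group names reuse the rawLogs column names.
LINE_RE = re.compile(
    r'(?P<ipAddress>\S+) (?P<rfcIdentity>\S+) (?P<apacheUser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" (?P<statusCode>\d{3}) '
    r'(?P<sizeBytes>\d+|-) "(?P<referer>[^"]*)" "(?P<userAgent>[^"]*)"'
)
WANTED = {".doc", ".pdf", ".html", ".php", ".flv"}


def parse_line(line):
    """Return a dict of column values, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    row = m.groupdict()
    row["day"] = row["date"].split(":", 1)[0]  # e.g. "01/Jan/2006"
    # Convert "01/Jan/2006:07:33:18 -0600" to a Unix timestamp.
    stamp = datetime.strptime(row["date"], "%d/%b/%Y:%H:%M:%S %z")
    row["date"] = int(stamp.timestamp())
    row["sizeBytes"] = 0 if row["sizeBytes"] == "-" else int(row["sizeBytes"])
    return row


def requested_extension(request):
    """Extension of the requested path only; the referer is never examined."""
    parts = request.split()
    if len(parts) < 2:
        return None
    path = parts[1].split("?", 1)[0]  # drop any query string
    dot = path.rfind(".")
    return path[dot:].lower() if dot > path.rfind("/") else None


def count_hits(lines):
    """Single pass over the log: count hits per (IP, day, extension)."""
    counts = Counter()
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue
        ext = requested_extension(row["request"])
        if ext in WANTED:
            counts[(row["ipAddress"], row["day"], ext)] += 1
    return counts


SAMPLE = [
    '10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /home.html HTTP/1.1" '
    '200 8220 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    '10.25.40.64 - - [01/Jan/2006:07:33:18 -0600] "GET /styles/redesign.css '
    'HTTP/1.1" 200 2381 "http://wfmu.wfm.pvt/home.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
]
print(count_hits(SAMPLE))
```

Because the extension is taken from the request path alone, the .css line is skipped even though its referer ends in .html.  The dicts returned by parse_line map directly onto the table columns, so they could be inserted in batches inside one transaction rather than one statement per line.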
