I'm analyzing a daily Apache log file. I'd like to get the number of
requests and the number of visits per hour.
I managed to get the requests, but how do I get the visits?
grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
) AS (
client: chararray,
username: chararray,
date: chararray,
hour: chararray,
minute: chararray,
second: chararray,
timeZone: chararray,
request: chararray,
statusCode: int,
bytesSent: chararray,
referer: chararray,
userAgent: chararray,
remoteUser: chararray,
timeTaken: chararray
);
grunt> A = GROUP LOGS_BASE BY hour;
grunt> DESCRIBE A;
A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
grunt> B = FOREACH A GENERATE group AS hour, COUNT(LOGS_BASE) AS requests;
grunt> C = ORDER B BY hour; -- requests by hour
How can I now get the distinct count of clients per hour?
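One direction that might work, assuming a "visit" can be approximated by a
distinct client address within the hour, is a nested FOREACH with DISTINCT on
the grouped relation A from above. This is an untested sketch; the names D, E,
clients, unique_clients and visits are made up for illustration:

```pig
-- Assumption: one "visit" = one distinct client value per hour.
-- A is the relation produced by: A = GROUP LOGS_BASE BY hour;
D = FOREACH A {
    -- project only the client column from the bag of log rows
    clients = LOGS_BASE.client;
    -- drop duplicate clients within this hour's bag
    unique_clients = DISTINCT clients;
    GENERATE group AS hour, COUNT(unique_clients) AS visits;
};
E = ORDER D BY hour; -- distinct clients by hour
```

Note that this counts distinct clients only; it would not split one client's
traffic into separate sessions, so it may undercount real visits.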
Thanks for your help!
--
David Riccitelli