-- One row per (hour, client) pair, with the number of requests from that client in that hour.
by_hour_client =
  foreach
    (group logs by (hour, client) parallel $p)
  generate
    flatten(group) as (hour, client),
    COUNT(logs) as num_reqs;

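-- Regroup by hour: each row in the bag is one distinct client, so COUNT gives distinct clients.
by_hour =
  foreach
    (group by_hour_client by hour parallel $p2)
  generate
    group as hour,
    COUNT(by_hour_client) as num_dist_clients,
    SUM(by_hour_client.num_reqs) as total_requests;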

You can also do this using a nested distinct, but depending on what your
data looks like, it might be a bad idea: it can put a lot of pressure on
individual reducers, which have to do the inner distinct in memory (although
part of that work does get pushed up to the mappers):

by_hour =
  foreach (group logs by hour) {
   dist_clients = distinct logs.client;
   generate
    group as hour,
    COUNT(dist_clients) as num_dist_clients,
    COUNT(logs) as total_requests;
}
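
For what it's worth, here is roughly how the two-step version might look
against the relation names from the question quoted below. This is only a
sketch: it assumes LOGS_BASE is loaded as in the quoted message, it leaves
out the parallel hints, and by_hour_sorted is just a name made up for
illustration.

```pig
-- Assumes LOGS_BASE from the quoted message, with chararray fields hour and client.
by_hour_client =
  foreach (group LOGS_BASE by (hour, client))
  generate
    flatten(group) as (hour, client),
    COUNT(LOGS_BASE) as num_reqs;

-- One output row per hour: distinct clients and total requests.
by_hour =
  foreach (group by_hour_client by hour)
  generate
    group as hour,
    COUNT(by_hour_client) as num_dist_clients,
    SUM(by_hour_client.num_reqs) as total_requests;

-- Hypothetical final step, mirroring the ORDER in the question.
by_hour_sorted = order by_hour by hour;
```

Since hour is a chararray here, the order is lexical, which should still
come out right as long as the hours are zero-padded two-digit strings.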

D

On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:

> I'm analyzing a daily apache log file. I'd like to get the number of
> requests and of visits by hour.
>
> I managed to get the requests, but how do I get the visits?
>
> grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>  FLATTEN(
>    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+)
> \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\]
> "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>  ) AS (
>    client:   chararray,
>    username: chararray,
>    date: chararray,
>    hour: chararray,
>    minute: chararray,
>    second: chararray,
>    timeZone: chararray,
>    request:  chararray,
>    statusCode: int,
>    bytesSent: chararray,
>    referer:  chararray,
>    userAgent: chararray,
>    remoteUser: chararray,
>    timeTaken: chararray
> );
> grunt> A = GROUP LOGS_BASE BY hour;
> DESCRIBE A;
> A: {group: chararray,LOGS_BASE: {(client: chararray,username:
> chararray,date: chararray,hour: chararray,minute: chararray,second:
> chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent:
> chararray,referer: chararray,userAgent: chararray,remoteUser:
> chararray,timeTaken: chararray)}}
> grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> grunt> C = ORDER B BY hour; -- requests by hour
>
> How can I now get the distinct count of clients per hour?
>
> Thanks for your help!
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
>
> ********************************************************************************
