Right, that should read "by_hour_client.num_reqs". Don't trust relative measurements you get for small data on a single computer in local mode. Things change when you start running on hundreds of gigs with real skew on a cluster.
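
For reference, a sketch of the two-step version with that fix applied. It assumes a relation named logs with hour and client fields, as in the earlier messages. The parallel clauses are left out here; they would need tuning for a real cluster.

```pig
-- Sketch only: assumes `logs` is already loaded with chararray fields `hour` and `client`.

-- Step 1: one row per (hour, client) pair, with that client's request count.
by_hour_client =
    foreach (group logs by (hour, client))
    generate
        flatten(group) as (hour, client),
        COUNT(logs) as num_reqs;

-- Step 2: regroup by hour. After the group, num_reqs lives inside the
-- by_hour_client bag, so it has to be projected out of the bag for SUM.
by_hour =
    foreach (group by_hour_client by hour)
    generate
        group as hour,
        COUNT(by_hour_client) as num_dist_clients,
        SUM(by_hour_client.num_reqs) as total_requests;
```

Running describe by_hour_client and describe by_hour after each step shows which fields sit at the top level and which are nested in a bag.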
D

On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:

> Thanks Dmitriy,
>
> The second method took less than 26 secs. on my computer (~550.000 lines).
> The first method is giving me the following error:
>
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
> <line 34, column 7> Invalid field projection. Projected field [num_reqs]
> does not exist in schema:
> group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>
> when I try to set the by_hour (after having set the by_hour_client):
>
> grunt> by_hour_client =
> >> foreach
> >> (group logs by (hour, client))
> >> generate
> >> flatten(group) as (hour, client),
> >> COUNT(logs) as num_reqs;
> grunt> by_hour =
> >> foreach
> >> (group by_hour_client by hour)
> >> generate
> >> group as hour,
> >> COUNT(by_hour_client) as num_dist_clients,
> >> SUM(num_reqs) as total_requests;
>
> If I understood correctly that's because the num_reqs is in the bag, as a
> result of the
> *(group by_hour_client by hour)*
> correct? So I changed the last line to
> *SUM(by_hour_client.num_reqs) as total_requests;*
> and it worked (it took a little more than 29 seconds).
>
> Thanks for your help,
> David
>
>
> On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > by_hour_client =
> >   foreach
> >     (group logs by (hour, client) parallel $p)
> >   generate
> >     flatten(group) as (hour, client),
> >     COUNT(logs) as num_reqs;
> >
> > by_hour =
> >   foreach
> >     (group by_hour_client by hour parallel $p2)
> >   generate
> >     group as hour,
> >     COUNT(by_hour_client) as num_dist_clients,
> >     SUM(num_reqs) as total_requests;
> >
> > You can also do this using a nested distinct, but depending on what your
> > data looks like, it might be a bad idea, as it can put a lot of pressure on
> > individual reducers that have to do the inner distinct in memory (although
> > they do push part of this up to the mappers):
> >
> > by_hour =
> >   foreach (group logs by hour) {
> >     dist_clients = distinct logs.client;
> >     generate
> >       group as hour,
> >       COUNT(dist_clients) as num_dist_clients,
> >       COUNT(logs) as total_requests;
> >   }
> >
> > D
> >
> > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
> >
> > > I'm analyzing a daily apache log file. I'd like to get the number of
> > > requests and of visits by hour.
> > >
> > > I managed to get the requests, but how do I get the visits?
> > >
> > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> > >   FLATTEN(
> > >     REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+)
> > > \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\]
> > > "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> > >   ) AS (
> > >     client: chararray,
> > >     username: chararray,
> > >     date: chararray,
> > >     hour: chararray,
> > >     minute: chararray,
> > >     second: chararray,
> > >     timeZone: chararray,
> > >     request: chararray,
> > >     statusCode: int,
> > >     bytesSent: chararray,
> > >     referer: chararray,
> > >     userAgent: chararray,
> > >     remoteUser: chararray,
> > >     timeTaken: chararray
> > >   );
> > > grunt> A = GROUP LOGS_BASE BY hour;
> > > DESCRIBE A;
> > > A: {group: chararray,LOGS_BASE: {(client: chararray,username:
> > > chararray,date: chararray,hour: chararray,minute: chararray,second:
> > > chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent:
> > > chararray,referer: chararray,userAgent: chararray,remoteUser:
> > > chararray,timeTaken: chararray)}}
> > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> > > grunt> C = ORDER B BY hour; -- requests by hour
> > >
> > > How can I now get the distinct count of clients per hour?
> > >
> > > Thanks for your help!
> > >
> > > --
> > > David Riccitelli
> > >
> > > ********************************************************************************
> > > InsideOut10 s.r.l.
> > > P.IVA: IT-11381771002
> > > Fax: +39 0110708239
> > > ---
> > > LinkedIn: http://it.linkedin.com/in/riccitelli
> > > Twitter: ziodave
> > > ---
> > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> > > ********************************************************************************
