There's something strange in the results, however:

(00,129,30096)
(01,91,16487)
(02,57,11686)
(03,41,6041)
(04,30,4882)
(05,33,4154)
(06,65,8031)
(07,66,12260)
(08,95,17924)
(09,131,21187)
(10,162,26607)
(11,155,28503)
(12,146,27863)
(13,152,29130)
(14,159,32784)
(15,150,28898)
(16,143,28973)
(17,169,29024)
(18,199,26585)
(19,182,28803)
(20,224,32511)
(21,232,38584)
(22,225,39924)
(23,191,33606)
(,0,0)

What is the last line, (,0,0)? Its count is zero, so it shouldn't really be there, correct?

(Using Pig 0.9.0)

Thanks,
David

On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy wrote:

> Right, that should read "by_hour_client.num_reqs".
>
> Don't trust relative measurements you get for small data on a single
> computer in local mode. Things change when you start running on hundreds of
> gigs with real skew on a cluster.
>
> D
>
> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli wrote:
>
> > Thanks Dmitriy,
> >
> > The second method took less than 26 secs. on my computer (~550.000 lines).
> > The first method is giving me the following error:
> >
> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
> > does not exist in schema:
> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
> >
> > when I try to set the by_hour (after having set the by_hour_client):
> >
> > grunt> by_hour_client =
> > >> foreach
> > >> (group logs by (hour, client))
> > >> generate
> > >> flatten(group) as (hour, client),
> > >> COUNT(logs) as num_reqs;
> > grunt> by_hour =
> > >> foreach
> > >> (group by_hour_client by hour)
> > >> generate
> > >> group as hour,
> > >> COUNT(by_hour_client) as num_dist_clients,
> > >> SUM(num_reqs) as total_requests;
> >
> > If I understood correctly that's because the num_reqs is in the bag, as a
> > result of the
> > * (group by_hour_client by hour)*
> > correct? So I changed the last line to
> > *SUM(by_hour_client.num_reqs) as total_requests;*
> > and it worked (it took a little more than 29 seconds).
> >
> > Thanks for your help,
> > David
> >
> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy wrote:
> >
> > > by_hour_client =
> > >   foreach
> > >   (group logs by (hour, client) parallel $p)
> > >   generate
> > >   flatten(group) as (hour, client),
> > >   COUNT(logs) as num_reqs;
> > >
> > > by_hour =
> > >   foreach
> > >   (group by_hour_client by hour parallel $p2)
> > >   generate
> > >   group as hour,
> > >   COUNT(by_hour_client) as num_dist_clients,
> > >   SUM(num_reqs) as total_requests;
> > >
> > > You can also do this using a nested distinct, but depending on what your
> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
> > > individual reducers that have to do the inner distinct in memory (although
> > > they do push part of this up to the mappers):
> > >
> > > by_hour =
> > >   foreach (group logs by hour) {
> > >     dist_clients = distinct logs.client;
> > >     generate
> > >       group as hour,
> > >       COUNT(dist_clients) as num_dist_clients,
> > >       COUNT(logs) as total_requests;
> > >   }
> > >
> > > D
> > >
> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli wrote:
> > >
> > > > I'm analyzing a daily apache log file. I'd like to get the number of
> > > > requests and of visits by hour.
> > > >
> > > > I managed to get the requests, but how do I get the visits?
> > > >
> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> > > >   FLATTEN(
> > > >     REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> > > >   ) AS (
> > > >     client: chararray,
> > > >     username: chararray,
> > > >     date: chararray,
> > > >     hour: chararray,
> > > >     minute: chararray,
> > > >     second: chararray,
> > > >     timeZone: chararray,
> > > >     request: chararray,
> > > >     statusCode: int,
> > > >     bytesSent: chararray,
> > > >     referer: chararray,
> > > >     userAgent: chararray,
> > > >     remoteUser: chararray,
> > > >     timeTaken: chararray
> > > >   );
> > > > grunt> A = GROUP LOGS_BASE BY hour;
> > > > DESCRIBE A;
> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> > > > grunt> C = ORDER B BY hour; -- requests by hour
> > > >
> > > > How can I now get the distinct count of clients per hour?
> > > >
> > > > Thanks for your help!
> > > >
> > > > --
> > > > David Riccitelli
> > > >
> > > > ********************************************************************************
> > > > InsideOut10 s.r.l.
> > > > P.IVA: IT-11381771002
> > > > Fax: +39 0110708239
> > > > ---
> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
> > > > Twitter: ziodave
> > > > ---
> > > > Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> > > > ********************************************************************************
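
A possible explanation for the (,0,0) row, offered as an assumption since nothing in the thread confirms it: REGEX_EXTRACT_ALL returns null for any line the pattern does not match, so every field of such a row, including hour, is null. GROUP puts all null keys into a single group, and COUNT skips tuples whose first field is null, which would produce one group with an empty key and zero counts. A minimal sketch that drops those rows before grouping follows. LOGS_BASE is the relation from the original message; the names parsed, unparsed and unparsed_count are hypothetical.

```pig
-- Assumption: rows with a null hour come from log lines the regex did not match.
parsed = FILTER LOGS_BASE BY hour IS NOT NULL;

-- Optional check: count the lines that failed to parse.
-- COUNT_STAR, unlike COUNT, also counts tuples whose first field is null.
unparsed = FILTER LOGS_BASE BY hour IS NULL;
unparsed_count = FOREACH (GROUP unparsed ALL) GENERATE COUNT_STAR(unparsed);
DUMP unparsed_count;

-- Same two-step aggregation as in the thread, run on the filtered relation.
by_hour_client =
  FOREACH (GROUP parsed BY (hour, client))
  GENERATE FLATTEN(group) AS (hour, client), COUNT(parsed) AS num_reqs;

by_hour =
  FOREACH (GROUP by_hour_client BY hour)
  GENERATE group AS hour,
           COUNT(by_hour_client) AS num_dist_clients,
           SUM(by_hour_client.num_reqs) AS total_requests;
```

If DUMP unparsed_count prints nothing, no lines were left unparsed and the empty-key row must have another cause. If it prints a number, inspecting a few rows of RAW_LOGS that fail the pattern may show which log format the regex misses.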
