I noticed that this issue arises only if I load the initial data with TextLoader() and parse it with REGEX_EXTRACT_ALL.
If I use PigStorage (splitting on spaces, i.e. without REGEX_EXTRACT_ALL), it works. But I need REGEX_EXTRACT_ALL in order to parse the lines correctly.

Does that make sense?

David

On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]> wrote:

> I still can't manage to accomplish my objectives. I'm now trying to get the
> max time taken so, as a test, I do:
> grunt> A = GROUP logs BY client;
>
> then (timeTaken is long):
> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>
> when I dump it, I get the following error:
> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
> (...)
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>
> Initially I thought that some timeTaken values were not compatible with the
> long data type, but I checked and re-checked. I also capture timeTaken with
> the \d+ regular expression.
>
> What am I doing wrong?
>
> Thanks!
> David
>
> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]> wrote:
>
>> I tried with another log file and that does not happen, so I suppose
>> there's some 'corrupted' line in the one I was testing.
>>
>>
>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:
>>
>>> There's something strange in the results however:
>>> (00,129,30096)
>>> (01,91,16487)
>>> (02,57,11686)
>>> (03,41,6041)
>>> (04,30,4882)
>>> (05,33,4154)
>>> (06,65,8031)
>>> (07,66,12260)
>>> (08,95,17924)
>>> (09,131,21187)
>>> (10,162,26607)
>>> (11,155,28503)
>>> (12,146,27863)
>>> (13,152,29130)
>>> (14,159,32784)
>>> (15,150,28898)
>>> (16,143,28973)
>>> (17,169,29024)
>>> (18,199,26585)
>>> (19,182,28803)
>>> (20,224,32511)
>>> (21,232,38584)
>>> (22,225,39924)
>>> (23,191,33606)
>>> (,0,0)
>>>
>>>
>>> What is the last line:
>>> (,0,0)
>>> the count is zero, it shouldn't really be there, correct?
>>>
>>> (Using pig 0.9.0)
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>
>>>> Right, that should read "by_hour_client.num_reqs".
>>>>
>>>> Don't trust relative measurements you get for small data on a single
>>>> computer in local mode. Things change when you start running on hundreds
>>>> of gigs with real skew on a cluster.
>>>>
>>>> D
>>>>
>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>>>>
>>>> > Thanks Dmitriy,
>>>> >
>>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>>> > The first method is giving me the following error:
>>>> >
>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>> > does not exist in schema:
>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>> >
>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>> >
>>>> > grunt> by_hour_client =
>>>> > >> foreach
>>>> > >> (group logs by (hour, client))
>>>> > >> generate
>>>> > >> flatten(group) as (hour, client),
>>>> > >> COUNT(logs) as num_reqs;
>>>> > grunt> by_hour =
>>>> > >> foreach
>>>> > >> (group by_hour_client by hour)
>>>> > >> generate
>>>> > >> group as hour,
>>>> > >> COUNT(by_hour_client) as num_dist_clients,
>>>> > >> SUM(num_reqs) as total_requests;
>>>> >
>>>> > If I understood correctly that's because the num_reqs is in the bag, as a
>>>> > result of the
>>>> > * (group by_hour_client by hour)*
>>>> > correct? So I changed the last line to
>>>> > *SUM(by_hour_client.num_reqs) as total_requests;*
>>>> > and it worked (it took a little more than 29 seconds).
>>>> >
>>>> > Thanks for your help,
>>>> > David
>>>> >
>>>> >
>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>> >
>>>> > > by_hour_client =
>>>> > >   foreach
>>>> > >     (group logs by (hour, client) parallel $p)
>>>> > >   generate
>>>> > >     flatten(group) as (hour, client),
>>>> > >     COUNT(logs) as num_reqs;
>>>> > >
>>>> > > by_hour =
>>>> > >   foreach
>>>> > >     (group by_hour_client by hour parallel $p2)
>>>> > >   generate
>>>> > >     group as hour,
>>>> > >     COUNT(by_hour_client) as num_dist_clients,
>>>> > >     SUM(num_reqs) as total_requests;
>>>> > >
>>>> > > You can also do this using a nested distinct, but depending on what your
>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>> > > they do push part of this up to the mappers):
>>>> > >
>>>> > > by_hour =
>>>> > >   foreach (group logs by hour) {
>>>> > >     dist_clients = distinct logs.client;
>>>> > >     generate
>>>> > >       group as hour,
>>>> > >       COUNT(dist_clients) as num_dist_clients,
>>>> > >       COUNT(logs) as total_requests;
>>>> > >   }
>>>> > >
>>>> > > D
>>>> > >
>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>> > >
>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>>> > > > requests and of visits by hour.
>>>> > > >
>>>> > > > I managed to get the requests, but how do I get the visits?
>>>> > > >
>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>> > > >   FLATTEN(
>>>> > > >     REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>> > > >   ) AS (
>>>> > > >     client: chararray,
>>>> > > >     username: chararray,
>>>> > > >     date: chararray,
>>>> > > >     hour: chararray,
>>>> > > >     minute: chararray,
>>>> > > >     second: chararray,
>>>> > > >     timeZone: chararray,
>>>> > > >     request: chararray,
>>>> > > >     statusCode: int,
>>>> > > >     bytesSent: chararray,
>>>> > > >     referer: chararray,
>>>> > > >     userAgent: chararray,
>>>> > > >     remoteUser: chararray,
>>>> > > >     timeTaken: chararray
>>>> > > >   );
>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>> > > > DESCRIBE A;
>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>> > > >
>>>> > > > How can I now get the distinct count of clients per hour?
>>>> > > >
>>>> > > > Thanks for your help!
>>>> > > >
>>>> > > > --
>>>> > > > David Riccitelli
>>>> > > >
>>>> > > > ********************************************************************************
>>>> > > > InsideOut10 s.r.l.
>>>> > > > P.IVA: IT-11381771002
>>>> > > > Fax: +39 0110708239
>>>> > > > ---
>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> > > > Twitter: ziodave
>>>> > > > ---
>>>> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>> > > > ********************************************************************************

--
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
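
One possible workaround for the ClassCastException above, sketched under assumptions that are not confirmed anywhere in this thread: REGEX_EXTRACT_ALL appears to hand back its captured groups as strings, so declaring a field as long in the AS clause may not convert the underlying value, and MAX then fails when it expects a Long. The sketch keeps timeTaken declared as chararray (as in the original LOGS_BASE definition) and casts it explicitly in a later FOREACH. It also drops rows whose fields are null, on the assumption that lines not matching the regular expression produce such rows, which might explain the (,0,0) result. The relation names matched, typed, by_client and max_time are made up for illustration.

```pig
-- Sketch only: assumes LOGS_BASE is defined as in the quoted message,
-- with timeTaken declared as chararray in the AS clause.

-- Assumption: lines that do not match the regular expression yield null
-- fields, so drop them before grouping.
matched = FILTER LOGS_BASE BY client IS NOT NULL;

-- Cast explicitly so that MAX receives long values rather than strings.
typed = FOREACH matched GENERATE client, (long) timeTaken AS timeTaken;

-- Maximum time taken per client.
by_client = GROUP typed BY client;
max_time = FOREACH by_client GENERATE
    group AS client,
    MAX(typed.timeTaken) AS max_time_taken;

DUMP max_time;
```

If the missing conversion really is the cause, DUMP max_time should print one (client, max_time_taken) tuple per client instead of raising ERROR 2106. If the error persists, the cause lies elsewhere and the sketch does not apply.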
