There's something strange in the results, however:
(00,129,30096)
(01,91,16487)
(02,57,11686)
(03,41,6041)
(04,30,4882)
(05,33,4154)
(06,65,8031)
(07,66,12260)
(08,95,17924)
(09,131,21187)
(10,162,26607)
(11,155,28503)
(12,146,27863)
(13,152,29130)
(14,159,32784)
(15,150,28898)
(16,143,28973)
(17,169,29024)
(18,199,26585)
(19,182,28803)
(20,224,32511)
(21,232,38584)
(22,225,39924)
(23,191,33606)
(,0,0)


What is the last line?
 (,0,0)
The hour is empty and both counts are zero, so it shouldn't really be there, correct?

(Using pig 0.9.0)
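
My guess, which I have not verified: log lines that do not match the
REGEX_EXTRACT_ALL pattern come out with null fields, GROUP collects all null
hours under a single null key, and COUNT skips tuples whose first field is
null. That would give exactly (,0,0). A minimal sketch of a filter that would
drop such rows before grouping, assuming `logs` is the relation holding the
parsed fields (logs_ok, logs_bad and bad_count are names made up here for
illustration):

```pig
-- Assumption: `logs` holds the fields parsed by REGEX_EXTRACT_ALL.
-- Keep only the rows where the regex matched, i.e. hour is not null.
logs_ok = FILTER logs BY hour IS NOT NULL;

-- Optional check: count the rows that failed to parse.
-- COUNT_STAR counts tuples even when their fields are null;
-- DUMP prints nothing if no row failed.
logs_bad = FILTER logs BY hour IS NULL;
bad_count = FOREACH (GROUP logs_bad ALL) GENERATE COUNT_STAR(logs_bad);
DUMP bad_count;
```

Grouping logs_ok instead of logs in either of the quoted queries should then
leave out the empty-key row, if the assumption above holds.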


Thanks,
David

On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Right, that should read "by_hour_client.num_reqs".
>
> Don't trust relative measurements you get for small data on a single
> computer in local mode. Things change when you start running on hundreds of
> gigs with real skew on a cluster.
>
> D
>
> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]> wrote:
>
> > Thanks Dmitriy,
> >
> > The second method took less than 26 seconds on my computer (~550,000 lines).
> > The first method is giving me the following error:
> >
> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
> > does not exist in schema:
> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
> >
> > when I try to define by_hour (after having defined by_hour_client):
> >
> > grunt> by_hour_client =
> > >>  foreach
> > >>    (group logs by (hour, client))
> > >>  generate
> > >>    flatten(group) as (hour, client),
> > >>    COUNT(logs) as num_reqs;
> > grunt> by_hour =
> > >>  foreach
> > >>    (group by_hour_client by hour)
> > >>  generate
> > >>    group as hour,
> > >>    COUNT(by_hour_client) as num_dist_clients,
> > >>    SUM(num_reqs) as total_requests;
> >
> > If I understood correctly, that's because num_reqs is inside the bag
> > produced by
> >     (group by_hour_client by hour)
> > correct? So I changed the last line to
> >     SUM(by_hour_client.num_reqs) as total_requests;
> > and it worked (it took a little more than 29 seconds).
> >
> > Thanks for your help,
> > David
> >
> >
> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> >
> > > by_hour_client =
> > >  foreach
> > >    (group logs by (hour, client) parallel $p)
> > >  generate
> > >    flatten(group) as (hour, client),
> > >    COUNT(logs) as num_reqs;
> > >
> > > by_hour =
> > >  foreach
> > >    (group by_hour_client by hour parallel $p2)
> > >  generate
> > >    group as hour,
> > >    COUNT(by_hour_client) as num_dist_clients,
> > >    SUM(num_reqs) as total_requests;
> > >
> > > You can also do this using a nested distinct, but depending on what your
> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
> > > individual reducers that have to do the inner distinct in memory (although
> > > they do push part of this up to the mappers):
> > >
> > > by_hour =
> > >  foreach (group logs by hour) {
> > >   dist_clients = distinct logs.client;
> > >   generate
> > >    group as hour,
> > >    COUNT(dist_clients) as num_dist_clients,
> > >    COUNT(logs) as total_requests;
> > > }
> > >
> > > D
> > >
> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
> > >
> > > > I'm analyzing a daily Apache log file. I'd like to get the number of
> > > > requests and the number of visits per hour.
> > > >
> > > > I managed to get the requests, but how do I get the visits?
> > > >
> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
> > > >  FLATTEN(
> > > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
> > > >  ) AS (
> > > >    client:   chararray,
> > > >    username: chararray,
> > > >    date: chararray,
> > > >    hour: chararray,
> > > >    minute: chararray,
> > > >    second: chararray,
> > > >    timeZone: chararray,
> > > >    request:  chararray,
> > > >    statusCode: int,
> > > >    bytesSent: chararray,
> > > >    referer:  chararray,
> > > >    userAgent: chararray,
> > > >    remoteUser: chararray,
> > > >    timeTaken: chararray
> > > > );
> > > > grunt> A = GROUP LOGS_BASE BY hour;
> > > > grunt> DESCRIBE A;
> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,
> > > > date: chararray,hour: chararray,minute: chararray,second: chararray,
> > > > timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,
> > > > referer: chararray,userAgent: chararray,remoteUser: chararray,
> > > > timeTaken: chararray)}}
> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
> > > > grunt> C = ORDER B BY hour; -- requests by hour
> > > >
> > > > How can I now get the distinct count of clients per hour?
> > > >
> > > > Thanks for your help!
> > > >
> > > > --
> > > > David Riccitelli
> > > >
> > > > ********************************************************************************
> > > > InsideOut10 s.r.l.
> > > > P.IVA: IT-11381771002
> > > > Fax: +39 0110708239
> > > > ---
> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
> > > > Twitter: ziodave
> > > > ---
> > > > Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> > > > ********************************************************************************
> > >
> >
> >
> >
> > --
> > David Riccitelli
> >
>



-- 
David Riccitelli
