Sorry for this long sequence of messages, but I'm posting things as I
continue testing/investigating.

Might this be relevant to my case?

http://www.mail-archive.com/[email protected]/msg02258.html
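
A possible workaround, sketched under assumptions this thread does not confirm: the ClassCastException quoted below is consistent with REGEX_EXTRACT_ALL handing back chararray values while the AS clause only labels timeTaken as long without converting it. If that is the cause, keeping every extracted field as chararray and casting explicitly in a second FOREACH may avoid the error. LOGS_BASE is the relation from the original post; logs_typed, by_client and max_taken are hypothetical names.

```pig
-- Assumption: LOGS_BASE declares timeTaken as chararray in its AS clause,
-- as in the original post, so the values really are strings at this point.
logs_typed = FOREACH LOGS_BASE GENERATE
    client,
    hour,
    (long) timeTaken AS timeTaken;  -- explicit chararray-to-long cast

-- Group by client and take the maximum of the now-numeric field.
by_client = GROUP logs_typed BY client;
max_taken = FOREACH by_client GENERATE
    group AS client,
    MAX(logs_typed.timeTaken) AS max_time_taken;

DUMP max_taken;
```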

Thanks,
David

On Fri, Aug 19, 2011 at 7:31 PM, David Riccitelli <[email protected]>wrote:

> I tried changing this line, from:
> RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
> USING TextLoader() AS (line:chararray);
>
> to:
> RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
> USING PigStorage() AS (line:chararray);
>
> It does not fix the issue, since the problem depends on REGEX_EXTRACT_ALL,
> which produces the logs schema.
>
> Is there any incompatibility between REGEX_EXTRACT_ALL and the MAX
> function?
>
> Thanks for your help,
> David
>
> On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <[email protected]>wrote:
>
>> I noticed that this issue arises only if I load the initial data with
>> TextLoader() and parse it with REGEX_EXTRACT_ALL.
>>
>> If I use PigStorage (splitting on spaces, without a regular expression,
>> i.e. without REGEX_EXTRACT_ALL), it works.
>>
>> But I need REGEX_EXTRACT_ALL in order to parse the lines correctly...
>>
>> Does it make sense?
>>
>> David
>>
>>
>> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <[email protected]>wrote:
>>
>>> I still can't manage to accomplish my objectives. I'm now trying to get
>>> the max time taken so, as a test, I do:
>>> grunt> A = GROUP logs BY client;
>>>
>>> then (timeTaken is long):
>>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>>
>>> when I dump it, I get the following error:
>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error
>>> while computing max in Initial
>>> (...)
>>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast
>>> to java.lang.Long
>>> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>>
>>> Initially I thought that some timeTaken values were not compatible with
>>> the long data type, but I checked and re-checked. I also extract timeTaken
>>> with a \d+ regular expression.
>>>
>>> What am I doing wrong?
>>>
>>> Thanks!
>>> David
>>>
>>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <[email protected]>wrote:
>>>
>>>> I tried with another log file and that does not happen, so I suppose
>>>> there's some 'corrupted' line in the one I was testing.
>>>>
>>>>
>>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <[email protected]> wrote:
>>>>
>>>>> There's something strange in the results, however:
>>>>> (00,129,30096)
>>>>> (01,91,16487)
>>>>> (02,57,11686)
>>>>> (03,41,6041)
>>>>> (04,30,4882)
>>>>> (05,33,4154)
>>>>> (06,65,8031)
>>>>> (07,66,12260)
>>>>> (08,95,17924)
>>>>> (09,131,21187)
>>>>> (10,162,26607)
>>>>> (11,155,28503)
>>>>> (12,146,27863)
>>>>> (13,152,29130)
>>>>> (14,159,32784)
>>>>> (15,150,28898)
>>>>> (16,143,28973)
>>>>> (17,169,29024)
>>>>> (18,199,26585)
>>>>> (19,182,28803)
>>>>> (20,224,32511)
>>>>> (21,232,38584)
>>>>> (22,225,39924)
>>>>> (23,191,33606)
>>>>> (,0,0)
>>>>>
>>>>>
>>>>> What is the last line, (,0,0)? Its count is zero, so it shouldn't
>>>>> really be there, correct?
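>>>>>
>>>>> One plausible explanation, stated here as an assumption rather than
>>>>> something confirmed in this thread: lines that do not match the regular
>>>>> expression end up with null fields, the null hours are grouped together,
>>>>> and the counts over that group come out as zero. A minimal sketch of
>>>>> dropping such rows before grouping (logs_clean is a hypothetical name;
>>>>> logs is the parsed relation used above):
>>>>>
>>>>> ```pig
>>>>> -- Keep only rows where the regex matched, i.e. where hour was extracted.
>>>>> logs_clean = FILTER logs BY hour IS NOT NULL;
>>>>>
>>>>> -- Group the filtered relation instead of logs in the later steps.
>>>>> by_hour_client =
>>>>>   foreach
>>>>>     (group logs_clean by (hour, client))
>>>>>   generate
>>>>>     flatten(group) as (hour, client),
>>>>>     COUNT(logs_clean) as num_reqs;
>>>>> ```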
>>>>>
>>>>> (Using pig 0.9.0)
>>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <[email protected]>wrote:
>>>>>
>>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>>
>>>>>> Don't trust relative measurements you get for small data on a single
>>>>>> computer in local mode. Things change when you start running on
>>>>>> hundreds of gigs with real skew on a cluster.
>>>>>>
>>>>>> D
>>>>>>
>>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <[email protected]
>>>>>> >wrote:
>>>>>>
>>>>>> > Thanks Dmitriy,
>>>>>> >
>>>>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>>>>> > The first method is giving me the following error:
>>>>>> >
>>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>>> > does not exist in schema:
>>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>>> >
>>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>>> >
>>>>>> > grunt> by_hour_client =
>>>>>> > >>  foreach
>>>>>> > >>    (group logs by (hour, client))
>>>>>> > >>  generate
>>>>>> > >>    flatten(group) as (hour, client),
>>>>>> > >>    COUNT(logs) as num_reqs;
>>>>>> > grunt> by_hour =
>>>>>> > >>  foreach
>>>>>> > >>    (group by_hour_client by hour)
>>>>>> > >>  generate
>>>>>> > >>    group as hour,
>>>>>> > >>    COUNT(by_hour_client) as num_dist_clients,
>>>>>> > >>    SUM(num_reqs) as total_requests;
>>>>>> >
>>>>>> > If I understood correctly, that's because num_reqs is inside the bag
>>>>>> > produced by
>>>>>> >     (group by_hour_client by hour)
>>>>>> > correct? So I changed the last line to
>>>>>> >     SUM(by_hour_client.num_reqs) as total_requests;
>>>>>> > and it worked (it took a little more than 29 seconds).
>>>>>> >
>>>>>> > Thanks for your help,
>>>>>> > David
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>>> >
>>>>>> > > by_hour_client =
>>>>>> > >  foreach
>>>>>> > >    (group logs by (hour, client) parallel $p)
>>>>>> > >  generate
>>>>>> > >    flatten(group) as (hour, client),
>>>>>> > >    COUNT(logs) as num_reqs;
>>>>>> > >
>>>>>> > > by_hour =
>>>>>> > >  foreach
>>>>>> > >    (group by_hour_client by hour parallel $p2)
>>>>>> > >  generate
>>>>>> > >    group as hour,
>>>>>> > >    COUNT(by_hour_client) as num_dist_clients,
>>>>>> > >    SUM(num_reqs) as total_requests;
>>>>>> > >
>>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>>> > > they do push part of this up to the mappers):
>>>>>> > >
>>>>>> > > by_hour =
>>>>>> > >  foreach (group logs by hour) {
>>>>>> > >   dist_clients = distinct logs.client;
>>>>>> > >   generate
>>>>>> > >    group as hour,
>>>>>> > >    COUNT(dist_clients) as num_dist_clients,
>>>>>> > >    COUNT(logs) as total_requests;
>>>>>> > > }
>>>>>> > >
>>>>>> > > D
>>>>>> > >
>>>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <[email protected]> wrote:
>>>>>> > >
>>>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number of
>>>>>> > > > requests and of visits by hour.
>>>>>> > > >
>>>>>> > > > I managed to get the requests, but how do I get the visits?
>>>>>> > > >
>>>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
>>>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>>>> > > >  FLATTEN(
>>>>>> > > >    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>>>> > > >  ) AS (
>>>>>> > > >    client:   chararray,
>>>>>> > > >    username: chararray,
>>>>>> > > >    date: chararray,
>>>>>> > > >    hour: chararray,
>>>>>> > > >    minute: chararray,
>>>>>> > > >    second: chararray,
>>>>>> > > >    timeZone: chararray,
>>>>>> > > >    request:  chararray,
>>>>>> > > >    statusCode: int,
>>>>>> > > >    bytesSent: chararray,
>>>>>> > > >    referer:  chararray,
>>>>>> > > >    userAgent: chararray,
>>>>>> > > >    remoteUser: chararray,
>>>>>> > > >    timeTaken: chararray
>>>>>> > > > );
>>>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>>>> > > > DESCRIBE A;
>>>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,
>>>>>> > > > date: chararray,hour: chararray,minute: chararray,second: chararray,
>>>>>> > > > timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,
>>>>>> > > > referer: chararray,userAgent: chararray,remoteUser: chararray,
>>>>>> > > > timeTaken: chararray)}}
>>>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>>>> > > >
>>>>>> > > > How can I now get the distinct count of clients per hour?
>>>>>> > > >
>>>>>> > > > Thanks for your help!
>>>>>> > > >
>>>>>> > > > --
>>>>>> > > > David Riccitelli
>>>>>> > > >
>>>>>> > > > ********************************************************************************
>>>>>> > > > InsideOut10 s.r.l.
>>>>>> > > > P.IVA: IT-11381771002
>>>>>> > > > Fax: +39 0110708239
>>>>>> > > > ---
>>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>> > > > Twitter: ziodave
>>>>>> > > > ---
>>>>>> > > > Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>> > > > ********************************************************************************
>>>>>> > >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner 
Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
