Re: How to access to the tuple items of REGEX_EXTRACT_ALL ?

brice lecomte Thu, 28 Feb 2013 06:48:51 -0800

Good news:
the tuple returned by REGEX_EXTRACT_ALL needs to be cast to a typed tuple
before it is passed to FLATTEN; the items can then be named, for example:

LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        (tuple(CHARARRAY,int,CHARARRAY,CHARARRAY,CHARARRAY,CHARARRAY))
        REGEX_EXTRACT_ALL(line,
            '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)')
    ) as (m:chararray, d:int, time:chararray,
          hostname:chararray, service:chararray, info:chararray);
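
With the fields named, the grouping from my original question should no
longer need the store-and-reload step. A minimal sketch, assuming the
LOGS_BASE relation defined above; the aliases day_frequency and day_counts
and the field name n are only illustrative, and I have not run this against
the full log:

```pig
-- Group directly on the named fields produced by the cast + FLATTEN above.
day_frequency = GROUP LOGS_BASE BY (d, service);

-- Count the log lines for each (day, service) pair.
day_counts = FOREACH day_frequency GENERATE
    FLATTEN(group) AS (d, service),
    COUNT(LOGS_BASE) AS n;

DUMP day_counts;
```

Lines that do not match the regex yield a null tuple from REGEX_EXTRACT_ALL,
so it may be worth filtering on "d IS NOT NULL" before grouping.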



On 28/02/2013 11:27, brice lecomte wrote:
> Hi Johnny,
> bad news:
>
> grunt> REGISTER json-simple-1.1.1.jar
> grunt> REGISTER lib/jackson-core-asl-1.8.8.jar
> grunt> REGISTER lib/jackson-mapper-asl-1.8.8.jar
> grunt> REGISTER /usr/local/pig-0.10.1-src/build/ivy/lib/Pig/avro-1.5.3.jar
> grunt> REGISTER /usr/local/pig-0.10.1-src/contrib/piggybank/java/piggybank.jar
> grunt> logs = LOAD 'auth.log' as (f1:chararray);
> grunt> c = foreach logs  generate REGEX_EXTRACT_ALL(f1,
> '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
> grunt> df = GROUP c by ($1, $4);
> 2013-02-28 10:57:32,630 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000:
> <line 3, column 17> Out of bound access. Trying to access non-existent
> column: 1. Schema org.apache.pig.builtin.regex_extract_all_f1_4:tuple()
> *has 1 column(s)*.
> Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
> grunt> dump c;
> [...]
>
> *((Feb,28,10:50:13,hadoop-master,sshd,debug1: session_input_channel_req:
> session 0 req window-change))*
>
> => looks like a tuple nested inside a tuple?
>
> grunt> df = GROUP c by ($1, $4);
> 2013-02-28 10:57:59,274 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000:
> <line 3, column 17> Out of bound access. Trying to access non-existent
> column: 1. Schema org.apache.pig.builtin.regex_extract_all_f1_10:tuple()
> has 1 column(s).
> Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
> grunt> df = GROUP c by (c.$1, c.$4);
> 2013-02-28 10:58:06,873 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: Pig script failed to parse:
> <line 3, column 17> Invalid scalar projection: c
> Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
> grunt> df = GROUP c by (c.$0.$1, c.$0.$4);
> grunt> dump df;
> [...]
>
> 2013-02-28 10:58:46,781 [Thread-16] WARN 
> org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar
> has more than one row in the output. 1st :
> ((Feb,24,07:39:01,hadoop-master,CRON,pam_unix(cron:session): session
> opened for user root by (uid=0))), 2nd
> :((Feb,24,07:39:01,hadoop-master,CRON,pam_unix(cron:session): session
> closed for user root))
>         at
> org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
> [...]
>
> grunt> df = GROUP c by (FLATTEN(c.$1), FLATTEN(c.$4));
> 2013-02-28 10:59:31,187 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: Pig script failed to parse:
> <line 4, column 25> Invalid scalar projection: c
> Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
>
> grunt> df = GROUP c by (FLATTEN(c).$1, FLATTEN(c).$4);
> 2013-02-28 10:59:51,062 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: Pig script failed to parse:
> <line 4, column 25> Invalid scalar projection: c : A column needs to be
> projected from a relation for it to be used as a scalar
> Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
>
> grunt> df = GROUP c by (FLATTEN(c.$0).$1, FLATTEN(c.$0).$4);
> 2013-02-28 11:17:46,744 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1070: Could not resolve FLATTEN using imports: [,
> org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
>
> I even tried the Perl way:
> grunt> (m:chararray, d:int, time:chararray, hostname:chararray,
> service:chararray, info:chararray) = foreach logs  generate
> REGEX_EXTRACT_ALL(f1,
> '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
> 2013-02-28 11:23:47,995 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Lexical error at line 1, column 1. 
> Encountered: "(" (40), after : ""
>
> :(
>
> On 27/02/2013 20:26, Johnny Zhang wrote:
>> Hi, Brice:
>> Instead of saving and reloading it, can you try 'dump c;' first and then use c.$0?
>>
>> Johnny
>>
>>
>> On Wed, Feb 27, 2013 at 8:49 AM, brice lecomte <[email protected]> wrote:
>>
>>> Hello,
>>> --Pig 0.10.0--
>>> I'd like to access directly the items of the result of:
>>> grunt> c = foreach logs  generate REGEX_EXTRACT_ALL(f1,
>>> '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
>>> grunt> illustrate c;
>>>
>>>
>>> -------------------------------------------------------------------------------------------------------------
>>> | logs     | f1:chararray                                                                                   |
>>> -------------------------------------------------------------------------------------------------------------
>>> |          | Feb 24 20:09:01 hadoop-master CRON[3574]: pam_unix(cron:session): session closed for user root |
>>> -------------------------------------------------------------------------------------------------------------
>>>
>>> ----------------------------------------------------------------------------
>>> | c     | org.apache.pig.builtin.regex_extract_all_f1_178:tuple()          |
>>> ----------------------------------------------------------------------------
>>> |       | (Feb, ..., pam_unix(cron:session): session closed for user root) |
>>> ----------------------------------------------------------------------------
>>>
>>> but the only way I found is to store and reload it:
>>>
>>> grunt> store c into 'pig/AUTH.result';
>>> grunt> auth = LOAD 'pig/AUTH.result/part-m-00000' USING PigStorage(',')
>>> AS (m:chararray, d:int, time:chararray, hostname:chararray,
>>> service:chararray, info:chararray);
>>> grunt> day_frequency = GROUP auth by (d,service);
>>> ...
>>>
>>> Is there a way to name the tuple items, or to access them with something
>>> like c.$0 or FLATTEN(c).$0?
>>>
>>> Thanks,
>>> Brice
>>>
>>>
>
