siddhartha Pattanaik created PIG-3119:
-----------------------------------------
Summary: REGEX_EXTRACT_ALL custom with aggregation function
Key: PIG-3119
URL: https://issues.apache.org/jira/browse/PIG-3119
Project: Pig
Issue Type: Bug
Components: build, grunt
Affects Versions: 0.9.1
Environment: OS -version
================================
Linux version 2.6.18-194.3.1.el5 ([email protected]) (gcc version
4.1.2 20080704 (Red Hat 4.1.2-48))
software installed
=======================
hadoop-1.0.4
pig-0.9.1
Hardware details
====================================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.098
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
Reporter: siddhartha Pattanaik
Priority: Critical
Fix For: 0.9.1
Hi ,
I have a use case in my project requirement,
The i/p file consist of the following pattern:-
192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET
/img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620
"http://www.starwars.com/explore/encyclopedia/characters/2/featured/"
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)"
"Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3;
__utma=181359608.119611689.1337206567.1337206567.1337206567.1;
__utmb=181359608.79.9.1337209104786; __utmc=181359608;
__utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none);
JSESSIONID=aHX_NQheRq08" "-" 0
I want to run a aggregate function along with regex_extract_all to extract the
desired data.
Even though the i/p file is parsing.I have issue with aggregate function
working on it.
Please find the below pig script:-
***************Ip_adress-count************************
Ip_adress_count.pig
A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+)
\\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"
"([^"]*)" (\\S+) ') ) AS
(
remoteAddr: chararray,
remoteLogname: chararray,
user: chararray,
time: chararray,
request: chararray,
status: int,
bytes_string: chararray,
referrer: chararray,
Mozilla: chararray,
wookie_cookie: chararray,
browser3: chararray,
acess_status:int
);
C = group B by remoteAddr;
D = foreach C generate COUNT(B) as ip_adress_count;
E = order D by ip_adress_count;
F = STORE E INTO ‘ip_adress_count/' using PigStorage(',');
Expected O/p
===========================
ip_adress_count
remoteAddr,ip_adress_count
192.168.90.36,19
192.168.90.37,1
There is no parsing issue but the aggregate function count() is not working
over the regex_extract_all function for regular expression.
Please do the need.The requirement is I need the count of the ip adresses from
the ip data.
thanks,
siddharth
contact -8763666372
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira