HBaseStorage does not care about delimiter in STORE
---------------------------------------------------

                 Key: PIG-2289
                 URL: https://issues.apache.org/jira/browse/PIG-2289
             Project: Pig
          Issue Type: Bug
          Components: internal-udfs
    Affects Versions: 0.9.1, 0.10
         Environment: Hadoop, HBase, ZooKeeper from cdh3u1
                      Pig from github (version 0.9.1 then trunk:0.10)
            Reporter: Damien Hardy


I want to store in HBase a set of tuples generated by Pig streaming (inspired 
by http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/ ).

Here is my script:
set debug 'off'
DEFINE iplookup `wrapper.sh GeoIP`
ship ('wrapper.sh')
cache('/GeoIP/GeoIPcity.dat#GeoIP');

A = load 'log' using
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('default:body',
    '-gt=_f:squid_t:201109161405 -lte=_f:squid_t:201109161410 -loadKey')
    AS (rowkey, data);
B = LIMIT A 10;
C = FOREACH B {
        t = REGEX_EXTRACT(data,
            '([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}):([0-9]+) ', 1);
        generate rowkey, t;
}
D = STREAM C THROUGH iplookup AS (rowkey, ip, country_code, country, state, city);
DESCRIBE D;
-- DUMP D;
STORE D INTO 'geoip_pig' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'location:ip location:country_code location:country location:state location:city');


"DESCRIBE D;" shows the expected schema:
D: {rowkey: bytearray,ip: bytearray,country_code: bytearray,country: bytearray,state: bytearray,city: bytearray}

STORE only picks up the rowkey and puts the rest of the tuple into the first 
column (location:ip), as you can see:
hbase(main):033:0> get 'geoip_pig', "_f:squid_t:20110916140500_b:squid_s:200-1VPVjbVwywTpNtLA4mHl+A=="
COLUMN                 CELL
 location:city         timestamp=1316180980265, value=
 location:country      timestamp=1316180980265, value=
 location:country_code timestamp=1316180980265, value=
 location:ip           timestamp=1316180980265, value=90.9.213.170,FR,France,A9,Llupia
 location:state        timestamp=1316180980265, value=
5 row(s) in 0.0150 seconds

I also tried the '-delim=,' option, with no effect.
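
To make the symptom concrete, here is a small Python sketch (plain Python, not 
Pig or HBase code) that reproduces the shape of the stored row. It assumes the 
line coming out of wrapper.sh is comma-separated, which this report does not 
show directly, and it only illustrates what happens when such a line is kept 
as a single field versus split on commas. The helper to_cells is hypothetical 
and says nothing about where in Pig or HBaseStorage the split is skipped.

```python
# Value observed in the location:ip cell of the HBase row shown above.
observed = "90.9.213.170,FR,France,A9,Llupia"

# Column list passed to HBaseStorage in the STORE statement.
columns = [
    "location:ip",
    "location:country_code",
    "location:country",
    "location:state",
    "location:city",
]


def to_cells(line, delimiter):
    """Hypothetical helper: split one line and pair the fields with columns.

    Missing trailing fields are padded with empty strings, which mimics the
    empty cells seen in the 'get' output.
    """
    fields = line.split(delimiter)
    fields += [""] * (len(columns) - len(fields))
    return dict(zip(columns, fields))


# No tab in the line, so the whole line lands in the first column.
tab_cells = to_cells(observed, "\t")

# Splitting on commas gives one value per column, which is what I expected.
comma_cells = to_cells(observed, ",")

for name in columns:
    print(name, "| tab:", repr(tab_cells[name]), "| comma:", repr(comma_cells[name]))
```

The tab case matches the row I get in HBase; the comma case is the row I 
expected to get.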




        
