Hello,

First of all, I'm new to Pig and NoSQL, so I hope you'll forgive any stupid
questions ;-)

So, I'm playing with OpenTSDB (a software layer on top of HBase for handling
time series data), and now I'd like to run some data mining queries on my
timestamped data. Pig looked like a possible solution, so I tried to get it
working on top of the OpenTSDB data in HBase. It nearly works, but I'm still
confused.

OpenTSDB schema:
hbase(main):011:0> describe 'tsdb-uid'
DESCRIPTION                                                          ENABLED
 {NAME => 'tsdb-uid', FAMILIES => [                                  true
  {NAME => 'id', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
   COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'name', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
   COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

hbase(main):012:0> describe 'tsdb'
DESCRIPTION                                                          ENABLED
 {NAME => 'tsdb', FAMILIES => [                                      true
  {NAME => 't', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
   COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Here is some sample UID data:
hbase(main):014:0> scan 'tsdb-uid'
ROW               COLUMN+CELL
 \x00\x00\x01     column=name:metrics, timestamp=1314801674803, value=proc.loadavg.1m
 \x00\x00\x01     column=name:tagk, timestamp=1314801684953, value=validity
 \x00\x00\x01     column=name:tagv, timestamp=1314801685000, value=true
 \x00\x00\x02     column=name:metrics, timestamp=1314801674849, value=proc.loadavg.5m
 \x00\x00\x02     column=name:tagk, timestamp=1314801685049, value=device
 \x00\x00\x02     column=name:tagv, timestamp=1314801685096, value=Device1
 \x00\x00\x03     column=name:metrics, timestamp=1314801674898, value=Measurement_1
 \x00\x00\x03     column=name:tagk, timestamp=1314801685144, value=accuracy
 \x00\x00\x03     column=name:tagv, timestamp=1314801693030, value=Device2
 \x00\x00\x04     column=name:metrics, timestamp=1314801674947, value=Measurement_2
 \x00\x00\x05     column=name:metrics, timestamp=1314801674994, value=Measurement_3
 Device1          column=id:tagv, timestamp=1314801685097, value=\x00\x00\x02
 Device2          column=id:tagv, timestamp=1314801693031, value=\x00\x00\x03
 Measurement_1    column=id:metrics, timestamp=1314801674899, value=\x00\x00\x03
 Measurement_2    column=id:metrics, timestamp=1314801674948, value=\x00\x00\x04
 Measurement_3    column=id:metrics, timestamp=1314801674995, value=\x00\x00\x05
 accuracy         column=id:tagk, timestamp=1314801685145, value=\x00\x00\x03
 device           column=id:tagk, timestamp=1314801685050, value=\x00\x00\x02
 proc.loadavg.1m  column=id:metrics, timestamp=1314801674804, value=\x00\x00\x01
 proc.loadavg.5m  column=id:metrics, timestamp=1314801674850, value=\x00\x00\x02
 true             column=id:tagv, timestamp=1314801685002, value=\x00\x00\x01
 validity         column=id:tagk, timestamp=1314801684955, value=\x00\x00\x01

These rows hold the metrics (the types of timestamped data, stored under
id:metrics) and the tags that describe the data (tagk is the tag key and tagv
the tag value, e.g. validity = true).
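To make the name-to-ID mapping concrete, here is a minimal Python sketch of how
the cell values in the scan could be read as numbers. It assumes the IDs are
3-byte big-endian unsigned integers, which is consistent with the scan output
but is my assumption, not something taken from the OpenTSDB documentation. The
helper names uid_to_int and int_to_uid are hypothetical, and the dictionary just
copies the id:metrics rows from the scan.

```python
# Hypothetical helpers for illustration only; they are not part of
# OpenTSDB, HBase, or Pig.

def uid_to_int(uid: bytes) -> int:
    """Read a UID cell value as a big-endian unsigned integer (assumed layout)."""
    return int.from_bytes(uid, byteorder="big")


def int_to_uid(value: int, width: int = 3) -> bytes:
    """Inverse of uid_to_int; the 3-byte width is taken from the scan output."""
    return value.to_bytes(width, byteorder="big")


# id:metrics rows copied from the 'tsdb-uid' scan (metric name -> UID bytes).
metrics = {
    "proc.loadavg.1m": b"\x00\x00\x01",
    "proc.loadavg.5m": b"\x00\x00\x02",
    "Measurement_1": b"\x00\x00\x03",
    "Measurement_2": b"\x00\x00\x04",
    "Measurement_3": b"\x00\x00\x05",
}

for name, uid in sorted(metrics.items()):
    # Show the hex form because the raw bytes are control characters.
    print(name, uid.hex(), uid_to_int(uid))
```

Note that the ID bytes are all control characters, so a tool that prints them
raw may show nothing visible. Whether that has anything to do with the empty
field I get from Pig is a guess I have not confirmed.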

So from Pig, when I want to retrieve only the metrics and their values (the
value being the ID used in the data table), I do:
tsd_metrics = LOAD 'hbase://tsdb-uid'
    using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'id:metrics', '-loadKey true')
    AS (metrics:bytearray);
dump tsd_metrics;

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
0.20.2  0.8.1-SNAPSHOT  opentsdb        2011-09-06 13:39:27     2011-09-06 13:39:34     UNKNOWN
Success!
Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0004  tsd_metrics     MAP_ONLY        file:/tmp/temp-1850282462/tmp1589556736,
Input(s):
Successfully read records from: "hbase://tsdb-uid"
Output(s):
Successfully stored records in: "file:/tmp/temp-1850282462/tmp1589556736"
Job DAG:
job_local_0004

(Measurement_1,)
(Measurement_2,)
(Measurement_3,)
(proc.loadavg.1m,)
(proc.loadavg.5m,)

So that's nearly OK, except that the value (= the ID) is displayed as null
instead of \x00\x00\x03, for example, in the case of Measurement_1.

Any ideas?

Thanks!

shazz
