Hello,

First of all, I'm new to Pig and NoSQL, so I hope you'll forgive any stupid questions ;-)
So, I'm playing with OpenTSDB (a software layer on top of HBase for handling time series data), and now I'd like to run some data mining queries on my timestamped data. I found that Pig could be a solution, so I tried to get it working on top of the OpenTSDB data in HBase. It nearly works, but I'm still confused.

OpenTSDB schema:

    hbase(main):011:0> describe 'tsdb-uid'
    DESCRIPTION                                                          ENABLED
     {NAME => 'tsdb-uid', FAMILIES => [{NAME => 'id',                    true
     BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
     COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
     BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
     {NAME => 'name', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
     COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
     BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

    hbase(main):012:0> describe 'tsdb'
    DESCRIPTION                                                          ENABLED
     {NAME => 'tsdb', FAMILIES => [{NAME => 't',                         true
     BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
     COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
     BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Sample uid data:

    hbase(main):014:0> scan 'tsdb-uid'
    ROW               COLUMN+CELL
     \x00\x00\x01     column=name:metrics, timestamp=1314801674803, value=proc.loadavg.1m
     \x00\x00\x01     column=name:tagk, timestamp=1314801684953, value=validity
     \x00\x00\x01     column=name:tagv, timestamp=1314801685000, value=true
     \x00\x00\x02     column=name:metrics, timestamp=1314801674849, value=proc.loadavg.5m
     \x00\x00\x02     column=name:tagk, timestamp=1314801685049, value=device
     \x00\x00\x02     column=name:tagv, timestamp=1314801685096, value=Device1
     \x00\x00\x03     column=name:metrics, timestamp=1314801674898, value=Measurement_1
     \x00\x00\x03     column=name:tagk, timestamp=1314801685144, value=accuracy
     \x00\x00\x03     column=name:tagv, timestamp=1314801693030, value=Device2
     \x00\x00\x04     column=name:metrics, timestamp=1314801674947, value=Measurement_2
     \x00\x00\x05     column=name:metrics, timestamp=1314801674994, value=Measurement_3
     Device1          column=id:tagv, timestamp=1314801685097, value=\x00\x00\x02
     Device2          column=id:tagv, timestamp=1314801693031, value=\x00\x00\x03
     Measurement_1    column=id:metrics, timestamp=1314801674899, value=\x00\x00\x03
     Measurement_2    column=id:metrics, timestamp=1314801674948, value=\x00\x00\x04
     Measurement_3    column=id:metrics, timestamp=1314801674995, value=\x00\x00\x05
     accuracy         column=id:tagk, timestamp=1314801685145, value=\x00\x00\x03
     device           column=id:tagk, timestamp=1314801685050, value=\x00\x00\x02
     proc.loadavg.1m  column=id:metrics, timestamp=1314801674804, value=\x00\x00\x01
     proc.loadavg.5m  column=id:metrics, timestamp=1314801674850, value=\x00\x00\x02
     true             column=id:tagv, timestamp=1314801685002, value=\x00\x00\x01
     validity         column=id:tagk, timestamp=1314801684955, value=\x00\x00\x01

This table holds the metrics (the types of timestamped data, in id:metrics) and the tags that describe the data (tagk for the tag key and tagv for the tag value, e.g. validity = true).

From Pig, when I want to retrieve only the metrics and their value (= the id used in the data table), I do:

    tsd_metrics = LOAD 'hbase://tsdb-uid'
        using org.apache.pig.backend.hadoop.hbase.HBaseStorage('id:metrics', '-loadKey true')
        AS (metrics:bytearray);
    dump tsd_metrics;

Output:

    HadoopVersion  PigVersion      UserId    StartedAt            FinishedAt           Features
    0.20.2         0.8.1-SNAPSHOT  opentsdb  2011-09-06 13:39:27  2011-09-06 13:39:34  UNKNOWN

    Success!

    Job Stats (time in seconds):
    JobId           Alias        Feature   Outputs
    job_local_0004  tsd_metrics  MAP_ONLY  file:/tmp/temp-1850282462/tmp1589556736,

    Input(s):
    Successfully read records from: "hbase://tsdb-uid"

    Output(s):
    Successfully stored records in: "file:/tmp/temp-1850282462/tmp1589556736"

    Job DAG:
    job_local_0004

    (Measurement_1,)
    (Measurement_2,)
    (Measurement_3,)
    (proc.loadavg.1m,)
    (proc.loadavg.5m,)

So that's nearly OK, except that the value (= id) is displayed as null instead of, for example, \x00\x00\x03 in the case of Measurement_1.

Any idea?

Thanks!
shazz