[
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987839#action_12987839
]
Bill Graham commented on PIG-1782:
----------------------------------
Assigning this to myself since I have a working patch, but the design of this
approach still needs to be vetted further.
One issue is that the number of columns per family is not constant from row to
row, so with a sparse table you'd have no idea which column name goes with each
value in the returned tuple. Another issue is that in HBase the column name
itself is often dynamic, descriptive data, and a cell can have multiple
timestamped values.
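To make the sparse-table problem concrete, here is a minimal sketch in plain
Python (not Pig or HBase code). The row keys, qualifiers, and values are made up
for illustration, and flat_tuple is a hypothetical helper that mimics returning
only the values of a family:

```python
# Hypothetical rows from the cpu: family, as qualifier -> latest value.
# All data here is invented for illustration.
rows = {
    "host1": {"sys.0": 12, "sys.1": 7, "user.0": 30},
    "host2": {"sys.1": 9, "user.0": 41, "user.1": 5},
}


def flat_tuple(row_key, columns):
    """Return (rowKey, v1, v2, ...) ordered by qualifier, with names dropped."""
    return (row_key,) + tuple(columns[q] for q in sorted(columns))


flat = {key: flat_tuple(key, cols) for key, cols in rows.items()}

for t in flat.values():
    print(t)
# ('host1', 12, 7, 30)
# ('host2', 9, 41, 5)
# Position 1 holds sys.0 for host1 but sys.1 for host2, and nothing in the
# tuples says so.
```

Both tuples have the same length, yet the same position refers to different
columns, which is why the column name has to travel with the value.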
* Option A:
Instead of returning a tuple of values, the load can return a tuple of tuples.
Each inner tuple is a two-tuple that contains the column descriptor and the
most recent value. This data structure would be returned if a 'cf:' style
column exists in the column list; the default behavior would remain for
explicit column names. This is the simplest approach.
* Option B:
Build out a richer (and more complex) data structure that also takes into
account multiple values and their timestamps: tuples nested four levels deep,
to capture the entire HBase KeyValue data structure. Something like this:
{code}
(
( column name, ( (value, ts), ... ) ), ...
)
{code}
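For comparison, both shapes can be sketched in plain Python from the same set of
cells. This is only an illustration of the two data structures, not the patch
itself. The cells list, the option_a and option_b function names, and the
newest-first ordering of versions are assumptions made for the example:

```python
from collections import defaultdict

# Hypothetical cells for one row of the cpu: family:
# (qualifier, timestamp, value). Invented for illustration.
cells = [
    ("sys.0", 100, 12),
    ("sys.0", 200, 15),
    ("user.1", 150, 5),
]


def option_a(cells):
    """Option A: a tuple of (column name, most recent value) pairs."""
    latest = {}
    for qualifier, ts, value in cells:
        if qualifier not in latest or ts > latest[qualifier][0]:
            latest[qualifier] = (ts, value)
    return tuple((q, latest[q][1]) for q in sorted(latest))


def option_b(cells):
    """Option B: a tuple of (column name, ((value, ts), ...)), newest first."""
    versions = defaultdict(list)
    for qualifier, ts, value in cells:
        versions[qualifier].append((value, ts))
    return tuple(
        (q, tuple(sorted(versions[q], key=lambda vt: vt[1], reverse=True)))
        for q in sorted(versions)
    )


a = option_a(cells)
b = option_b(cells)
print(a)  # (('sys.0', 15), ('user.1', 5))
print(b)  # (('sys.0', ((15, 200), (12, 100))), ('user.1', ((5, 150),)))
```

Option A drops the older sys.0 value at timestamp 100, while Option B keeps it
at the cost of one more level of nesting.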
Either way, each row would return a variable-length tuple containing further
variable-length tuples, so doing anything useful with variable column names and
multiple timestamped values would probably require a number of custom UDFs.
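As a rough idea of what such UDFs would have to do, here is a sketch in plain
Python rather than as real Pig UDFs. The helper names value_for and
latest_values are hypothetical, and the sample rows are invented:

```python
def value_for(option_a_tuple, column):
    """Look up a column by name in an Option A tuple; None if the row lacks it."""
    for name, value in option_a_tuple:
        if name == column:
            return value
    return None


def latest_values(option_b_tuple):
    """Collapse an Option B tuple to (column name, newest value) pairs."""
    return tuple(
        (name, max(versions, key=lambda vt: vt[1])[0])
        for name, versions in option_b_tuple
    )


# Hypothetical rows in the two proposed shapes.
row_a = (("sys.0", 15), ("user.1", 5))
row_b = (("sys.0", ((15, 200), (12, 100))), ("user.1", ((5, 150),)))

print(value_for(row_a, "user.1"))  # 5
print(value_for(row_a, "sys.1"))   # None, since this row has no sys.1 column
print(latest_values(row_b))        # (('sys.0', 15), ('user.1', 5))
```

The second helper shows that an Option B tuple can always be reduced to the
Option A shape, but not the other way around.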
I guess I lean towards option B so we can support more use cases down the road
with this refactor. Other opinions?
> Add ability to load data by column family in HBaseStorage
> ---------------------------------------------------------
>
> Key: PIG-1782
> URL: https://issues.apache.org/jira/browse/PIG-1782
> Project: Pig
> Issue Type: New Feature
> Environment: Java 6, Mac OS X 10.6
> Reporter: Eric Yang
> Assignee: Bill Graham
>
> It would be nice to load all columns in the column family by using short hand
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming the cpu column family contains the columns cpu:sys.0, cpu:sys.1,
> cpu:user.0, and cpu:user.1, CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}