[jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

Bill Graham (JIRA) Fri, 28 Jan 2011 00:25:14 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987981#action_12987981
 ]


Bill Graham commented on PIG-1782:
----------------------------------

I was also thinking about a map, but I thought we might want to preserve the 
ordering of the fields when explicit fields are requested alongside CFs, as in 
Dmitriy's example. We'd also get the CF fields in the natural order that HBase 
stores them in. The more I think about it, though, the less useful that seems, 
and I think a map is the way to go. 

@Eric: Yes, Pig doesn't currently have any timestamp (ts) control on writes 
(and that should be improved), but that shouldn't rule out the ability to read 
timestamps. I can see many use cases where some non-Pig process populates 
HBase but Pig is used for queries.

@Dmitriy: I prototyped that exact use case using tuples of tuples, but ran into 
the downsides you point out. Also, each row read yields a variable number of 
tuples, which seems really difficult to work with. 

I like this approach when reading all columns in a family:

{code}
( rowKey, { col1 => ((val1, ts), ..), col2 => ((val2, ts), ..) } ) 
{code}
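
To make the shape above concrete, here is a minimal sketch in plain Python (not Pig or HBase code) that groups the cells of one row into that map. The cell values, the row key 'host1', and the function name row_as_map are all made up for illustration; only the resulting structure mirrors the notation above.

```python
from collections import defaultdict

# Hypothetical cells for one row of a column family: (qualifier, value, timestamp).
cells = [
    ("sys.0", "12", 1296201000),
    ("sys.0", "15", 1296201060),
    ("user.0", "40", 1296201000),
]

def row_as_map(row_key, cells):
    """Group cells by column qualifier into (rowKey, {col: ((val, ts), ...)})."""
    grouped = defaultdict(list)
    for qualifier, value, ts in cells:
        grouped[qualifier].append((value, ts))
    # Freeze each version list into a tuple to mirror the ((val, ts), ..) notation.
    return (row_key, {col: tuple(versions) for col, versions in grouped.items()})

row = row_as_map("host1", cells)
```

Because the columns are keyed by name, a row with more or fewer columns still has the same two-field outer shape, which is what avoids the variable-length problem of tuples of tuples.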

For Dmitriy's use case, one option is to return the same schema (always a map) 
regardless of how the column families are specified (i.e., 'cf1: cf2:foo' vs 
'cf1:' vs 'cf2:foo cf2:bar'). Another is to return a map for CFs and a 
((val1, ts), ..) for explicit columns. I'm not sure which approach would make 
life easier for the script writer.
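
A rough way to compare the two options is to look at what field shapes each would produce for a given column spec. The sketch below is plain Python, not HBaseStorage code. The helper names parse_columns and field_shapes and the labels "map" and "versions" are hypothetical, and it assumes one field per spec token, which is only one possible reading of the "always a map" option.

```python
def parse_columns(spec):
    """Split a spec like 'cf1: cf2:foo' into (family, qualifier) pairs.

    A qualifier of None means the whole column family was requested.
    """
    pairs = []
    for token in spec.split():
        family, _, qualifier = token.partition(":")
        pairs.append((family, qualifier or None))
    return pairs

def field_shapes(spec, always_map):
    """Return a shape label for each field that follows the row key."""
    shapes = []
    for family, qualifier in parse_columns(spec):
        if always_map or qualifier is None:
            shapes.append("map")        # { col => ((val, ts), ..), .. }
        else:
            shapes.append("versions")   # ((val, ts), ..)
    return shapes

uniform = field_shapes("cf1: cf2:foo", always_map=True)
mixed = field_shapes("cf1: cf2:foo", always_map=False)
```

Under the first option the script writer always dereferences a map, even for a single explicit column. Under the second, the field type depends on how each column was specified in the load statement.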


> Add ability to load data by column family in HBaseStorage
> ---------------------------------------------------------
>
>                 Key: PIG-1782
>                 URL: https://issues.apache.org/jira/browse/PIG-1782
>             Project: Pig
>          Issue Type: New Feature
>         Environment: Java 6, Mac OS X 10.6
>            Reporter: Eric Yang
>            Assignee: Bill Graham
>
> It would be nice to load all columns in a column family by using shorthand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1 in the 
> cpu column family,
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
