Re: [jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

Corbin Hoenes Thu, 27 Jan 2011 18:04:55 -0800

What about Option A, but returning a map instead of a tuple of tuples?

Sent from my iPhone


On Jan 27, 2011, at 5:01 PM, "Bill Graham (JIRA)" <j...@apache.org> wrote:

> 
> [ https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987839#action_12987839 ]
> 
> Bill Graham commented on PIG-1782:
> ----------------------------------
> 
> Assigning this to myself since I've got a working patch, but the design of
> this approach needs to be vetted further.
> 
> One issue is that the number of columns per family per row is not constant,
> so with a sparse table you'd have no idea which column name goes with each
> value of the returned tuple. Another issue is that in HBase the column name
> is often dynamic, descriptive data itself, and a cell can have multiple
> timestamped values.
> 
> * Option A:
> Instead of returning a tuple of values, the load can return a tuple of tuples.
> Each inner tuple is a two-tuple that contains the column descriptor and the
> most recent value. This data structure would be returned if a 'cf:' style
> column exists in the column list; explicit column names would keep the
> default behavior. This is the simplest approach.
> 
> * Option B:
> Build out an even richer (and more complex) data structure that also takes
> into account multiple values and their timestamps: a tuple of tuples of
> tuples of tuples that captures the entire HBase KeyValue data structure.
> Something like this:
> 
> {code}
> (
> ( column name, ( (value, ts), ... ) ), ...
> )
> {code}
> 
> Either way, each row would return a variable-length tuple containing further
> variable-length tuples, so doing anything useful with variable column names
> and multiple timestamped values would probably require a number of custom
> UDFs.
> 
> I guess I lean towards option B so we can support more use cases down the 
> road with this refactor. Other opinions?
> 
>> Add ability to load data by column family in HBaseStorage
>> ---------------------------------------------------------
>> 
>>                Key: PIG-1782
>>                URL: https://issues.apache.org/jira/browse/PIG-1782
>>            Project: Pig
>>         Issue Type: New Feature
>>        Environment: Java 6, Mac OS X 10.6
>>           Reporter: Eric Yang
>>           Assignee: Bill Graham
>> 
>> It would be nice to load all columns in a column family using shorthand
>> syntax like:
>> {noformat}
>> CpuMetrics = load 'hbase://SystemMetrics' USING 
>> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
>> {noformat}
>> Assuming the cpu column family contains the columns cpu:sys.0, cpu:sys.1,
>> cpu:user.0, and cpu:user.1, CpuMetrics would contain something like:
>> {noformat}
>> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
>> {noformat}
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
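
To make the shapes under discussion concrete, here is a rough Python sketch of Option A, the map variant suggested at the top of this message, and Option B. This is not Pig or HBaseStorage code: Python tuples and dicts only stand in for Pig tuples and maps, and the function names, column names, values, and timestamps are all made up for illustration. It also assumes "most recent" means the cell with the highest timestamp.

```python
from collections import defaultdict

# Hypothetical cells read from one row of a 'cpu' column family.
# Each cell is (column, value, timestamp); all data here is invented.
row1 = [("sys.0", 12, 100), ("sys.0", 15, 200), ("user.0", 40, 200)]
row2 = [("user.1", 7, 150)]  # a sparse row with different columns


def option_a(cells):
    """Option A: a tuple of (column, most recent value) two-tuples."""
    latest = {}
    for column, value, ts in cells:
        # Keep only the value with the highest timestamp per column.
        if column not in latest or ts > latest[column][1]:
            latest[column] = (value, ts)
    return tuple((col, latest[col][0]) for col in sorted(latest))


def option_a_as_map(cells):
    """Map variant of Option A: column name -> most recent value."""
    return dict(option_a(cells))


def option_b(cells):
    """Option B: ((column, ((value, ts), ...)), ...), newest version first."""
    versions = defaultdict(list)
    for column, value, ts in cells:
        versions[column].append((value, ts))
    return tuple(
        (col, tuple(sorted(versions[col], key=lambda vt: vt[1], reverse=True)))
        for col in sorted(versions)
    )


print(option_a(row1))         # (('sys.0', 15), ('user.0', 40))
print(option_a_as_map(row1))  # {'sys.0': 15, 'user.0': 40}
print(option_b(row1))         # (('sys.0', ((15, 200), (12, 100))), ('user.0', ((40, 200),)))
print(option_a(row2))         # (('user.1', 7),)
```

With the map variant a value can be looked up by column name directly, which in Pig would correspond to the map dereference operator (for example m#'sys.0'). With the tuple-of-tuples shapes you have to scan the inner tuples to find a column, which is where the custom UDFs mentioned in the comment would come in.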
