What about option A, but returning a map instead of a tuple of tuples?
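To make that concrete, here is a rough Python sketch (not Pig or HBaseStorage code) of how one row loaded with 'cpu:' might look under option A as a tuple of two-tuples, and the same row as a map. The column names come from the issue description; the row key and the values are invented for illustration.

```python
# Hypothetical row from the 'cpu:' family; the row key and values are made up.
# The row is sparse: it has no cpu:sys.1 or cpu:user.1 column.
row_key = "host1"
cells = [("cpu:sys.0", "0.12"), ("cpu:user.0", "0.40")]

# Option A as described: a tuple of (column descriptor, most recent value) pairs.
option_a = (row_key, tuple(cells))

# The same data as a map keyed by column descriptor.
option_a_map = (row_key, dict(cells))

# With a map, a column is looked up by name, and a missing column is simply absent.
sys0 = option_a_map[1].get("cpu:sys.0")
sys1 = option_a_map[1].get("cpu:sys.1")  # None: this row has no such column
```

With the tuple form, finding cpu:sys.0 means scanning the pairs; with the map it is a single lookup by name, which seems to fit the sparse-table case better.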

On Jan 27, 2011, at 5:01 PM, "Bill Graham (JIRA)" <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987839#action_12987839 ]
>
> Bill Graham commented on PIG-1782:
> ----------------------------------
>
> Assigning this to myself, since I've got a working patch, but the design
> needs to be vetted out further with this approach.
>
> One issue is that the number of columns per family per row is not constant,
> so with a sparse table you'd have no idea what column names go with each
> value of the tuple returned. Another issue is that the column name is
> actually dynamic descriptive data often times in HBase and there can be
> multiple timestamped values for a cell.
>
> * Option A:
> Instead of returning a tuple of values the load can return a tuple of tuples.
> Each inner tuple is a two-tuple that contains the column descriptor and the
> most recent value. This data structure would be returned if a 'cf:' style
> column exists in the column list, but default behavior exists with explicit
> column names. This is the simplest approach.
>
> * Option B:
> Build out an even more rich (and complex) data structure that also takes into
> account multiple values and their timestamps. A tuple of tuple of tuple of
> tuples to capture the entire HBase KeyValue data structure. Something like
> this:
>
> {code}
> (
>   ( column name, ( (value, ts), ... ) ), ...
> )
> {code}
>
> Either way, the variable length tuples returned for each row containing
> additional variable length tuples would probably require a number of custom
> UDFs to do anything useful with variable name columns and multiple
> timestamped values.
>
> I guess I lean towards option B so we can support more use cases down the
> road with this refactor. Other opinions?
>
>> Add ability to load data by column family in HBaseStorage
>> ---------------------------------------------------------
>>
>> Key: PIG-1782
>> URL: https://issues.apache.org/jira/browse/PIG-1782
>> Project: Pig
>> Issue Type: New Feature
>> Environment: Java 6, Mac OS X 10.6
>> Reporter: Eric Yang
>> Assignee: Bill Graham
>>
>> It would be nice to load all columns in the column family by using short
>> hand syntax like:
>> {noformat}
>> CpuMetrics = load 'hbase://SystemMetrics' USING
>> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
>> {noformat}
>> Assuming there are columns cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1,
>> in cpu column family.
>> CpuMetrics would contain something like:
>> {noformat}
>> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
>> {noformat}
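
For comparison, option B from the quoted comment keeps every timestamped value per column. Below is a rough Python sketch of that nested shape and of collapsing it down to a most-recent-value map. The timestamps, values, and the helper name latest_values are made up for illustration; none of this is HBaseStorage code.

```python
# Hypothetical option B row: ((column name, ((value, ts), ...)), ...).
# Values and timestamps are invented for illustration.
option_b = (
    ("cpu:sys.0", (("0.12", 1296176400), ("0.10", 1296176340))),
    ("cpu:user.0", (("0.40", 1296176400),)),
)


def latest_values(row):
    """Collapse an option B row to {column name: most recent value}."""
    return {
        name: max(versions, key=lambda v: v[1])[0]  # pick the highest timestamp
        for name, versions in row
    }


latest = latest_values(option_b)
```

So a map of column name to value could be derived from the richer structure when only the latest value matters, at the cost of carrying the extra nesting for every row.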