[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844584#action_12844584 ]

Namit Jain commented on HIVE-705:
---------------------------------

+1

will commit if the tests pass

> Let Hive can analyse hbase's tables
> -----------------------------------
>
>                 Key: HIVE-705
>                 URL: https://issues.apache.org/jira/browse/HIVE-705
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.6.0
>            Reporter: Samuel Guo
>            Assignee: John Sichi
>             Fix For: 0.6.0
>
>         Attachments: hbase-0.19.3-test.jar, hbase-0.19.3.jar, hbase-0.20.3-test.jar, hbase-0.20.3.jar, HIVE-705.1.patch, HIVE-705.2.patch, HIVE-705.3.patch, HIVE-705.4.patch, HIVE-705.5.patch, HIVE-705.6.patch, HIVE-705.7.patch, HIVE-705_draft.patch, HIVE-705_revision806905.patch, HIVE-705_revision883033.patch, zookeeper-3.2.2.jar
>
>
> Add a serde over the hbase's tables, so that hive can analyse the data stored in hbase easily.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844293#action_12844293 ]

John Sichi commented on HIVE-705:
---------------------------------

Latest patch hits a test failure with latest trunk. I'll upload a new patch soon to fix it.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843840#action_12843840 ]

John Sichi commented on HIVE-705:
---------------------------------

Use HIVE-705.6.patch.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843826#action_12843826 ]

Namit Jain commented on HIVE-705:
---------------------------------

I am getting the following errors when I compile:

[ivy:retrieve] :: problems summary ::
[ivy:retrieve] WARNINGS
[ivy:retrieve] module not found: hadoop#hbase;${hbase.version}
[ivy:retrieve] hadoop-source: tried
[ivy:retrieve]-- artifact hadoop#hbase;${hbase.version}!hbase.tar.gz(source):
[ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hbase-${hbase.version}/hbase-${hbase.version}.tar.gz
[ivy:retrieve] apache-snapshot: tried
[ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.pom
[ivy:retrieve]-- artifact hadoop#hbase;${hbase.version}!hbase.tar.gz(source):
[ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.tar.gz
[ivy:retrieve] maven2: tried
[ivy:retrieve] http://repo1.maven.org/maven2/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.pom
[ivy:retrieve]-- artifact hadoop#hbase;${hbase.version}!hbase.tar.gz(source):
[ivy:retrieve] http://repo1.maven.org/maven2/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.tar.gz
[ivy:retrieve] ::
[ivy:retrieve] :: UNRESOLVED DEPENDENCIES ::
[ivy:retrieve] ::
[ivy:retrieve] :: hadoop#hbase;${hbase.version}: not found
[ivy:retrieve] ::
[ivy:retrieve]
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843773#action_12843773 ]

John Sichi commented on HIVE-705:
---------------------------------

Followup JIRA issues have been logged and linked to this one as related.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843702#action_12843702 ]

John Sichi commented on HIVE-705:
---------------------------------

@Jonathan: I haven't seen any patch uploaded for HIVE-806. The comments indicate that they have a way to customize the serialization per column in HBase, which could be interesting, but it's non-essential. Once HIVE-705 gets committed, I'll post a comment on HIVE-806 and ask whether they want to keep it open or abandon it.

@Namit: will do.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843680#action_12843680 ]

Namit Jain commented on HIVE-705:
---------------------------------

John, can you file the follow-up JIRAs?
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843594#action_12843594 ]

Jonathan Ellis commented on HIVE-705:
-------------------------------------

Thanks John, I read your wiki notes and it does look like this will work fine for Cassandra at least at the conceptual level.

Is HIVE-806 redundant w/ your latest patchset now?
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840993#action_12840993 ]

John Sichi commented on HIVE-705:
---------------------------------

While testing, found a few bugs in HBaseSerDe.serialize for the case where a Hive map is being converted into an HBase column family; I'll fix these together with whatever comes out of review.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840848#action_12840848 ]

John Sichi commented on HIVE-705:
---------------------------------

Prasad, the MetaHook interface is defined that way so that if a handler wants to, it can carry out the operation in a stateful fashion (e.g. if its underlying catalog supports transactions), but there is no requirement for it to keep state, and in fact the HBaseStorageHandler implementation is itself stateless (and has a NOP for three of its method implementations).

Alter table: yes, I'm planning to create a followup task for this. The original patch had alter table support in the meta hook interface too, but I trimmed it down for now to limit the scope of the first commit.
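To make the stateless pre/commit/rollback idea above concrete, here is a minimal sketch in Python. It is not the actual Hive interface (which is Java). The class and method names are illustrative, `FakeHBaseAdmin` is a hypothetical in-memory stand-in for HBase, and the choice of which three methods are NOPs is an assumption, since the comment does not say which ones they are.

```python
class FakeHBaseAdmin:
    """Hypothetical in-memory stand-in for the external catalog (illustration only)."""

    def __init__(self):
        self.tables = set()

    def create_table(self, name):
        self.tables.add(name)

    def drop_table(self, name):
        self.tables.discard(name)


class StatelessMetaHook:
    """Keeps no per-operation state; each call gets everything it needs as arguments."""

    def __init__(self, admin):
        self.admin = admin  # handle to the external system, not operation state

    def pre_create_table(self, table):
        # Create the external table before Hive records its own metadata.
        self.admin.create_table(table)

    def rollback_create_table(self, table):
        # Hive failed to record the table, so undo the external creation.
        self.admin.drop_table(table)

    def commit_create_table(self, table):
        pass  # NOP (assumed): nothing left to do once Hive has committed

    def pre_drop_table(self, table):
        pass  # NOP (assumed)

    def rollback_drop_table(self, table):
        pass  # NOP (assumed): pre_drop_table changed nothing

    def commit_drop_table(self, table):
        # Drop the external table only after Hive's own drop has succeeded.
        self.admin.drop_table(table)
```

Because rollback of a create is just "drop the table named in the argument", the hook never has to remember anything between calls, which is the point made in the comment above. A handler backed by a transactional catalog could instead hold a transaction open between the pre and commit calls.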
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840832#action_12840832 ]

Prasad Chakka commented on HIVE-705:
------------------------------------

John,

Why are the pre, commit, and rollback functions needed in MetaHook? Isn't it enough just to drop the table as a rollback for create, and to do the drop table after the Hive drop table? With the current definition, the MetaHook implementation needs to keep state around, which Hive itself doesn't do.

Also, alter table on external tables should be allowed, since the underlying storage format for external tables is not managed by Hive itself. In such cases alter table just changes metadata inside Hive.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836941#action_12836941 ]

John Sichi commented on HIVE-705:
---------------------------------

BTW, the new STORED BY 'storage-handler-class' should make it easy to plug in Cassandra.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836828#action_12836828 ]

John Sichi commented on HIVE-705:
---------------------------------

Jonathan, thanks for the input. I think we should be able to come up with a mapping feature which encompasses what you've proposed plus what's in HIVE-806 so that it will be up to the user to decide how to map a particular set of HBase tables into Hive. We can do this by allowing the HBase table name to be specified as part of mapping it into Hive. That way, you can have

  Hive t1(c1, c2) -> HBase t.cf1(c1, c2)
  Hive t2(c3, c4) -> HBase t.cf2(c3, c4)

or

  Hive t(c1,c2,c3,c4) -> HBase t(cf1(c1, c2), cf2(c3, c4))

or

  Hive t(cf1map, cf2map) -> HBase t(cf1, cf2)

or variations.

I'm going to write up a proposal in the Hive wiki and solicit feedback.
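The mapping variations listed above can be modeled with a small sketch. The `family:qualifier` entry syntax and the helper name `group_by_family` are assumptions made for illustration (loosely following the `hbase.columns.mapping` property that appears elsewhere in this thread); the proposal itself does not fix a syntax.

```python
def group_by_family(hive_columns, mapping):
    """Pair each Hive column with a 'family:qualifier' entry and group by family.

    An entry with an empty qualifier (e.g. 'cf1:') stands for a whole column
    family exposed as a single Hive map column. This syntax is an assumption.
    """
    if len(hive_columns) != len(mapping):
        raise ValueError("each Hive column needs exactly one mapping entry")
    families = {}
    for column, entry in zip(hive_columns, mapping):
        family, _, qualifier = entry.partition(":")
        families.setdefault(family, []).append((column, qualifier or None))
    return families


# Hive t(c1,c2,c3,c4) -> HBase t(cf1(c1, c2), cf2(c3, c4))
one_table = group_by_family(
    ["c1", "c2", "c3", "c4"], ["cf1:c1", "cf1:c2", "cf2:c3", "cf2:c4"]
)

# Hive t(cf1map, cf2map) -> HBase t(cf1, cf2)
family_maps = group_by_family(["cf1map", "cf2map"], ["cf1:", "cf2:"])
```

The first variation (one Hive table per column family) is the same call made twice against the same HBase table, once with only `cf1` entries and once with only `cf2` entries, which is why allowing the HBase table name to be part of the mapping covers all three layouts.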
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836323#action_12836323 ]

Jonathan Ellis commented on HIVE-705:
-------------------------------------

ISTM that merging the HBase columnfamilies into a single Hive table is the wrong approach and could lead to poor performance; rather, each HBase CF should be its own Hive table, which may of course be joined with others as necessary. (I think using the word "table" for HBase's "collection of CFs" is unfortunate in the first place since they are different animals; fundamentally, the basic unit of data access in HBase is the CF.)

I'm interested because Cassandra is also looking at adding Hive support, and we also implement a ColumnFamily data model.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806193#action_12806193 ]

John Sichi commented on HIVE-705:
---------------------------------

I'm going to start working on getting this ready for submission against latest trunk.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752340#action_12752340 ]

stephen xie commented on HIVE-705:
----------------------------------

Thanks very much to Samuel for his help. The issue above has been resolved. In a distributed test environment, the hive command must be run with the parameter --auxpath hive_contrib.jar,hbase.jar.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752325#action_12752325 ]

Samuel Guo commented on HIVE-705:
---------------------------------

@stephen

I have run the patch on my notebook, but I did not encounter the NullPointerException mentioned in your comment. Can you send me the hive log and the userlogs of the m/r job 'FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;'?

Thanks.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751898#action_12751898 ]

stephen xie commented on HIVE-705:
----------------------------------

Hi, Samuel

Before testing, I set the configuration parameter "hive.othermetadata.handlers" exactly as you described.

Thanks.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751876#action_12751876 ]

Samuel Guo commented on HIVE-705:
---------------------------------

@stephen:

Did you set the configuration parameter "hive.othermetadata.handlers" : "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"?

I am sorry, I have other things to handle these days. I will fix the bug as soon as I have time.
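The parameter value quoted above pairs a table format class with a metadata handler class, separated by a colon. As a rough sketch of how such a value could be read (an illustration in Python, not the patch's actual Java code; the function name `parse_handlers` is made up, and the comma-separated multi-entry form follows the description given elsewhere in this thread):

```python
def parse_handlers(value):
    """Parse 'format_class:handler_class,format_class:handler_class,...' into a dict."""
    handlers = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty entries such as a trailing comma
        table_format, sep, handler = pair.partition(":")
        if not sep or not table_format or not handler:
            raise ValueError(f"malformed handler entry: {pair!r}")
        handlers[table_format] = handler
    return handlers


value = (
    "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:"
    "org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"
)
handlers = parse_handlers(value)
```

A missing or misspelled entry would leave the HBase table format without a registered handler, which is one plausible reason to double-check this setting when a create or insert against an HBase-backed table fails.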
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751283#action_12751283 ]

stephen xie commented on HIVE-705:
----------------------------------

Hi, Samuel

Thanks very much for your new patch. I ran into a problem when using it as follows:

1. create table src(key int, value string);
   ok

2. LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE src;
   ok

3. CREATE TABLE hbase_table_1(key int, value string)
   ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.hbase.HBaseSerDe'
   WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:string" )
   STORED AS HBASETABLE;
   ok

4. FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;
   FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

I found the following error in the m/r map process:

java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:110)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:165)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:345)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:330)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:345)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:330)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:316)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:289)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:82)
    ... 7 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:88)
    ... 19 more
Re: [jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
This can be easily achieved via a lookup UDF in Hive. See https://issues.apache.org/jira/browse/HIVE-758 on how Hive and HBase can interact without having to write a serde.

On 8/23/09 7:43 PM, "Matt Pestritto" wrote:

> Hi All. I see a lot of good work being done on HBase/Hive integration
> especially around how to express hbase metadata in hive and how to load data
> from/to hbase/hive.
>
> Has any thought been put into how to use HBase data as lookup data in a
> query and not load all of the data as a normal hive query ?
>
> My use case is as follows: I have a table < users > with 50m users. I have
> a 5gb daily clickstream file that only touches 150k of those users on a daily
> basis. It would be much more efficient if I didn't have to load all of the
> data in HBase to a hive table and write a traditional hive query but just do
> 150k lookups in the map ( or reduce ) phase of the MR job. If the hbase
> lookups were done in realtime it would be much faster than sourcing the
> original user table with 50m rows.
>
> Thoughts ?
>
> Thanks
> -Matt
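The lookup approach discussed in this message can be illustrated with a toy sketch. This is not the HIVE-758 UDF. A plain Python dict stands in for the HBase users table, and the names `enrich_clicks` and `lookup_user` are hypothetical. The idea is that each distinct user in the clickstream costs one point lookup, so the work scales with the users actually seen rather than with the full users table.

```python
def enrich_clicks(clicks, lookup_user):
    """Attach user attributes to each click using one point lookup per distinct user."""
    cache = {}
    enriched = []
    for user_id, url in clicks:
        if user_id not in cache:
            cache[user_id] = lookup_user(user_id)  # one lookup per distinct user
        enriched.append((user_id, url, cache[user_id]))
    return enriched


# A dict stands in for the HBase-backed users table (assumption for illustration).
users = {1: "alice", 2: "bob", 3: "carol"}
clicks = [(1, "/a"), (2, "/b"), (1, "/c")]
enriched = enrich_clicks(clicks, users.get)
```

Whether this wins in practice depends on lookup latency and on how many distinct keys the job touches; with very many distinct keys, a scan-based join may still be the better choice.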
Re: [jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
Hi All,

I see a lot of good work being done on HBase/Hive integration, especially around how to express HBase metadata in Hive and how to load data between HBase and Hive. Has any thought been put into using HBase data as lookup data in a query, instead of loading all of the data as in a normal Hive query?

My use case is as follows. I have a table <users> with 50m users. I have a 5 GB daily clickstream file that touches only 150k of those users on a daily basis. It would be much more efficient if I didn't have to load all of the data in HBase into a Hive table and write a traditional Hive query, but could just do 150k lookups in the map (or reduce) phase of the MR job. If the HBase lookups were done in real time, it would be much faster than sourcing the original user table with 50m rows.

Thoughts?

Thanks
-Matt

On Sun, Aug 23, 2009 at 8:20 AM, Samuel Guo (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746592#action_12746592 ]
>
> Samuel Guo commented on HIVE-705:
>
> Attach a new patch.
> [...]
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746592#action_12746592 ]

Samuel Guo commented on HIVE-705:

Attached a new patch.

1) Moved the HBase-related code to the contrib package, since HBase is just an optional storage for Hive, not a necessary one.
I have tried to avoid modifying the original Hive code and to just add an HBase SerDe to connect Hive with HBase. But the HBase storage model is quite different from the file storage model. For example, a loadwork is used to rename/copy files from the temp dir to the target table's dir if a query's target is a Hive table. But for an HBase-backed Hive table, we can't rename a table now. So it's hard to make an HBase-backed Hive table follow the logic of a normal file-based Hive table. So I added some code (HiveFormatUtils) to distinguish a file-based table from a non-file-based table.

2) Fixed some bugs in the draft patch, such as "select *" returning nothing.

--

How to use HBase as Hive's storage?

1) Remember to add the contrib jar and the hbase jar to Hive's auxPath, so that m/r can distribute the necessary HBase-related jars to the whole Hadoop m/r cluster.

    $HIVE_HOME/bin/hive -auxPath ${contrib_jar},${hbase_jar}

2) Modify the configuration to add the following configuration parameters.

"hbase.master" : pointer to the HBase master.
"hive.othermetadata.handlers" : "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"

"hive.othermetadata.handlers" collects the metadata handlers that handle the other metadata operations for non-file-based Hive tables. Take HBase as an example: HBaseMetadataHandler will create the necessary HBase table and its column families when we create an HBase-backed Hive table from Hive's client. It also drops the HBase table when we drop the Hive table.

The metastore reads the registered handlers map from the configuration file during initialization. The registered handlers map is formatted as "table_format_classname:table_metadata_handler_classname,table_format_classname:table_metadata_handler_classname,...".

3) Enjoy "hive over hbase"!

Other problems.

1) Altering an HBase-backed Hive table is not supported now. :(

Renaming: renaming a table in HBase is not supported now, so I just do not support the rename operation. (Maybe if we rename a Hive table, we do not need to rename the underlying HBase table.)

Adding/replacing columns: now we need to specify the schema mapping in the SerDe properties explicitly. If we want to add columns, we need to call 'alter' twice: once to change the SerDe properties and once to change the Hive columns. Whichever of the two we change first, the alter will fail now, because we validate the schema mapping during SerDe initialization. One of the HBase SerDe validations checks the counts of Hive columns and HBase mapping columns. If we change the Hive columns first, the number of Hive columns will exceed the number of HBase mapping columns, and the HBase SerDe initialization will fail this alter operation. (Maybe we need to remove the validation code from HBaseSerDe initialization and do it in another place?)

2) More flexible schema mapping?
As Schubert mentioned before, more flexible schema mapping will be useful for users. This feature will be added later.

Comments are welcome~
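As an illustration of the "hive.othermetadata.handlers" format described above, the following is a minimal sketch of how such a value could be split into a format-class to handler-class map. It is not taken from the patch; the class name HandlerMapParser and its error handling are assumptions made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the HIVE-705 patch.
final class HandlerMapParser {

    /**
     * Parses "format_class:handler_class,format_class:handler_class,..."
     * into an ordered map from table format class name to handler class name.
     */
    static Map<String, String> parse(String value) {
        Map<String, String> handlers = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return handlers; // nothing registered
        }
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            String[] parts = trimmed.split(":");
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Malformed handler entry: " + entry);
            }
            handlers.put(parts[0].trim(), parts[1].trim());
        }
        return handlers;
    }

    public static void main(String[] args) {
        // The value quoted in the comment above.
        String value = "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:"
                + "org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler";
        parse(value).forEach((format, handler) ->
                System.out.println(format + " -> " + handler));
    }
}
```

Rejecting malformed entries at parse time is one possible choice; a real metastore might prefer to log and skip them instead.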
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742720#action_12742720 ]

Samuel Guo commented on HIVE-705:

@kula @stephen

Thank you all for your comments.

1) As stephen mentioned, the NullPointerException is thrown because the COLUMN_LIST is set in the wrong job configuration. I will fix it in the new patch.

2) It seems that the "select *" statement is buggy now. I will find the problem and fix it.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742704#action_12742704 ]

stephen xie commented on HIVE-705:

Hi Samuel,

I found the same problem as Kula. I changed one line in the method HiveInputFormat::getSplits:

    --- newjob.set(TableInputFormat.COLUMN_LIST, hbaseColumns);
    +++ job.set(TableInputFormat.COLUMN_LIST, hbaseColumns);

Then the Java exception above disappeared and select is OK.

But when I tested more than 2 columns, the query returned nothing.

    CREATE TABLE hbase_table_2(key int, value1 string, value2 int)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseSerDe'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:value1, cf:value2" )
    STORED AS HBASETABLE;

    FROM src2 INSERT OVERWRITE TABLE hbase_table_2 SELECT *;

The following 2 queries both returned nothing.

    select * from hbase_table_2 where value > '0';
    select * from hbase_table2;
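The one-line change above, together with Samuel's remark that COLUMN_LIST was set in the wrong job configuration, suggests the column list was written to a copy of the configuration rather than to the object the input format later reads. The patch source is not reproduced in this thread, so the sketch below is only a generic illustration of that failure mode using java.util.Properties. The key name example.column.list and the class WrongConfigDemo are hypothetical, not Hive or HBase identifiers.

```java
import java.util.Properties;

// Generic illustration only; does not use Hive or HBase classes.
final class WrongConfigDemo {

    // Hypothetical key standing in for a column-list setting.
    static final String COLUMN_LIST = "example.column.list";

    /** Mimics a component that configures itself from the configuration it is handed. */
    static int countColumns(Properties conf) {
        String columns = conf.getProperty(COLUMN_LIST); // null if the key was never set on this object
        return columns.split(" ").length;               // NullPointerException when columns is null
    }

    /** Returns an independent copy, similar in spirit to cloning a job configuration. */
    static Properties copyOf(Properties source) {
        Properties copy = new Properties();
        copy.putAll(source);
        return copy;
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        Properties newJob = copyOf(job);

        // The setting lands on the copy, not on the object used later.
        newJob.setProperty(COLUMN_LIST, "cf:value1 cf:value2");
        try {
            countColumns(job);
        } catch (NullPointerException e) {
            System.out.println("job has no column list");
        }

        // Setting it on the configuration that is actually read avoids the failure.
        job.setProperty(COLUMN_LIST, "cf:value1 cf:value2");
        System.out.println(countColumns(job));
    }
}
```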
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742697#action_12742697 ]

Kula Liao commented on HIVE-705:

Hi Samuel,

Thanks for your great job. I found some errors when testing your patch. The SQL statements are from the file "ql/src/test/queries/clienthbase/hbase_queries.q".

I created a table named "hbase_table_1" using the following statement:

    CREATE TABLE hbase_table_1(key int, value string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseSerDe'
    WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "cf:string" )
    STORED AS HBASETABLE;

OK. Then I inserted data into "hbase_table_1".

    hive> FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;
    Total MapReduce jobs = 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_200908131113_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200908131113_0002
    Kill Command = /home/stephen/hadoop-0.19.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200908131113_0002
    2009-08-13 11:17:07,162 map = 0%, reduce =0%
    2009-08-13 11:17:14,200 map = 50%, reduce =0%
    2009-08-13 11:17:15,215 map = 100%, reduce =0%
    Ended Job = job_200908131113_0002
    500 Rows loaded to hbase_table_1
    OK

When I tried to run some queries, I found the following error message:

    hive> select * from hbase_table_1 where value > '0';
    Total MapReduce jobs = 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_200908131113_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200908131113_0003
    Kill Command = /home/stephen/hadoop-0.19.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200908131113_0003
    2009-08-13 11:18:24,019 map = 0%, reduce =0%
    2009-08-13 11:18:42,146 map = 100%, reduce =100%
    Ended Job = job_200908131113_0003 with errors
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

The following message is found in the mapreduce log:

    java.lang.NullPointerException
        at org.apache.hadoop.hbase.mapred.TableInputFormat.configure(TableInputFormat.java:52)
        at org.apache.hadoop.hive.ql.io.HiveHBaseTableInputFormat.configure(HiveHBaseTableInputFormat.java:36)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputFormatFromCache(HiveInputFormat.java:184)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:211)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

Another query returned nothing:

    hive> select * from hbase_table_1;
    OK
    Time taken: 2.952 seconds
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742191#action_12742191 ]

Samuel Guo commented on HIVE-705:

@Ashish

Thank you for your comment.

It is difficult to infer the column list from an HBase table with sparse columns: we do not know exactly how many columns a given HBase table has. We only know all of its column families.

Also, the data in HBase are all raw bytes. If we do not explicitly state the schema mapping, we lose the information about how to serialize/deserialize the data to and from raw bytes.
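To make the raw-bytes point concrete, here is a small self-contained sketch showing that the same four bytes decode to different values depending on the declared type. It uses only the Java standard library; the class name RawBytesDemo is made up for this example, and it does not reflect how HBaseSerDe itself encodes values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustration only: one byte array, two possible interpretations.
final class RawBytesDemo {

    /** Interprets the bytes as UTF-8 text. */
    static String asString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Interprets the first four bytes as a big-endian 32-bit integer. */
    static int asInt(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt();
    }

    public static void main(String[] args) {
        byte[] cell = "1234".getBytes(StandardCharsets.UTF_8); // bytes 0x31 0x32 0x33 0x34

        System.out.println(asString(cell)); // prints 1234
        System.out.println(asInt(cell));    // prints 825373492 (0x31323334)
    }
}
```

Without a declared type for the column, nothing in the stored bytes says which of the two readings is intended, which is why the mapping has to be stated explicitly.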
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742106#action_12742106 ]

Ashish Thusoo commented on HIVE-705:

The data model mapping works. I have one suggestion, though: can we infer the column list of the Hive table from the HBase table instead of explicitly stating it in the create command? My concern is that adding a column family in HBase will require an alter table in Hive, and if we can avoid that it would be great.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741672#action_12741672 ]

Samuel Guo commented on HIVE-705:

@schubert,

Thank you for your comment.

>> In your patch, we found many java files are modified, it is really a big effort. I don't know if there is any way to avoid such a big modification.

An HBase table is quite different from a file in HDFS. The original Hive code is based on files. For example, when outputting the reduce results to the target table, Hive uses a FileSinkOperator to write the results to a temp file in HDFS, and uses a MoveTask to rename the temp files in HDFS into the target table dir. But when the target table is based on an HBase table, we do not need these file operations and can just write to the target HBase table. The modification of the original Java files is there to tell Hive to deal with an HBase table in a different way.

I will look into the code and try to find a way to avoid the modification.

>> 2. The performance is not good when we mapped SQL columns to HBase columns in our past experience. For example, we have a table with 20 columns, then, each read or write of a row will comprise 20 key-value operations. It is ineffective.

A good point.

The schema mapping does not affect performance when creating a Hive table. Performance is affected if an actual query fetches all the mapped columns out of the HBase table. Some code will be added to do column pruning during the HBase table scan.

For example, an HBase table (cf1:(col1, col2, col3), cf2:(col4,col5,col6), ... , cfn:(colk,colj,coll)) is mapped to a Hive table (column1, column2, column3, column4, ... ,column n). If a query "select column3, column4 from hbasedhivetable" is invoked, we should not let HBase scan out all the columns. We know all the Hive columns used in the query, map them back to the HBase columns, and get the scanning list "cf1:col3 cf2:col4". We set the scanning list "cf1:col3 cf2:col4" in the HBaseInputFormat so that HBase scans out only the useful columns. The code will be added in the new patch.

>> cf2: => {(col3, col5, col6), Default SerDe}

Cool. Let different SerDes work on different HBase columns. I will try it in the new patch.

>> Look forward to have more communication with you in Chinese, by your convenience.

My Gtalk is : sijie0...@gmail.com
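The column-pruning idea described above (map the Hive columns used by a query back to HBase columns and scan only those) could look roughly like the sketch below. The class ColumnPruner, its method name, and the use of a plain map for the schema mapping are assumptions for illustration; the actual patch code is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the pruning step only.
final class ColumnPruner {

    /**
     * Builds a space-separated HBase scan list from the Hive columns a query uses.
     *
     * @param hiveToHBase     schema mapping, Hive column name to "family:qualifier"
     * @param usedHiveColumns Hive columns referenced by the query
     */
    static String scanList(Map<String, String> hiveToHBase, List<String> usedHiveColumns) {
        List<String> hbaseColumns = new ArrayList<>();
        for (String hiveColumn : usedHiveColumns) {
            String hbaseColumn = hiveToHBase.get(hiveColumn);
            if (hbaseColumn == null) {
                throw new IllegalArgumentException("No HBase mapping for Hive column: " + hiveColumn);
            }
            if (!hbaseColumns.contains(hbaseColumn)) {
                hbaseColumns.add(hbaseColumn); // keep query order, skip duplicates
            }
        }
        return String.join(" ", hbaseColumns);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("column1", "cf1:col1");
        mapping.put("column2", "cf1:col2");
        mapping.put("column3", "cf1:col3");
        mapping.put("column4", "cf2:col4");

        // "select column3, column4 from hbasedhivetable"
        System.out.println(scanList(mapping, Arrays.asList("column3", "column4")));
        // prints: cf1:col3 cf2:col4
    }
}
```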
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741655#action_12741655 ]

He Yongqiang commented on HIVE-705:

Samuel, I am now in Shanghai attending a meeting. I will talk with you on the phone as soon as I get back. Thanks for the quick fix.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741431#action_12741431 ]

schubert zhang commented on HIVE-705:

Hi Samuel,

Thanks for your great job.

In your patch, we found that many Java files are modified; it is really a big effort. I don't know if there is any way to avoid such a big modification.

Regarding the schema mapping between an HBase table and a Hive SQL table, I have the following considerations.
1. We just want to use HBase as a scalable structured data store, or key-value store.
2. In our past experience, performance is not good when we map SQL columns to HBase columns. For example, if we have a table with 20 columns, then each read or write of a row comprises 20 key-value operations. That is inefficient.

How about considering a more flexible schema mapping:
1. One HBase column can map to multiple Hive SQL columns with a SerDe.
   e.g. cf1:q1 => {(col1, col2, col3), Default SerDe}
2. One HBase column family can map to multiple Hive SQL columns with a SerDe.
   e.g. cf2: => {(col3, col5, col6), Default SerDe}
3. Your MAP column (in the Hive table) for a sparse column family. [Optional] Since Hive is a structured data analysis front-end, we can omit this feature at the beginning.

For example:

    CREATE EXTERNAL TABLE hive_table (pkey STRING, col1 STRING, col2 INT, col2, STRING, col3 INT, col4 STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MyHBaseSerDe'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "cf1:(col1,col2,col3) with DefaultSerDe, cf2:c1 (col4) with DefaultSerDe",
    )
    STORED AS HBASETABLE
    LOCATION ''

Usually, we want a more advanced data store backend than HDFS, to achieve more flexible data placement and indexing. HBase's data model meets this requirement very well, but we may not need the full features of HBase here.

--
Look forward to having more communication with you in Chinese, at your convenience.

Schubert
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737453#action_12737453 ]

Samuel Guo commented on HIVE-705:

The key problem in letting Hive analyse HBase tables is how to map HBase's data model to Hive's SQL data model.

As we know, the hbase's data is accessed by . So a metadata mapping should be recorded in Hive's metadata, as below:

---
hbase's tablename -> hive's tablename
hbase's columns -> hive's columns
hbase's key -> hive's first column
hbase's timestamp -> hive's second column
---

The key and timestamp of the HBase table will be mapped to the *first two default columns* in Hive's table automatically. So the HBase-backed Hive table may look like <.key, .timestamp, ..., other columns defined by users>.

For example, a hbase table 'webpages', has columns . There are 2 column families, "contents" and "anchors". The content of table 'webpages' is stored in column 'contents:page_content', and the data is dense. The anchors of a given page will vary between pages, so the data in 'anchors:' will be sparse.

The columns of the HBase table will be mapped manually by programmers: we can map a full column in HBase to a *primitive_type* column in Hive, while mapping a column family in HBase to a *map_type* column in Hive.

So the Hive schema of the HBase table 'webpages' will be (.key, .timestamp, page_content, anchors).

When setting up the schema mapping between an HBase table and a Hive table, we need to consider how to record the schema mapping, how to serialize a Hive object to the HBase table, and how to deserialize HBase data to a Hive object. The proposal is to add a new HBaseSerDe that records the schema mapping in its SerDe properties, so the SerDe can use the schema mapping to serialize the Hive object to the HBase table and deserialize HBase data to a Hive object.

The properties in HBaseSerDe will be:
1) "hbase.key.type" : the type of the .key column in the Hive table, defining how to deserialize the .key field from the HBase key. (The HBase key is a byte array.)
2) "hbase.schema.mapping" : a comma-separated string defining the schema mapping. The schema will be mapped in order, one by one.

These properties should be provided when creating an HBase-backed Hive table. If "hbase.key.type" is not defined, we treat it as a string. But if "hbase.schema.mapping" is not defined, we should fail the table creation, because we do not know how to deserialize a Hive object from HBase raw bytes.

The operations on an HBase-backed Hive table are shown below:

*1. Using an existing hbase table as an external table in hive*

The 'create' command will be as below:

-
    CREATE EXTERNAL TABLE webpages(page_content STRING, anchors MAP)
    COMMENT 'This is the pages table'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
    WITH SERDEPROPERTIES (
      "hbase.key.type" = "string",
      "hbase.columns.mapping" = "contents:page_content,anchors:",
    )
    STORED AS HBASETABLE
    LOCATION ''
-

Here the hbase_table_location identifies the location of HBase and the HBase table name, such as "hbase:/hbase_master:port/hbase_tablename".

After creating an external table over an existing HBase table, we can run analysis over the table like a normal Hive table.

A. Get all the urls and their pages that were added after a specified time t1.

    SELECT .key, page_content FROM webpages WHERE .timestamp > t1;

B. Get the revisions of a specified url from a specified time t1 to a specified time t2.

    SELECT page_content FROM webpages WHERE .timestamp > t1 AND .timestamp < t2 AND .key = 'www.apache.org';

*2. Creating a new hbase table as a hive table.*

The 'create' command will be as below:

-
    CREATE TABLE webpages(page_content STRING, anchors MAP)
    COMMENT 'This is the pages table'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
    WITH SERDEPROPERTIES (
      "hbase.key.type" = "string",
      "hbase.columns.mapping" = "contents:page_content,anchors:",
    )
    STORED AS HBASETABLE
    LOCATION ''
-

After invoking the 'create' command, the Hive client will also create an HBase table in the specified HBase cluster. The created HBase table will have the two column families defined in the HBaseSerDe properties, "contents:" and "anchors:".

*3. Loading data into tables.*

As we have two default hidden columns (.key, .timestamp) in an HBase-backed Hive table, we must count these two columns when inserting data.

We can load data into an HBase-backed Hive table either by inserting data from other tables or by loading data from the local filesystem.

*A. Inserting data from other tables.*

For example, we have a 'crawled_pages' table collecting all the pages crawled from the internet. the 'crawled_pages' is simple: .

I. If we want to l
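The mapping rule proposed in this comment (a full HBase column such as "contents:page_content" maps to a primitive-type Hive column, while a bare column family such as "anchors:" maps to a map-type column) can be sketched as a small parser over the mapping string. This is a hedged illustration only: the class ColumnsMappingParser, the Kind enum, and the tolerance for a trailing comma are assumptions for this example, not the patch's actual HBaseSerDe code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for a mapping string like "contents:page_content,anchors:".
final class ColumnsMappingParser {

    /** How the mapped Hive column would be typed under the proposal. */
    enum Kind { PRIMITIVE, MAP }

    static final class Entry {
        final String family;
        final String qualifier; // empty when the whole family is mapped
        final Kind kind;

        Entry(String family, String qualifier, Kind kind) {
            this.family = family;
            this.qualifier = qualifier;
            this.kind = kind;
        }
    }

    static List<Entry> parse(String mapping) {
        List<Entry> entries = new ArrayList<>();
        for (String token : mapping.split(",")) {
            String spec = token.trim();
            if (spec.isEmpty()) {
                continue; // tolerate a stray comma
            }
            int colon = spec.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected family:qualifier or family: but got " + spec);
            }
            String family = spec.substring(0, colon);
            String qualifier = spec.substring(colon + 1);
            // No qualifier means the whole family is mapped, hence a map-type column.
            Kind kind = qualifier.isEmpty() ? Kind.MAP : Kind.PRIMITIVE;
            entries.add(new Entry(family, qualifier, kind));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : parse("contents:page_content,anchors:")) {
            System.out.println(e.family + ":" + e.qualifier + " -> " + e.kind);
        }
    }
}
```

A SerDe built along these lines could compare the number of parsed entries with the number of user-declared Hive columns before accepting the table definition.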
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736728#action_12736728 ]

Ashish Thusoo commented on HIVE-705:

Also, it would be great if you could comment on how you plan to map the HBase data model to the SQL data model (i.e. tables, columns, etc.).

This will be a cool contribution. A SerDe would be the right way to go...

Thanks,
Ashish
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736477#action_12736477 ]

Samuel Guo commented on HIVE-705:

I will add more detail about this issue later.
[jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
[ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736467#action_12736467 ]

He Yongqiang commented on HIVE-705:

Do we need to add a new SerDe for this? Can you add more to the description?