[ 
https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752877#action_12752877
 ] 

Schubert Zhang commented on HIVE-806:
-------------------------------------

@Zheng, we are designing and coding now, and we had a talk with Samuel a few 
days ago.  Because this is part of one of our ongoing projects, I am sorry 
the updates will not be so quick.
I describe some of our considerations below, and will update when we 
complete our implementation and verification.

1. A new HBaseInputFormat.

The current TableInputFormat always scans the whole HBase HTable, which is 
usually unnecessary, especially when we know one or more row ranges.
A new HBaseInputFormat will be implemented to provide more parameters to 
control the behavior of the HTable scan, e.g.:
(1) row ranges (one or more startRow/endRow pairs)
(2) column list (sometimes we need not read all columns; HBase is a 
column-oriented store)
(3) filter tree (predicate pushdown: filter rows/columns at the region server)
(4) maybe some computation on the region server (optional)
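The row-range idea in item (1) can be sketched standalone: each user-supplied (startRow, endRow) pair becomes its own scan, instead of one whole-table scan. This is only an illustrative sketch; the class and method names (RowRangeSplitter, toScanSpecs) are hypothetical, and a real implementation would build one HBase Scan per range inside the InputFormat.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: turn user-supplied row ranges into per-range
// scan specs, so the InputFormat scans only those ranges instead of the
// whole HTable. Names here are hypothetical, not the actual patch.
public class RowRangeSplitter {

    // A half-open row range [startRow, endRow); an empty string means unbounded.
    public static class RowRange {
        final String startRow;
        final String endRow;
        public RowRange(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    // One scan spec string per usable range, e.g. "scan[a,m)"; a real
    // implementation would construct an HBase Scan per range instead.
    public static List<String> toScanSpecs(List<RowRange> ranges) {
        List<String> specs = new ArrayList<>();
        for (RowRange r : ranges) {
            // Drop empty ranges where startRow >= endRow (both bounded).
            if (!r.startRow.isEmpty() && !r.endRow.isEmpty()
                    && r.startRow.compareTo(r.endRow) >= 0) {
                continue;
            }
            specs.add("scan[" + r.startRow + "," + r.endRow + ")");
        }
        return specs;
    }
}
```

The point of the sketch is only the shape of the API: the InputFormat receives a list of ranges and produces one bounded scan (and hence one split) per range.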

2. SerDe

We use a more flexible SerDe for engineering practice.
(1) We will support the MAP data type to map to HBase's (sparse) column 
family:column qualifiers. This is a rigid mapping between the Hive table 
schema and the HTable schema, and sometimes it is not so effective for 
structured data.
(2) Use a nested SerDe to implement the codec of the RowKey and Columns. 
Usually, the rowkey in an HTable is a combination of more than one Hive 
column; and we also support storing a column list into an HTable column 
family without using HBase's column-qualifier feature, where the columns in a 
column family are self-coded (e.g. comma-delimited).
      RowSerDe { RowKeySerDe, ColumnSerDe }

Here is an example of the above SerDe design.

CREATE TABLE t1(rowkey1 int, rowkey2 string, value1 string, value2 int, 
value3 bigint, value4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'
WITH SERDEPROPERTIES (

"rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe" 
//this will be a built-in SerDe for the rowkey
"rowkey.columns"="rowkey2,rowkey1"  //the rowkey in the HTable is a 
combination of two Hive columns
"rowkey.column.lengths"="12,2"  //the lengths of the two Hive columns in the 
rowkey
"rowkey.column.delimiter"=","  //the delimiter in the rowkey (may be omitted 
if not defined)

"column.families"="cf1:(value1,value2); cf2:(value3,value4)"  //there are two 
column families in the HTable; cf1 and cf2 have two columns respectively
"column.family.serde.class"="cf1:org.apache.hadoop.hive.serde2.hbase.SimpleColumnSerDe;
cf2:org.apache.hadoop.hive.serde2.hbase.ColumnSerDe1"  //cf1 and cf2 can use 
different SerDes
"column.family.cf1.delimiter"=","

) STORED AS HBASETABLE;

(Note: we have completed the above code and verified it.)
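The rowkey properties above (rowkey.columns=rowkey2,rowkey1 with lengths 12,2 and delimiter ',') can be illustrated with a small standalone codec. This is a hedged sketch of what a SimpleRowkeySerDe-style codec might do, not the actual Hive/HBase code; the space-padding convention is an assumption.

```java
// Standalone sketch of a fixed-length rowkey codec in the spirit of the
// SimpleRowkeySerDe described above; field order, space padding and
// delimiter handling are assumptions, not the real implementation.
public class RowkeyCodec {
    private final int[] lengths;   // fixed length of each field
    private final char delimiter;  // delimiter placed between fields

    public RowkeyCodec(int[] lengths, char delimiter) {
        this.lengths = lengths.clone();
        this.delimiter = delimiter;
    }

    // Right-pad each field with spaces to its fixed length, then join.
    public String encode(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(delimiter);
            sb.append(String.format("%-" + lengths[i] + "s", fields[i]));
        }
        return sb.toString();
    }

    // Cut the key back into fields by the fixed lengths and trim padding.
    public String[] decode(String rowkey) {
        String[] fields = new String[lengths.length];
        int pos = 0;
        for (int i = 0; i < lengths.length; i++) {
            fields[i] = rowkey.substring(pos, pos + lengths[i]).trim();
            pos += lengths[i] + 1; // skip the delimiter
        }
        return fields;
    }
}
```

Fixed lengths keep the composite rowkey sortable bytewise, which is what makes the row-range scans in section 1 useful.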

We shall also support the rigid mapping (MAP) as in HIVE-705, e.g.

CREATE TABLE hbase_table_1(rowkey1 int, rowkey2 string, value1 string, 
value2 int, abcd MAP<string, string>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'
WITH SERDEPROPERTIES (

"rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe"
"rowkey.columns"="rowkey2,rowkey1"
"rowkey.column.lengths"="12,2"
"rowkey.column.delimiter"=","

"column.families"="cf1:(value1,value2); cf2:=abcd"
"column.family.serde.class"="cf1:org.apache.hadoop.hive.serde2.hbase.SimpleColumnSerDe;
cf2:org.apache.hadoop.hive.serde2.hbase.QualiferColumnSerDe"
"column.family.cf1.delimiter"=","

) STORED AS HBASETABLE;
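The MAP<string, string> mapping for cf2 above can also be illustrated standalone: every (qualifier, value) cell in the family becomes one map entry. This is only a sketch of the idea behind the QualiferColumnSerDe name; the parallel-array input shape is a simplification of HBase's cell list, not the real client API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: deserialize one HBase column family into a Hive MAP<string,string>,
// one map entry per (qualifier, value) cell. The parallel-array input is a
// simplification of HBase's KeyValue list, not the actual API.
public class QualifierMapDecoder {

    // qualifiers[i] and values[i] hold the i-th cell of the family.
    public static Map<String, String> toHiveMap(String[] qualifiers,
                                                String[] values) {
        Map<String, String> map = new LinkedHashMap<>(); // keep cell order
        for (int i = 0; i < qualifiers.length; i++) {
            map.put(qualifiers[i], values[i]);
        }
        return map;
    }
}
```

This is why the MAP mapping suits sparse data: rows may carry different qualifier sets, and each row's map simply contains whatever cells that row has.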

3. To support direct query (scan or get) from an HBase HTable

Some straightforward queries targeting an HTable need not use MapReduce; we 
can directly scan or get rows from the HTable, since an HTable is a globally 
indexed store. We can use several features of HBase to improve performance:
(1) rowkey or rowkey ranges
(2) column list
(3) filter tree (predicate pushdown)
(4) .....

(Note: we have completed the above code and verified it.)
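The filter-tree / predicate-pushdown point can be sketched as a tiny evaluator: a tree of AND/OR nodes over column=value leaf predicates is evaluated per row, so non-matching rows are dropped before they are shipped to Hive. The interface below is illustrative only; it mimics the idea of HBase's server-side filters but is not the real Filter API.

```java
import java.util.Map;

// Illustrative filter tree for predicate pushdown: AND/OR nodes over
// column=value leaf predicates, evaluated per row. This mimics the idea of
// HBase's server-side filters; it is not the actual Filter API.
public interface RowFilter {
    boolean accept(Map<String, String> row);

    // Leaf predicate: the named column equals the given value.
    static RowFilter eq(String column, String value) {
        return row -> value.equals(row.get(column));
    }

    static RowFilter and(RowFilter a, RowFilter b) {
        return row -> a.accept(row) && b.accept(row);
    }

    static RowFilter or(RowFilter a, RowFilter b) {
        return row -> a.accept(row) || b.accept(row);
    }
}
```

A WHERE clause like value1 = 'x' AND (value3 = '1' OR value3 = '2') would become one such tree, pushed to the region server so only matching rows cross the network.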

4. Other...




> Hive with HBase as data store to support MapReduce and direct query
> -------------------------------------------------------------------
>
>                 Key: HIVE-806
>                 URL: https://issues.apache.org/jira/browse/HIVE-806
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Schubert Zhang
>
> Current Hive uses only HDFS as the underlying data store; it can query and 
> analyze files in HDFS via MapReduce.
> But in some engineering cases, our data are stored/organized/indexed in HBase 
> or other data stores. This jira-issue will implement Hive using HBase as a 
> data store. Besides supporting MapReduce on HBase, we will also support 
> direct query on HBase.
> This is a brother jira-issue of HIVE-705 (Let Hive can analyse hbase's 
> tables, https://issues.apache.org/jira/browse/HIVE-705). Because this 
> implementation and its use cases differ somewhat from HIVE-705, this 
> jira-issue was created to avoid confusion. It is possible to combine the two 
> issues in the future.
> Initial developers: Kula Liao, Stephen Xie, Tao Jiang and Schubert Zhang.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
