[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-15874.
-------------------------------
    Resolution: Not A Problem

> HBase rowkey optimization support for Hbase-Storage-handler
> -----------------------------------------------------------
>
>                 Key: SPARK-15874
>                 URL: https://issues.apache.org/jira/browse/SPARK-15874
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark-SQL use  `org.apache.hadoop.hive.hbase.HBaseStorageHandler` 
> for Hbase table support, which has poor optimization. for example, query such 
> as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause full table scan(each table region turn into a scan split and do 
> full region scan).
> In fact, it is easy to implement the following optimization:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or 
> ...;`
> can use hbase rowkey `Get`/`multiGet` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc%';`
> can use hbase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimization will benefit from such optimization, for 
> example, there is a very small table(such as incremental Data) `small_tab1`,
> SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = 
> hbase_tab1.rowkey_col`
> can use classic small-table driven join optimization:
> loop each record of small_tab1, and exact each small_tab1.key1 as 
> hbase_tab1's rowkey, and use hbase Get API, the join will execute efficiently.
> The scenario described above is very common, manay business system may have 
> several tables which has major-key such as userID, and they often store them 
> in HBase. But, several times people have requirement to do some analysis with 
> SQL, and these SQL will have good optimization if the SQL execution plan has 
> a good support to HBase rowkey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to