[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin closed SPARK-15874.
-------------------------------
    Resolution: Not A Problem

> HBase rowkey optimization support for Hbase-Storage-handler
> -----------------------------------------------------------
>
>                 Key: SPARK-15874
>                 URL: https://issues.apache.org/jira/browse/SPARK-15874
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query
> such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> causes a full table scan (each table region becomes a scan split and is
> scanned in full).
> In fact, the following optimizations are easy to implement:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or
> ...;`
> can use the HBase rowkey `Get`/`multiGet` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit from this. For example, given a
> very small table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 =
> hbase_tab1.rowkey_col`
> can use the classic small-table-driven join optimization:
> loop over each record of small_tab1, extract small_tab1.key1 as the rowkey
> for hbase_tab1, and use the HBase Get API, so the join executes efficiently.
> The scenario described above is very common: many business systems have
> several tables with a primary key such as userID, and they often store them
> in HBase. People frequently need to analyze this data with SQL, and such
> queries would be well optimized if the SQL execution plan had good support
> for HBase rowkeys.
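The two rowkey cases in the description can be sketched as a small planning step. This is only an illustration and not code from Spark or the Hive storage handler. The names `AccessPlan`, `prefix_stop_row`, and `plan_rowkey_access` are hypothetical, and the sketch assumes rowkeys are UTF-8 encoded strings and that the rowkey predicates have already been extracted from the query.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AccessPlan:
    """Hypothetical description of how the HBase table would be read."""
    kind: str                           # "get", "multi_get", "prefix_scan", "full_scan"
    rowkeys: Tuple[bytes, ...] = ()     # used by "get" / "multi_get"
    start_row: Optional[bytes] = None   # inclusive, used by "prefix_scan"
    stop_row: Optional[bytes] = None    # exclusive; None means "to end of table"


def prefix_stop_row(prefix: bytes) -> Optional[bytes]:
    """Return the smallest key that sorts after every key starting with `prefix`."""
    trimmed = prefix.rstrip(b"\xff")    # trailing 0xff bytes cannot be incremented
    if not trimmed:
        return None                     # no upper bound: scan to the end of the table
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def plan_rowkey_access(equals: List[str],
                       like_pattern: Optional[str] = None) -> AccessPlan:
    """Pick an access path from rowkey predicates.

    `equals` holds the literals of `rowkey_col = ...` terms joined by OR;
    `like_pattern` is the pattern of a single `rowkey_col like ...` term.
    """
    if equals:
        # De-duplicate while keeping the original order.
        keys = tuple(dict.fromkeys(k.encode("utf-8") for k in equals))
        kind = "get" if len(keys) == 1 else "multi_get"
        return AccessPlan(kind, rowkeys=keys)
    if like_pattern is not None:
        prefix, sep, rest = like_pattern.partition("%")
        # Only a pure prefix pattern such as 'abc%' maps to a bounded scan.
        if sep and rest == "" and prefix and "_" not in prefix:
            start = prefix.encode("utf-8")
            return AccessPlan("prefix_scan", start_row=start,
                              stop_row=prefix_stop_row(start))
    return AccessPlan("full_scan")      # fall back to scanning every region


if __name__ == "__main__":
    print(plan_rowkey_access(["abc"]))
    print(plan_rowkey_access(["abc", "abd"]))
    print(plan_rowkey_access([], "abc%"))
```

Under the same assumptions, the small-table-driven join would reduce to the multi-get case, with the key list taken from a non-empty `small_tab1.key1` column instead of from literals in the query.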