[ 
https://issues.apache.org/jira/browse/HIVE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291282#comment-17291282
 ] 

David Mollitor commented on HIVE-24833:
---------------------------------------

[~gopalv] I have provided a trivial reproduction. Thanks!

My expectation is that anything that can be serviced with a [Get 
Request|https://hbase.apache.org/2.2/apidocs/org/apache/hadoop/hbase/client/Get.html]
 should be condensed into a FETCH task.  Anything that requires a 
[Scan 
Request|https://hbase.apache.org/2.2/apidocs/org/apache/hadoop/hbase/client/Scan.html]
 should not be (Tez should handle it), but even then, all available 
predicates should be pushed down into HBase.
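
As a rough illustration of that split, here is a sketch against a hypothetical 
table my_hbase_table, where row_key is assumed to be mapped to the HBase row 
key and some_col to an ordinary cell (both names are made up for the example 
and are not part of the reproduction):

{code:sql}
-- EQ on the row key: a single Get could answer this, so a FETCH task is reasonable.
SELECT * FROM my_hbase_table WHERE row_key = 'user1';

-- Range on the row key: needs a Scan bounded by start/stop rows; leave it to Tez.
SELECT * FROM my_hbase_table WHERE row_key >= 'user1' AND row_key < 'user2';

-- Predicate on a non-key column: needs a Scan over the table with a filter;
-- leave it to Tez, but still push the filter down into HBase where possible.
SELECT * FROM my_hbase_table WHERE some_col = 10;
{code}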

Right now, only STRING and BINARY types are pushed down.  It would be nice to 
see INT data types pushed down as well.  Looking at the history in [HIVE-1643], 
there seems to have been some concern about "needle in a haystack" searches 
across HBase (scan many rows, return very few): they put load on HBase and 
risk timing out if a scan runs for a long time without returning rows.  Still, 
it seems like we should just configure long timeouts, since filtering within 
HBase is the most efficient option, and let folks disable predicate pushdown 
on the storage handler, through the normal configurations, if HBase is really 
overheating from servicing many filters.
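
The knobs involved might look like the following sketch.  
hive.optimize.ppd.storage and hive.fetch.task.conversion are standard Hive 
properties, and hbase.client.scanner.timeout.period is the HBase client scanner 
timeout.  The timeout value shown is arbitrary, and it is an assumption that a 
session-level SET of the HBase property reaches the HBase client; depending on 
the deployment it may need to go into hbase-site.xml instead.

{code:sql}
-- Allow long-running scans (milliseconds; the value is illustrative only).
SET hbase.client.scanner.timeout.period=600000;

-- Opt out of pushing predicates down to storage handlers.
SET hive.optimize.ppd.storage=false;

-- Opt out of converting queries into FETCH tasks.
SET hive.fetch.task.conversion=none;
{code}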

> Hive Should Only Pushdown EQ Predicate on HBaseStorageHandler Fetch Task
> ------------------------------------------------------------------------
>
>                 Key: HIVE-24833
>                 URL: https://issues.apache.org/jira/browse/HIVE-24833
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: David Mollitor
>            Priority: Major
>
> I believe that a Hive query with an HBase Storage Handler incorrectly 
> applies a predicate pushdown into the storage handler.
> I observed a FETCH optimization that took a long time to complete because it 
> was performing a table scan across the entire HBase table.
> The only case in which a predicate should be pushed down to the storage 
> layer is for
> {code:sql}
> SELECT * FROM my_hbase_table WHERE row_key=?
> {code}
> This would be appropriate (EQ on the row key).  Anything else involves a 
> scan of the table, and since there is no easy way to calculate how small 
> that scan will be, it should always be passed to the compute engine (Tez).
> {code:none}
> beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
>                      tinyint_col TINYINT,
>                      smallint_col SMALLINT,
>                      int_col INT,
>                      bigint_col BIGINT,
>                      float_col FLOAT,
>                      double_col DOUBLE,
>                      boolean_col BOOLEAN)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
> "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
> TBLPROPERTIES ("hbase.table.name" = "t_hive",
>                "hbase.table.default.storage.type" = "binary",
>                "external.table.purge" = "true")
>                
>                
>                
> beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, true);
> beeline> explain select * from t_hbase where int_col=1;
> Explain
>   Plan optimized by CBO.
>   Stage-0
>   Fetch Operator
>     limit:-1
>     Select Operator [SEL_2]
>       Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
>     TableScan [TS_0]
>    Output:["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
