[jira] [Commented] (HIVE-24833) Hive Uses Fetch Task For HBaseStorageHandler Scans

David Mollitor (Jira) Fri, 26 Feb 2021 06:29:04 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291680#comment-17291680
 ]


David Mollitor commented on HIVE-24833:
---------------------------------------

[~gopalv]

{code:java}
        if (table.isNonNative()) {
          return true; // nothing can be done
        }
{code}

Ah, I see.  I was reading it backwards.  I read "nothing can be done" to mean 
that "no optimization can be performed," but it actually returns true (enable 
the optimization).  It seems like this is backwards, the more conservative 
thing here would be to *not* perform an optimization like this on a table that 
Hive does not have visibility into.

Providing an estimator for HBase would not be straight forward since it's an 
external source and there's no way to look up size information.  For simplicity 
sake, it would be better to allow the Storage Handler to return a boolean.  In 
this scenario, a GET request would return true and a SCAN would return false.  
Just my 2 cents.

> Hive Uses Fetch Task For HBaseStorageHandler Scans
> --------------------------------------------------
>
>                 Key: HIVE-24833
>                 URL: https://issues.apache.org/jira/browse/HIVE-24833
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: David Mollitor
>            Priority: Major
>
> I believe that a Hive query with an HBase Storage Handler is incorrectly 
> applies a predicate pushdown into the storage handler.
> I observed a FETCH optimization that took a long time to complete because it 
> was performing a table scan across the entire HBase table.
> The only case in which a predicate should be pushed down the storage layer is 
> for
> {code:sql}
> SELECT * FROM TABLE my_hbase_table WHERE row_key=?
> {code}
> This would be appropriate (EQ on the row key).  Anything else will involve a 
> scan of the table and there is no way to easily calculate how small a scan it 
> will require and therefore should always be passed to the compute engine 
> (Tez).
> {code:none}
> beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
>                      tinyint_col TINYINT,
>                      smallint_col SMALLINT,
>                      int_col INT,
>                      bigint_col BIGINT,
>                      float_col FLOAT,
>                      double_col DOUBLE,
>                      boolean_col BOOLEAN)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
> "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
> TBLPROPERTIES ("hbase.table.name" = "t_hive",
>                "hbase.table.default.storage.type" = "binary",
>                "external.table.purge" = "true")
>                
>                
>                
> beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, 
> true);
> beeline> explain select * from t_hbase where int_col=10;
> Explain
>   Plan optimized by CBO.
>   Stage-0
>   Fetch Operator
>     limit:-1
>     Select Operator [SEL_2]
>       Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
>     TableScan [TS_0]
>    Output: 
>    
> ["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
> {code}
> You can see for this trivial example that this is an {{external}} table, Hive 
> has no idea what's in it, but it is doing a Fetch + TableScan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-24833) Hive Uses Fetch Task For HBaseStorageHandler Scans

Reply via email to