[ 
https://issues.apache.org/jira/browse/HIVE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-24833:
----------------------------------
    Description: 
I believe that a Hive query with an HBase Storage Handler is incorrectly 
applies a predicate pushdown into the storage handler.

I observed a FETCH optimization that took a long time to complete because it 
was performing a table scan across the entire HBase table.

The only case in which a predicate should be pushed down the storage layer is 
for

{code:sql}
SELECT * FROM TABLE my_hbase_table WHERE row_key=?
{code}

This would be appropriate (EQ on the row key).  Anything else will involve a 
scan of the table and there is no way to easily calculate how small a scan it 
will require and therefore should always be passed to the compute engine (Tez).

{code:none}
beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
                     tinyint_col TINYINT,
                     smallint_col SMALLINT,
                     int_col INT,
                     bigint_col BIGINT,
                     float_col FLOAT,
                     double_col DOUBLE,
                     boolean_col BOOLEAN)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
"cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
TBLPROPERTIES ("hbase.table.name" = "t_hive",
               "hbase.table.default.storage.type" = "binary",
               "external.table.purge" = "true")
               
               
               
beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, 
true);

beeline> explain select * from t_hbase where int_col=10;

Explain
  Plan optimized by CBO.
  Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
    TableScan [TS_0]
   Output: 
   
["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
{code}

You can see for this trivial example that this is an {{external}} table, Hive 
has no idea what's in it, but it is doing a Fetch + TableScan

  was:
I believe that a Hive query with an HBase Storage Handler is incorrectly 
applies a predicate pushdown into the storage handler.

I observed a FETCH optimization that took a long time to complete because it 
was performing a table scan across the entire HBase table.

The only case in which a predicate should be pushed down the storage layer is 
for

{code:sql}
SELECT * FROM TABLE my_hbase_table WHERE row_key=?
{code}

This would be appropriate (EQ on the row key).  Anything else will involve a 
scan of the table and there is no way to easily calculate how small a scan it 
will require and therefore should always be passed to the compute engine (Tez).

{code:none}
beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
                     tinyint_col TINYINT,
                     smallint_col SMALLINT,
                     int_col INT,
                     bigint_col BIGINT,
                     float_col FLOAT,
                     double_col DOUBLE,
                     boolean_col BOOLEAN)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
"cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
TBLPROPERTIES ("hbase.table.name" = "t_hive",
               "hbase.table.default.storage.type" = "binary",
               "external.table.purge" = "true")
               
               
               
beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, 
true);

beeline> explain select * from t_hbase where int_col=1;

Explain
  Plan optimized by CBO.
  Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
    TableScan [TS_0]
   Output: 
   
["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
{code}


> Hive Should Only Pushdown EQ Predicate on HBaseStorageHandler Fetch Task
> ------------------------------------------------------------------------
>
>                 Key: HIVE-24833
>                 URL: https://issues.apache.org/jira/browse/HIVE-24833
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: David Mollitor
>            Priority: Major
>
> I believe that a Hive query with an HBase Storage Handler is incorrectly 
> applies a predicate pushdown into the storage handler.
> I observed a FETCH optimization that took a long time to complete because it 
> was performing a table scan across the entire HBase table.
> The only case in which a predicate should be pushed down the storage layer is 
> for
> {code:sql}
> SELECT * FROM TABLE my_hbase_table WHERE row_key=?
> {code}
> This would be appropriate (EQ on the row key).  Anything else will involve a 
> scan of the table and there is no way to easily calculate how small a scan it 
> will require and therefore should always be passed to the compute engine 
> (Tez).
> {code:none}
> beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
>                      tinyint_col TINYINT,
>                      smallint_col SMALLINT,
>                      int_col INT,
>                      bigint_col BIGINT,
>                      float_col FLOAT,
>                      double_col DOUBLE,
>                      boolean_col BOOLEAN)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
> "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
> TBLPROPERTIES ("hbase.table.name" = "t_hive",
>                "hbase.table.default.storage.type" = "binary",
>                "external.table.purge" = "true")
>                
>                
>                
> beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, 
> true);
> beeline> explain select * from t_hbase where int_col=10;
> Explain
>   Plan optimized by CBO.
>   Stage-0
>   Fetch Operator
>     limit:-1
>     Select Operator [SEL_2]
>       Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
>     TableScan [TS_0]
>    Output: 
>    
> ["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
> {code}
> You can see for this trivial example that this is an {{external}} table, Hive 
> has no idea what's in it, but it is doing a Fetch + TableScan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to