[ https://issues.apache.org/jira/browse/HIVE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291887#comment-17291887 ]
Panagiotis Garefalakis commented on HIVE-24833: ----------------------------------------------- Hey [~belugabehr] thanks for investigating this! Like Gopal mentioned, implementing an Estimator for each storage Handler would be the most appropriate solution (even though not the simplest one). However, I stumbled upon the non-native check above that seems to be a remainder of HIVE-3990 code where the method was just calculating the input length (0 in the case of a Non-Native one). Tbh I would expect to be more conservative here and avoid FETCH tasks on NonNative input as it could cause more harm (query running for a long time on HS2, utilizing shared resources) thoughts? > Hive Uses Fetch Task For HBaseStorageHandler Scans > -------------------------------------------------- > > Key: HIVE-24833 > URL: https://issues.apache.org/jira/browse/HIVE-24833 > Project: Hive > Issue Type: Improvement > Components: HBase Handler > Reporter: David Mollitor > Priority: Major > > I believe that a Hive query with an HBase Storage Handler is incorrectly > applies a predicate pushdown into the storage handler. > I observed a FETCH optimization that took a long time to complete because it > was performing a table scan across the entire HBase table. > The only case in which a predicate should be pushed down the storage layer is > for > {code:sql} > SELECT * FROM TABLE my_hbase_table WHERE row_key=? > {code} > This would be appropriate (EQ on the row key). Anything else will involve a > scan of the table and there is no way to easily calculate how small a scan it > will require and therefore should always be passed to the compute engine > (Tez). > {code:none} > beeline> CREATE EXTERNAL TABLE t_hbase(key STRING, > tinyint_col TINYINT, > smallint_col SMALLINT, > int_col INT, > bigint_col BIGINT, > float_col FLOAT, > double_col DOUBLE, > boolean_col BOOLEAN) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = > "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-") > TBLPROPERTIES ("hbase.table.name" = "t_hive", > "hbase.table.default.storage.type" = "binary", > "external.table.purge" = "true") > > > > beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, > true); > beeline> explain select * from t_hbase where int_col=10; > Explain > Plan optimized by CBO. > Stage-0 > Fetch Operator > limit:-1 > Select Operator [SEL_2] > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] > TableScan [TS_0] > Output: > > ["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"] > {code} > You can see for this trivial example that this is an {{external}} table, Hive > has no idea what's in it, but it is doing a Fetch + TableScan -- This message was sent by Atlassian Jira (v8.3.4#803005)