Witold Drabicki created HIVE-25949:
--------------------------------------
Summary: predicate pushdown does not seem to work correctly for
external tables pointing to MySQL
Key: HIVE-25949
URL: https://issues.apache.org/jira/browse/HIVE-25949
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 2.3.7
Environment: Hive runs on *GCP Dataproc,* image version is
*1.5.56-debian10* (Hive v {*}2.3.7{*})
*MySQL* server version is {*}5.7.36{*}.
The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar; (identical
behavior when using mysql-connector-java-5.1.49){code}
Reporter: Witold Drabicki
The following external table has been created:
{code:java}
CREATE EXTERNAL TABLE test_table_ext (
col1 integer,
col2 integer,
col3 bigint,
col4 decimal(38,15),
...
col13 decimal(38,15),
partition_column integer
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
"hive.sql.dbcp.username" = "<name>",
"hive.sql.dbcp.password" = "<password>",
"hive.sql.table" = "TEST_TABLE",
"hive.sql.schema" = "<schema-name>",
"hive.sql.query" = "select col1, col2, col3, col4, ..., col13,
partition_column from <schema-name>.TEST_TABLE",
"hive.sql.dbcp.maxActive" = "1",
"hive.sql.numPartitions" = "1"
);
{code}
Corresponding MySQL table:
{code:java}
CREATE TABLE TEST_TABLE (
col1 int(11),
col2 int(11),
col3 bigint(20),
col4 decimal(38,15),
...
col13 decimal(38,15),
partition_column int(11)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY LIST (partition_column)
{code}
The MySQL table has *470678* rows and *363* partitions.
Running *ANALYZE TABLE COMPUTE STATISTICS* calculates the number of rows in the
table as {*}11303242{*}, which is incorrect.
The following 2 queries have been tested:
1) *select count(*) from test_table_ext where col2 = 7602*
2) *select count(*) from test_table_ext where partition_column = 20220207 col2
= 7602*
*hive.optimize.ppd* setting is set to *true*
Query plan for query #2 is the following:
{code:java}
Plan optimized by CBO.
Vertex dependency in root stage
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_7]
Group By Operator [GBY_5] (rows=1 width=8)
Output:["_col0"],aggregations:["count(VALUE._col0)"]
<-Map 1 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_4]
Group By Operator [GBY_3] (rows=1 width=8)
Output:["_col0"],aggregations:["count()"]
Select Operator [SEL_2] (rows=1 width=0)
Filter Operator [FIL_8] (rows=1 width=0)
predicate:((UDFToDouble(partition_column) = 2.0220207E7)
and (UDFToDouble(col2) = 7602.0))
TableScan [TS_0] (rows=11303242 width=0)
wd@test_table_ext,test_table_ext,Tbl:PARTIAL,Col:NONE,Output:["col2","partition_column"]
{code}
Since query #2 is specifying partition value, expected behavior is that it
should be noticeably faster that query #1 which has to scan all partitions, but
these 2 queries take the same amount of time (150 seconds). This has been
tested multiple times.
When the same queries are executed directly in MySQL, query #2 runs noticeably
faster than query #1 (0.01 vs 0.26 seconds).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)