Witold Drabicki created HIVE-25946:
--------------------------------------
Summary: select from external table pointing to MySQL returns
multiple copies of the same row
Key: HIVE-25946
URL: https://issues.apache.org/jira/browse/HIVE-25946
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 2.3.7
Environment: Hive runs on *GCP Dataproc*, image version
*1.5.56-debian10* (Hive v *2.3.7*).
*MySQL* server version is *5.7.36*.
The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar; -- identical behavior with mysql-connector-java-5.1.49{code}
Reporter: Witold Drabicki
The following table has been created in Hive:
{code:java}
CREATE EXTERNAL TABLE table_with_4_rows
(
col1 varchar(100),
col2 varchar(15),
col3 TIMESTAMP,
col4 TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
"hive.sql.dbcp.username" = "<name>",
"hive.sql.dbcp.password" = "<password>",
"hive.sql.table" = "TABLE_WITH_4_ROWS",
"hive.sql.schema" = "schema-name",
"hive.sql.query" = "select col1, col2, col3, col4 from
schema-name.TABLE_WITH_4_ROWS",
"hive.sql.numPartitions" = "1",
"hive.sql.dbcp.maxActive" = "1"
);{code}
The table in MySQL has just 4 rows, and is defined as:
{code:java}
CREATE TABLE `TABLE_WITH_4_ROWS` (
`col1` varchar(100) NOT NULL DEFAULT '',
`col2` varchar(15) DEFAULT NULL,
`col3` datetime DEFAULT NULL,
`col4` datetime DEFAULT NULL,
PRIMARY KEY (`col1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;{code}
When the cluster is *not 100% busy* and has idle containers, running *select col1,
col2 from table_with_4_rows* results in a job that uses 49 mappers and no
reducers, and returns 187 rows instead of 4 (each original row is duplicated
multiple times in the results).
Running the same select but with *WHERE col1 = 'specific-value'* also uses 49
mappers and, instead of returning 1 row, also returns duplicated data (46 to 48
rows, depending on the value).
When the cluster is *100% busy* and the job needs to reclaim containers from other
jobs, the above queries use just 1 mapper and *return correct data* (4 rows and 1
row, respectively).
Running *ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS* does not change
the results; moreover, it also works incorrectly, as it sets +numRows+ in the
table's metadata to 187 as well.
An *ArrayIndexOutOfBoundsException* (*Error during condition build*) is thrown
during query execution. Here is the relevant output from the *log* file:
{code:java}
2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not
allowed in prolog.
Continuing ...
2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|:
Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set
#tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|:
Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00
[Map 1]
2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|:
Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}]
|Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is
deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|:
Found configuration file null
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}]
|tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}]
|tez.HiveSplitGenerator|: SplitLocationProvider:
org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|:
PLAN PATH =
hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}]
|exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|:
Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0]
|dag.RootInputInitializerManager|: Starting InputInitializer for Input:
table_with_4_rows on vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:40,277 [INFO] [InputInitializer {Map 1} #0]
|tez.HiveSplitGenerator|: GenerateConsistentSplitsInHive=true
2022-02-08 20:43:40,287 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|:
vertex_1644267138354_0004_1_00 [Map 1] transitioned from NEW to INITIALIZING
due to event V_INIT
2022-02-08 20:43:40,288 [INFO] [InputInitializer {Map 1} #0]
|tez.HiveSplitGenerator|: The preferred split size is 52428800
2022-02-08 20:43:40,289 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|:
PLAN PATH =
hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|:
Processing alias table_with_4_rows
2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|:
Adding 1 inputs; the first input is
hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0]
|io.HiveInputFormat|: hive.io.file.readcolumn.ids = 0,1
2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0]
|io.HiveInputFormat|: hive.io.file.readcolumn.names = col1,col2
2022-02-08 20:43:40,300 [INFO] [InputInitializer {Map 1} #0]
|io.HiveInputFormat|: Generating splits for dirs:
hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
2022-02-08 20:43:40,642 [ERROR] [InputInitializer {Map 1} #0]
|jdbc.QueryConditionBuilder|: Error during condition build
java.lang.ArrayIndexOutOfBoundsException: 0
at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
at
org.apache.hive.storage.jdbc.QueryConditionBuilder.createConditionString(QueryConditionBuilder.java:125)
at
org.apache.hive.storage.jdbc.QueryConditionBuilder.buildCondition(QueryConditionBuilder.java:74)
at
org.apache.hive.storage.jdbc.conf.JdbcStorageConfigManager.getQueryToExecute(JdbcStorageConfigManager.java:84)
at
org.apache.hive.storage.jdbc.dao.GenericJdbcDatabaseAccessor.getTotalNumberOfRecords(GenericJdbcDatabaseAccessor.java:98)
at
org.apache.hive.storage.jdbc.JdbcInputFormat.getSplits(JdbcInputFormat.java:70)
at
org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
at
org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
at
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
at
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0]
|io.HiveInputFormat|: number of splits 49
2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0]
|tez.HiveSplitGenerator|: Number of input splits: 49. 29 available slots, 1.7
waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|:
PLAN PATH =
hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0]
|exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,714 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|:
Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,720 [INFO] [InputInitializer {Map 1} #0]
|tez.SplitGrouper|: # Src groups for split generation: 2
2022-02-08 20:43:40,723 [INFO] [InputInitializer {Map 1} #0]
|tez.SplitGrouper|: Estimated number of tasks: 49 for bucket 1
2022-02-08 20:43:40,725 [INFO] [InputInitializer {Map 1} #0]
|grouper.TezSplitGrouper|: Grouping splits in Tez
2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0]
|grouper.TezSplitGrouper|: Desired splits: 49 too small. Desired splitLength:
1972178859 Max splitLength: 1073741824 New desired splits: 90 Total length:
96636764119 Original splits: 49
2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0]
|grouper.TezSplitGrouper|: Using original number of splits: 49 desired splits:
90
2022-02-08 20:43:40,727 [INFO] [InputInitializer {Map 1} #0]
|tez.SplitGrouper|: Original split count is 49 grouped split count is 49, for
bucket: 1
2022-02-08 20:43:40,729 [INFO] [InputInitializer {Map 1} #0]
|tez.HiveSplitGenerator|: Number of split groups: 49
{code}
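The stack trace points at *java.beans.XMLDecoder.readObject* inside
*QueryConditionBuilder.createConditionString*, right after the
*SAXParseException: Content is not allowed in prolog* / *Continuing ...* lines.
That pairing is consistent with XMLDecoder being handed input that is not a
java.beans XML archive: the parse error is reported through the
ExceptionListener, no objects get decoded, and readObject then fails with
*ArrayIndexOutOfBoundsException: 0* on the empty result array. A minimal
standalone sketch of that JDK behavior (the payload string is hypothetical, just
illustrating non-XML input; it is not what Hive actually passes):
{code:java}
import java.beans.XMLDecoder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XmlDecoderDemo {

    // Decodes a payload with XMLDecoder and reports what happens.
    // For content that is not a valid XML archive, the parser reports a
    // SAXParseException to the ExceptionListener and decodes zero objects,
    // so readObject() hits an ArrayIndexOutOfBoundsException.
    public static String decode(String payload) {
        try (XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(payload.getBytes(StandardCharsets.UTF_8)),
                null,
                e -> { /* parse errors land here; default listener just logs "Continuing ..." */ })) {
            return String.valueOf(decoder.readObject());
        } catch (ArrayIndexOutOfBoundsException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // A filter-expression-like string rather than an XML archive.
        System.out.println(decode("col1 = 'specific-value'"));
        // prints ArrayIndexOutOfBoundsException
    }
}
{code}
So the AIOOBE in the log looks like a downstream symptom: whatever serialized
filter/plan string QueryConditionBuilder tries to XML-decode is not valid XML at
that point, the condition build fails, and split generation proceeds without it.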
When testing, I also created another external table pointing to a different MySQL
table with 470559 rows. That MySQL table uses partitions. All queries against it
always use 1 mapper and I have not seen duplicated rows yet, but COMPUTE
STATISTICS also works incorrectly: it sets +numRows+ to 11303242.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)