Jitender Kumar created HIVE-24819:
-------------------------------------

Summary: CombineHiveInputFormat format seems to be returning row count in the multiple of Maps
Key: HIVE-24819
URL: https://issues.apache.org/jira/browse/HIVE-24819
Project: Hive
Issue Type: Bug
Environment: Apache Hive (version 3.1.0.3.1.0.0-78)
    Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
Reporter: Jitender Kumar

Hi Team,

This is the first time I am filing a bug in Apache Jira, so pardon me if I am unintentionally breaking any protocols.

I am facing the following issue (on a multi-node cluster) when I set hive.tez.input.format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. For demonstration purposes, I run the same query in each of the cases below:

_select count(1) from dbname.personal_data_rc tablesample(1000 rows);_

*Case 1*
mapred.map.tasks=2
hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
*Output*
1000

*Case 2*
mapred.map.tasks=2
hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
*Output*
2000

*Case 3*
mapred.map.tasks=3
hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
*Output*
3000

After setting 3 maps as the default, the output remains the same, i.e. a multiple of 3.

Can you help me understand why, with TABLESAMPLE set to 1000 rows, the query returns more than 1000 rows? Is there another property that must be used together with CombineHiveInputFormat, or is this an issue specific to CombineHiveInputFormat?

I have tried to find a solution on my own, but in the end I had to come here. Please share your inputs as soon as possible, as one of our clients is looking for a solution or an explanation for this behavior.

For now, as a workaround, we have changed the setting to the following:

*hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat*
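
For anyone trying to reproduce this, the three cases above can be sketched as a single Beeline session. This is only a sketch: it assumes the properties are applied per session with SET, and it uses the table dbname.personal_data_rc from the query above. The counts in the comments are the ones reported in this issue, not independently verified.

```sql
-- Case 1: HiveInputFormat with 2 map tasks (reported output: 1000)
SET mapred.map.tasks=2;
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT count(1) FROM dbname.personal_data_rc TABLESAMPLE(1000 ROWS);

-- Case 2: CombineHiveInputFormat with 2 map tasks (reported output: 2000)
SET mapred.map.tasks=2;
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SELECT count(1) FROM dbname.personal_data_rc TABLESAMPLE(1000 ROWS);

-- Case 3: CombineHiveInputFormat with 3 map tasks (reported output: 3000)
SET mapred.map.tasks=3;
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SELECT count(1) FROM dbname.personal_data_rc TABLESAMPLE(1000 ROWS);
```

Running the statements in one session keeps everything except the two properties constant, so any difference in the counts should come from the input format and the number of map tasks.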