Arina Ielchiieva created DRILL-5941:
---------------------------------------

Summary: Skip header / footer logic works incorrectly for Hive tables when file has several input splits
Key: DRILL-5941
URL: https://issues.apache.org/jira/browse/DRILL-5941
Project: Apache Drill
Issue Type: Bug
Components: Storage - Hive
Affects Versions: 1.11.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva
Fix For: 1.12.0

*To reproduce*
1. Create a csv file with two columns (key, value) and 3000029 rows, where the first row is a header. The data file should be larger than the chunk size of 256 MB. Copy the file to the distributed file system.

2. Create a table in Hive:
{noformat}
CREATE EXTERNAL TABLE `h_table`(
  `key` bigint,
  `value` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/tmp/h_table'
TBLPROPERTIES (
  'skip.header.line.count'='1');
{noformat}

3. Execute the query {{select * from hive.h_table}} in Drill (query data using the Hive plugin). The result will contain fewer rows than expected. The expected result is 3000028 rows (total count minus one header row).

*The root cause*
Since the file is larger than the default chunk size, it is split into several fragments, known as input splits. For example:
{noformat}
maprfs:/tmp/h_table/h_table.csv:0+268435456
maprfs:/tmp/h_table/h_table.csv:268435457+492782112
{noformat}

TextHiveReader is responsible for handling skip header and / or footer logic.
Currently Drill creates a reader [for each input split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84] and the skip header and / or footer logic is applied to each input split, so data rows at the start of every split after the first are skipped as if they were headers. Ideally, the above-mentioned input splits should be read by one reader, so that the skip header / footer logic is applied correctly.
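
The intended behaviour can be illustrated with a minimal sketch. This is not Drill's actual code: {{SplitSkipPlanner}}, {{Split}} and {{SkipPlan}} are hypothetical names introduced only for illustration. The sketch assumes that splits of the same file can be grouped by path and ordered by start offset, and that the header then belongs only to the first split of a file and the footer only to the last one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Drill.
class SplitSkipPlanner {

  /** Stand-in for an input split: file path, start offset and length in bytes. */
  static final class Split {
    final String path;
    final long start;
    final long length;

    Split(String path, long start, long length) {
      this.path = path;
      this.start = start;
      this.length = length;
    }
  }

  /** Number of header / footer lines a reader should skip for one split. */
  static final class SkipPlan {
    final Split split;
    final int headerLines;
    final int footerLines;

    SkipPlan(Split split, int headerLines, int footerLines) {
      this.split = split;
      this.headerLines = headerLines;
      this.footerLines = footerLines;
    }
  }

  /**
   * Groups splits by file and assigns the header count only to the first
   * split of each file and the footer count only to the last one.
   */
  static List<SkipPlan> plan(List<Split> splits, int headerCount, int footerCount) {
    // Group splits by file path, keeping the order in which files were first seen.
    Map<String, List<Split>> byFile = new LinkedHashMap<>();
    for (Split s : splits) {
      byFile.computeIfAbsent(s.path, k -> new ArrayList<>()).add(s);
    }

    List<SkipPlan> plans = new ArrayList<>();
    for (List<Split> fileSplits : byFile.values()) {
      // Order the splits of one file by their start offset.
      fileSplits.sort(Comparator.comparingLong((Split s) -> s.start));
      for (int i = 0; i < fileSplits.size(); i++) {
        int header = (i == 0) ? headerCount : 0;
        int footer = (i == fileSplits.size() - 1) ? footerCount : 0;
        plans.add(new SkipPlan(fileSplits.get(i), header, footer));
      }
    }
    return plans;
  }
}
```

For the two splits listed above and {{skip.header.line.count}} set to 1, this plan skips one line in the split starting at offset 0 and none in the second split. The sketch does not handle a header or footer that spans a split boundary; reading all splits of a file with a single reader, as suggested above, would cover that case as well.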