[ https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263508#comment-16263508 ]
ASF GitHub Bot commented on DRILL-5941:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1030

> Skip header / footer logic works incorrectly for Hive tables when file has several input splits
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5941
>                 URL: https://issues.apache.org/jira/browse/DRILL-5941
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.11.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>              Labels: ready-to-commit
>             Fix For: 1.12.0
>
>
> *To reproduce*
> 1. Create a CSV file with two columns (key, value) and 3000029 rows, where the first row is a header.
> The data file should be larger than the chunk size of 256 MB. Copy the file to the distributed file system.
> 2. Create a table in Hive:
> {noformat}
> CREATE EXTERNAL TABLE `h_table`(
>   `key` bigint,
>   `value` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/tmp/h_table'
> TBLPROPERTIES (
>  'skip.header.line.count'='1');
> {noformat}
> 3. Execute the query {{select * from hive.h_table}} in Drill (querying the data through the Hive plugin). The query returns fewer rows than expected. The expected result is 3000028 rows (the total count minus the one header row).
> *The root cause*
> Since the file is larger than the default chunk size, it is split into several fragments, known as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling skip header and / or footer logic.
> Currently Drill creates a reader [for each input split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84],
> so skip header and / or footer logic is applied to every input split separately.
> Ideally, the input splits mentioned above should be read by a single reader,
> so that skip header / footer logic is applied once per file and therefore correctly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)