[ 
https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248259#comment-16248259
 ] 

ASF GitHub Bot commented on DRILL-5941:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @arina-ielchiieva I have a question. If the splits for a file are spread 
across multiple fragments, does this logic work ?



> Skip header / footer logic works incorrectly for Hive tables when file has 
> several input splits
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5941
>                 URL: https://issues.apache.org/jira/browse/DRILL-5941
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.11.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>             Fix For: Future
>
>
> *To reproduce*
> 1. Create csv file with two columns (key, value) for 3000029 rows, where 
> first row is a header.
> The data file has size of should be greater than chunk size of 256 MB. Copy 
> file to the distributed file system.
> 2. Create table in Hive:
> {noformat}
> CREATE EXTERNAL TABLE `h_table`(
>   `key` bigint,
>   `value` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/tmp/h_table'
> TBLPROPERTIES (
>  'skip.header.line.count'='1');
> {noformat}
> 3. Execute query {{select * from hive.h_table}} in Drill (query data using 
> Hive plugin). The result will return less rows then expected. Expected result 
> is 3000028 (total count minus one row as header).
> *The root cause*
> Since file is greater than default chunk size, it's split into several 
> fragments, known as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling skip header and / or footer logic.
> Currently Drill creates reader [for each input 
> split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84]
>  and skip header and /or footer logic is applied for each input splits, 
> though ideally the above mentioned input splits should have been read by one 
> reader, so skip / header footer logic was applied correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to