[jira] [Commented] (DRILL-5941) Skip header / footer logic works incorrectly for Hive tables when file has several input splits

ASF GitHub Bot (JIRA) Thu, 09 Nov 2017 06:01:33 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245686#comment-16245686
 ]


ASF GitHub Bot commented on DRILL-5941:
---------------------------------------

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/1030

    DRILL-5941: Skip header / footer improvements for Hive storage plugin

    Overview:
    1. When a table has a header / footer, process input splits of the same file in 
one reader (bug fix for DRILL-5941).
    2. Apply skip header logic during reader initialization only once to avoid 
checks during reading the data (DRILL-5106).
    3. Apply skip footer logic only when footer is more than 0, otherwise 
default processing will be done without buffering data in queue (DRILL-5106).
    
    Code changes:
    1. AbstractReadersInitializer was introduced to factor out common logic 
during readers initialization.
    It will have three implementations:
    a. Default (each input split gets its own reader);
    b. Empty (for empty tables);
    c. InputSplitGroups (applied when table has header / footer and input 
splits of the same file should be processed together).
    
    2. AbstractRecordsInspector was introduced to improve performance when 
the table's footer count is less than or equal to 0.
    It will have two implementations:
    a. Default (records will be processed one by one without buffering);
    b. SkipFooter (queue will be used to buffer N records that should be 
skipped at the end of file processing).
    
    3. Allow HiveAbstractReader to have multiple input splits.
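
The difference between the Default and SkipFooter inspectors described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class `SkipFooterSketch` and the method `readSkippingFooter` are hypothetical names, and records are modelled as plain list elements rather than Hive rows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class SkipFooterSketch {

  /**
   * Returns all records except the last footerCount ones.
   * Hypothetical helper for illustration only; not Drill's actual API.
   */
  static <T> List<T> readSkippingFooter(Iterable<T> records, int footerCount) {
    List<T> output = new ArrayList<>();
    if (footerCount <= 0) {
      // Default path: no footer, so emit records one by one without buffering.
      for (T record : records) {
        output.add(record);
      }
      return output;
    }
    // Skip-footer path: hold back the most recent footerCount records.
    Queue<T> buffer = new ArrayDeque<>(footerCount + 1);
    for (T record : records) {
      buffer.add(record);
      if (buffer.size() > footerCount) {
        // The oldest buffered record can no longer belong to the footer.
        output.add(buffer.poll());
      }
    }
    // Whatever is still buffered at end of input is the footer and is dropped.
    return output;
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList("r1", "r2", "r3", "footer1", "footer2");
    System.out.println(readSkippingFooter(rows, 2)); // prints [r1, r2, r3]
    System.out.println(readSkippingFooter(rows, 0)); // prints all five rows
  }
}
```

The queue is only allocated when the footer count is positive, which mirrors the stated goal of avoiding buffering in the common case where no footer is configured.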

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-5941

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1030.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1030
    
----

----


> Skip header / footer logic works incorrectly for Hive tables when file has 
> several input splits
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5941
>                 URL: https://issues.apache.org/jira/browse/DRILL-5941
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.11.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>             Fix For: 1.12.0
>
>
> *To reproduce*
> 1. Create a csv file with two columns (key, value) and 3000029 rows, where the 
> first row is a header.
> The data file size should be greater than the chunk size of 256 MB. Copy the 
> file to the distributed file system.
> 2. Create table in Hive:
> {noformat}
> CREATE EXTERNAL TABLE `h_table`(
>   `key` bigint,
>   `value` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/tmp/h_table'
> TBLPROPERTIES (
>  'skip.header.line.count'='1');
> {noformat}
> 3. Execute query {{select * from hive.h_table}} in Drill (query data using 
> Hive plugin). The result will return fewer rows than expected. Expected result 
> is 3000028 (total count minus one row as header).
> *The root cause*
> Since file is greater than default chunk size, it's split into several 
> fragments, known as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling skip header and / or footer logic.
> Currently Drill creates reader [for each input 
> split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84]
>  and skip header and / or footer logic is applied to each input split, 
> though ideally the above-mentioned input splits should be read by one 
> reader, so that skip header / footer logic is applied correctly.
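
The idea of reading all splits of one file through a single reader can be illustrated with a small grouping sketch. It is only an approximation under stated assumptions: `Split` is a hypothetical stand-in for a Hadoop file split (path, start offset, length), and `groupByPath` is an illustrative helper, not a method from Drill or Hive. The sample values are the two input splits quoted in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitGroupingSketch {

  /** Hypothetical, simplified stand-in for a file split: path, start offset, length. */
  static final class Split {
    final String path;
    final long start;
    final long length;

    Split(String path, long start, long length) {
      this.path = path;
      this.start = start;
      this.length = length;
    }
  }

  /**
   * Groups splits by file path and orders each group by start offset, so that
   * one reader per group could skip the header once at the beginning of the
   * file and the footer once at its end.
   */
  static Map<String, List<Split>> groupByPath(List<Split> splits) {
    Map<String, List<Split>> groups = new LinkedHashMap<>();
    for (Split split : splits) {
      groups.computeIfAbsent(split.path, k -> new ArrayList<>()).add(split);
    }
    for (List<Split> group : groups.values()) {
      group.sort(Comparator.comparingLong((Split s) -> s.start));
    }
    return groups;
  }

  public static void main(String[] args) {
    // The two input splits from the issue description, deliberately out of order.
    List<Split> splits = Arrays.asList(
        new Split("maprfs:/tmp/h_table/h_table.csv", 268435457L, 492782112L),
        new Split("maprfs:/tmp/h_table/h_table.csv", 0L, 268435456L));

    for (Map.Entry<String, List<Split>> entry : groupByPath(splits).entrySet()) {
      System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " split(s), one reader");
    }
  }
}
```

With one reader per split, each of the two splits above would have a line skipped as if it were a header; with one reader per group, only the first line of the file is skipped.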



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
