Gergely Nagy created HIVE-12718:
-----------------------------------

             Summary: skip.footer.line.count misbehaves on larger text files
                 Key: HIVE-12718
                 URL: https://issues.apache.org/jira/browse/HIVE-12718
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.1.0
         Environment: The bug was discovered and reproduced on a Cloudera 
Hadoop 5.4 distribution running on CentOS 6.4.
            Reporter: Gergely Nagy
            Priority: Minor


We noticed that when a table is backed by a text file large enough to require 
splitting, the table's {{skip.footer.line.count}} property misbehaves: the 
footer line is not skipped.

To reproduce, follow these steps:

1) Create a large file: {{for i in $(seq 1 100); do cat /usr/share/dict/words; 
done >large.txt}}
2) Upload it to HDFS (e.g., as {{/tmp/words}})
3) Create an external table with {{skip.footer.line.count}} set: 

{quote}
CREATE EXTERNAL TABLE ext_words (word STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE LOCATION '/tmp/words'
  tblproperties("skip.header.line.count"="1", "skip.footer.line.count"="1");
{quote}
4) Count the number of times the last line (in this example, I assume that to 
be {{ZZZ}}) appears: {{SELECT COUNT( * ) FROM ext_words WHERE word = 'ZZZ';}}
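5) Observe that it returns 100 instead of the expected 99 (the line occurs 100 
times in the file, and the final occurrence is the footer that should be skipped).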
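
Steps 1 and 5 can be scripted roughly as follows. This is only a sketch: the helper names {{make_large_file}} and {{expected_count}} are made up for illustration and are not part of Hive or Hadoop. The second helper computes the count the query should return if exactly one footer line were skipped.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction; names are illustrative only.

# make_large_file SRC COPIES OUT
# Write COPIES concatenated copies of SRC into OUT (step 1).
make_large_file() {
    src=$1; copies=$2; out=$3
    : > "$out"
    i=0
    while [ "$i" -lt "$copies" ]; do
        cat "$src" >> "$out"
        i=$((i + 1))
    done
}

# expected_count FILE
# Count the lines identical to FILE's last line, then subtract the one
# footer line that skip.footer.line.count=1 is supposed to drop.
expected_count() {
    file=$1
    last=$(tail -n 1 "$file")
    total=$(grep -c -x -F -e "$last" "$file")
    echo $((total - 1))
}
```

For the file described above, {{make_large_file /usr/share/dict/words 100 large.txt}} followed by {{expected_count large.txt}} should print 99, assuming the last word is unique within the dictionary. The upload in step 2 would then be something like {{hdfs dfs -put large.txt /tmp/words/}}, assuming {{/tmp/words}} is a directory, since the table's LOCATION names a directory.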

Investigation showed that this happens when more than one mapper is used for 
the job. If we increase the split size to force the use of a single mapper, 
the problem does not occur.
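
One way to raise the split size for a single run is sketched below. The report does not say which setting was changed, so the two property names here are an assumption: they are the standard Hadoop 2 {{FileInputFormat}} split-size properties, and whether they take effect depends on the input format Hive is configured to use. The helper name {{split_size_flags}} is made up for illustration.

```shell
#!/bin/sh
# split_size_flags BYTES
# Print --hiveconf flags that set the minimum and maximum split size to BYTES.
# Assumption: these Hadoop properties govern the splits for this table.
split_size_flags() {
    bytes=$1
    printf '%s %s\n' \
        "--hiveconf mapreduce.input.fileinputformat.split.minsize=$bytes" \
        "--hiveconf mapreduce.input.fileinputformat.split.maxsize=$bytes"
}

# Example (requires a Hive client; 1073741824 bytes = 1 GiB).
# The command substitution is left unquoted on purpose so the flags split
# into separate arguments:
# hive $(split_size_flags 1073741824) -e "SELECT COUNT(*) FROM ext_words WHERE word = 'ZZZ';"
```

Any value larger than the file size should be enough to keep the file in one split, if these are indeed the properties in effect.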

There may be other related issues as well, like the wrong line being skipped -- 
but we did not reproduce those yet.



