[ https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743864#comment-14743864 ]
Illya Yalovyy commented on HIVE-11583:
--------------------------------------

I have implemented a qtest for this issue, but it requires a rather large data file. What is the best way to submit it? It is a gzip file of 204 KB. I can attach it to the ticket.

> When PTF is used over a large partitions result could be corrupted
> ------------------------------------------------------------------
>
>                 Key: HIVE-11583
>                 URL: https://issues.apache.org/jira/browse/HIVE-11583
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>    Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
>         Environment: Hadoop 2.6 + Apache hive built from trunk
>            Reporter: Illya Yalovyy
>            Priority: Critical
>         Attachments: HIVE-11583.patch
>
>
> Dataset:
>  The window has 50001 records (2 blocks on disk and 1 block in memory).
>  The size of the second block is >32 MB (2 splits).
> Result:
>  When the last block is read from disk, only the first split is actually
>  loaded; the second split is missed. The total count of the result dataset
>  is correct, but some records are missing and others are duplicated.
> Example:
> {code:sql}
> CREATE TABLE ptf_big_src (
>   id INT,
>   key STRING,
>   grp STRING,
>   value STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO TABLE ptf_big_src;
> -- Counts per group in the source table
> SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
> ---
> -- A 25000
> -- B 20000
> -- C 5001
> ---
> CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num FROM ptf_big_src;
> -- Counts per group after the windowing function
> SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
> --
> -- A 34296
> -- B 15704
> -- C 1
> ---
> {code}
> Counts by 'grp' are incorrect!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
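
The per-group counts in the ticket show that the output is wrong, but not which rows are affected. A possible way to list the missing and duplicated records is to compare the two tables by `id`. This is only a sketch: it assumes that `id` is unique in `ptf_big_src`, which the ticket does not state, and the aliases `s`, `t`, and `copies` are illustrative names, not part of the original report.

```sql
-- Assumption: id is unique in ptf_big_src (not confirmed by the ticket).

-- Rows that appear more than once in the windowed output.
SELECT id, COUNT(*) AS copies
FROM ptf_big_trg
GROUP BY id
HAVING COUNT(*) > 1;

-- Rows that exist in the source but are absent from the windowed output.
SELECT s.id
FROM ptf_big_src s
LEFT OUTER JOIN ptf_big_trg t ON s.id = t.id
WHERE t.id IS NULL;
```

If the assumption holds, both queries should return no rows on a correct build; any rows returned on an affected version would identify the records lost or repeated when the second split is skipped.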