[ https://issues.apache.org/jira/browse/HIVE-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mustafa Iman updated HIVE-21924:
--------------------------------
    Attachment: HIVE-21924.2.patch
        Status: Patch Available  (was: Open)

> Split text files even if header/footer exists
> ---------------------------------------------
>
>                 Key: HIVE-21924
>                 URL: https://issues.apache.org/jira/browse/HIVE-21924
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.4.0, 4.0.0, 3.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Mustafa Iman
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21924.2.patch, HIVE-21924.patch
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/hive/blob/967a1cc98beede8e6568ce750ebeb6e0d048b8ea/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494-L503
>
> {code}
> int headerCount = 0;
> int footerCount = 0;
> if (table != null) {
>   headerCount = Utilities.getHeaderCount(table);
>   footerCount = Utilities.getFooterCount(table, conf);
>   if (headerCount != 0 || footerCount != 0) {
>     // Input file has header or footer, cannot be splitted.
>     HiveConf.setLongVar(conf, ConfVars.MAPREDMINSPLITSIZE, Long.MAX_VALUE);
>   }
> }
> {code}
> This piece of code makes CSV files (or any text files with a header/footer)
> non-splittable whenever a header or footer is present.
> If only a header is present, we can find the offset just after the first line
> break and start splitting from there. Similarly, for a footer we could read a
> few KB of data at the end of the file and find the offset of the last line
> break. Together these offsets determine the data range that can be split.
> A few extra reads during split generation are cheaper than not splitting the
> file at all.
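
The offset computation proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: `DataRangeSketch`, `dataStart`, and `dataEnd` are hypothetical names, an in-memory `byte[]` stands in for the file (a real implementation would seek on the file system stream and read only a small buffer at each end, growing it if the footer lines do not fit), and lines are assumed to end in `\n` with no line breaks embedded in quoted fields.

```java
/** Hypothetical helper, not part of Hive: finds the byte range that holds data rows. */
final class DataRangeSketch {

    /** Returns the offset just past the first {@code headerCount} line breaks. */
    static long dataStart(byte[] file, int headerCount) {
        int pos = 0;
        int skipped = 0;
        while (skipped < headerCount && pos < file.length) {
            if (file[pos++] == '\n') {
                skipped++;
            }
        }
        return pos;
    }

    /** Returns the offset at which the last {@code footerCount} lines begin. */
    static long dataEnd(byte[] file, int footerCount) {
        int pos = file.length;
        if (footerCount == 0) {
            return pos;
        }
        // A trailing newline terminates the last footer line; it does not start a new one.
        if (pos > 0 && file[pos - 1] == '\n') {
            pos--;
        }
        int found = 0;
        // Scan backwards; each line break found marks the start of one footer line.
        while (pos > 0) {
            if (file[pos - 1] == '\n' && ++found == footerCount) {
                return pos;
            }
            pos--;
        }
        return 0; // Fewer lines than footerCount: the file contains no data rows.
    }
}
```

Splits would then be generated only over the range from `dataStart` to `dataEnd`, so that no reader sees header or footer bytes and the minimum split size no longer needs to be forced to `Long.MAX_VALUE`. Compressed text files would still need separate handling, since byte offsets in the compressed stream do not correspond to line breaks.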