Eric Hanson created HIVE-6234:
---------------------------------
Summary: Implement fast vectorized InputFormat extension for text files
Key: HIVE-6234
URL: https://issues.apache.org/jira/browse/HIVE-6234
Project: Hive
Issue Type: Sub-task
Reporter: Eric Hanson
Assignee: Eric Hanson
Implement support for vectorized scan input of text files (plain text with
configurable record and field separators). This should work for CSV files,
tab-delimited files, etc.
The goal is to provide high-performance reading of these files using vectorized
scans, and to do it as an extension of existing Hive. Then, if vectorized query
execution is enabled, existing tables based on text files will benefit
immediately without needing to switch to a different input format.
Another goal is to go beyond simply layering a vectorized row batch iterator
on top of the existing row iterator. It should be possible to, say, read
a chunk of data into a byte buffer (several thousand or even a million rows), and
then read data from it into vectorized row batches directly. Object creation
should be minimized to save allocation time and GC overhead. If it is possible
to save CPU for values like dates and numbers by caching the translation from
string to the final data type, that should ideally be implemented.
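
To make the buffer-to-batch idea concrete, here is a minimal sketch in plain
Java. It is not Hive code: DelimitedBatchSketch, LongBatch, fill and parseLong
are hypothetical names chosen for illustration, and the sketch assumes every
column holds a decimal integer. It covers only the direct byte-buffer parsing
step, not the caching of string-to-type translations, and it has no null or
error handling.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of parsing a byte buffer straight into column arrays. */
public class DelimitedBatchSketch {

    /** Column-major batch: one primitive array per column, reused across calls. */
    static final class LongBatch {
        final long[][] cols;
        int size; // number of valid rows after the last fill()

        LongBatch(int numCols, int capacity) {
            cols = new long[numCols][capacity];
        }
    }

    /**
     * Parses complete records from buf[start, end) into the batch.
     * Returns the offset of the first unconsumed byte, so a trailing
     * partial record can be carried over to the next buffer refill.
     */
    static int fill(byte[] buf, int start, int end,
                    byte fieldSep, byte recordSep, LongBatch batch) {
        batch.size = 0;
        int capacity = batch.cols[0].length;
        int pos = start;
        while (pos < end && batch.size < capacity) {
            // Locate the record terminator first; stop if the record is incomplete.
            int recEnd = pos;
            while (recEnd < end && buf[recEnd] != recordSep) {
                recEnd++;
            }
            if (recEnd == end) {
                break;
            }
            int col = 0;
            int fieldStart = pos;
            for (int i = pos; i <= recEnd; i++) {
                if (i == recEnd || buf[i] == fieldSep) {
                    if (col < batch.cols.length) {
                        batch.cols[col][batch.size] = parseLong(buf, fieldStart, i);
                    }
                    col++;
                    fieldStart = i + 1;
                }
            }
            // Zero-fill missing trailing columns; a real reader would mark them null.
            for (int c = col; c < batch.cols.length; c++) {
                batch.cols[c][batch.size] = 0L;
            }
            batch.size++;
            pos = recEnd + 1;
        }
        return pos;
    }

    /** Parses an optionally signed decimal integer from bytes without creating a String. */
    static long parseLong(byte[] buf, int from, int to) {
        boolean negative = from < to && buf[from] == '-';
        long value = 0;
        for (int i = negative ? from + 1 : from; i < to; i++) {
            value = value * 10 + (buf[i] - '0');
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] data = "1,2\n30,-4\n".getBytes(StandardCharsets.US_ASCII);
        LongBatch batch = new LongBatch(2, 1024);
        int consumed = fill(data, 0, data.length, (byte) ',', (byte) '\n', batch);
        System.out.println(batch.size + " rows, " + consumed + " bytes consumed");
    }
}
```

The batch is allocated once and refilled on each call, so the inner loop
creates no per-row or per-field objects. This is the allocation and GC saving
described above.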