[ https://issues.apache.org/jira/browse/HIVE-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888279#comment-13888279 ]
Eric Hanson commented on HIVE-6234:
-----------------------------------

This is just getting started. I need to put this aside for a while (probably at least until the end of Feb.). I parked the latest information here on the JIRA.

> Implement fast vectorized InputFormat extension for text files
> --------------------------------------------------------------
>
>                 Key: HIVE-6234
>                 URL: https://issues.apache.org/jira/browse/HIVE-6234
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Eric Hanson
>         Attachments: HIVE-6234.02.patch, HIVE-6234.03.patch, Vectorized Text InputFormat design.docx, Vectorized Text InputFormat design.pdf, state-diagram.jpg
>
>
> Implement support for vectorized scan input of text files (plain text with configurable record and field separators). This should work for CSV files, tab delimited files, etc.
> The goal is to provide high-performance reading of these files using vectorized scans, and also to do it as an extension of existing Hive. Then, if vectorized query is enabled, existing tables based on text files will be able to benefit immediately without the need to use a different input format. After upgrading to new Hive bits that support this, faster, vectorized processing over existing text tables should just work, when vectorization is enabled.
> Another goal is to go beyond a simple layering of vectorized row batch iterator over the top of the existing row iterator. It should be possible to, say, read a chunk of data into a byte buffer (several thousand or even million rows), and then read data from it into vectorized row batches directly. Object creations should be minimized to save allocation time and GC overhead. If it is possible to save CPU for values like dates and numbers by caching the translation from string to the final data type, that should ideally be implemented.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
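
For illustration only, here is a minimal sketch of the idea in the issue description: scan a byte buffer once and write parsed values straight into per-column arrays, with no String or row object created per record. It is not taken from the attached patches or the design document. The names TextBatchSketch, LongBatch, and fillBatch are hypothetical, plain long[] arrays stand in for Hive's vectorized row batch, and every field is assumed to be an ASCII decimal integer.

```java
import java.nio.charset.StandardCharsets;

public class TextBatchSketch {

    /** Hypothetical stand-in for a vectorized row batch: one long[] per column. */
    public static final class LongBatch {
        public final long[][] cols;
        public int size; // number of complete rows currently in the batch

        public LongBatch(int numCols, int capacity) {
            cols = new long[numCols][capacity];
        }
    }

    /**
     * Parses delimited integer fields from buf[start, end) directly into the batch.
     * Stops when the batch is full or the buffer is exhausted. Returns the offset of
     * the first byte that was not consumed as part of a complete record, so the
     * caller can resume there after refilling the buffer.
     * Assumption: fields are ASCII decimal integers; other bytes are ignored.
     */
    public static int fillBatch(byte[] buf, int start, int end,
                                byte fieldSep, byte recordSep, LongBatch batch) {
        final int numCols = batch.cols.length;
        final int capacity = batch.cols[0].length;
        batch.size = 0;
        int pos = start;
        int rowStart = start; // start offset of the record being parsed
        int col = 0;
        long value = 0;
        boolean negative = false;

        while (pos < end && batch.size < capacity) {
            byte b = buf[pos++];
            if (b == fieldSep || b == recordSep) {
                // End of a field: store the accumulated value in its column.
                if (col < numCols) {
                    batch.cols[col][batch.size] = negative ? -value : value;
                }
                value = 0;
                negative = false;
                if (b == recordSep) {
                    batch.size++;   // the row is complete
                    col = 0;
                    rowStart = pos;
                } else {
                    col++;
                }
            } else if (b == '-') {
                negative = true;
            } else if (b >= '0' && b <= '9') {
                value = value * 10 + (b - '0');
            }
        }
        return rowStart;
    }

    public static void main(String[] args) {
        byte[] data = "10,20\n-3,40\n".getBytes(StandardCharsets.US_ASCII);
        LongBatch batch = new LongBatch(2, 1024);
        int next = fillBatch(data, 0, data.length, (byte) ',', (byte) '\n', batch);
        System.out.println("rows=" + batch.size + " nextOffset=" + next);
    }
}
```

A trailing record with no final record separator is left unconsumed in this sketch; the returned offset points at its start. Rows with fewer fields than columns keep whatever value the column slot already held, which a real reader would have to handle (for example with per-column null flags).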