[ https://issues.apache.org/jira/browse/HIVE-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Elliot West updated HIVE-12860:
-------------------------------
    Target Version/s: 2.2.0  (was: 1.3.0)

> Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
> ----------------------------------------------------
>
>                 Key: HIVE-12860
>                 URL: https://issues.apache.org/jira/browse/HIVE-12860
>             Project: Hive
>          Issue Type: New Feature
>          Components: Hive
>            Reporter: Elliot West
>            Assignee: Elliot West
>
> _As a Hive user_
> _I'd like the option to seamlessly write out a header row to file-system-based result sets_
> _So that I can generate reports with a specification that mandates a header row._
>
> h3. Motivations
> There is a significant use case where Hive is used to construct a scheduled
> data processing pipeline that generates a report in HDFS for consumption by
> some third party (internal or external). This report may then be transferred
> out of the system for consumption by other tools or processes. It is not
> uncommon for the third party to specify that the report include a header row
> at the start of the file. The current options for adding headers are
> difficult to use effectively and elegantly.
>
> h3. Acceptance criteria
> * {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to
> include a header row at the start of the result set file.
> * The header row will contain the column names derived from the accompanying
> {{SELECT}} query.
> * Multiple tasks will likely be writing the final file of the query result
> set. In this event, only the task writing the first chunk of the file should
> emit the header row.
>
> h3. Proposed HQL changes
> {code}
> 1. INSERT OVERWRITE [LOCAL] DIRECTORY directory1
> 2.   [ROW FORMAT row_format] [STORED AS file_format]
> 3.   [WITH HEADER]
> 4. SELECT ... FROM ...
> {code}
> It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to
> enable this feature.
>
> h3. Current workarounds
> * It is usually suggested that users set the CLI option
> {{hive.cli.print.header=true}} and capture the result set from standard
> output. However, this does not work well in scheduled, headless environments
> such as the Oozie Hive action. It can also push the file handling into shell
> scripts and complicate the process of getting the report into HDFS.
> * To keep report processing entirely within the domain of Hive, some users
> {{UNION}} the result of their query with a tiny table of a single row
> containing the header names. A synthesised rank column is used with an
> {{ORDER BY}} to ensure that the header is written to the very start of the
> file. See [this example on Stack
> Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].
>
> h3. References
> * HIVE-138: Original request for header functionality.
> * [Hive Wiki: writing data into the file system from
> queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].
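
For illustration, the {{UNION}} workaround described under "Current workarounds" can be sketched roughly as follows. This is only a minimal sketch, not the query from the linked Stack Overflow answer: it uses Python's built-in sqlite3 module as a stand-in for Hive so that it can be run anywhere, and the table `sales`, its columns `region` and `total`, the rank column `rnk`, and the function `report_with_header` are all hypothetical names. The equivalent HiveQL may need adjusting, and the extra cast and rank column it requires are part of what makes the workaround awkward.

```python
import sqlite3


def report_with_header(rows):
    """Return report lines with a header row forced to the top.

    sqlite3 stands in for Hive here; the table and column names are
    hypothetical and exist only for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, total INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    query = """
        SELECT region, total FROM (
            -- Single header row, given rank 0 so that it sorts first.
            SELECT 0 AS rnk, 'region' AS region, 'total' AS total
            UNION ALL
            -- Data rows get rank 1; values are cast to text so that
            -- both branches of the union have the same column types.
            SELECT 1 AS rnk, region, CAST(total AS TEXT) AS total
            FROM sales
        )
        ORDER BY rnk, region
    """
    lines = [",".join(row) for row in conn.execute(query)]
    conn.close()
    return lines
```

Calling `report_with_header([("west", 20), ("east", 10)])` returns the header line `region,total` followed by the data lines `east,10` and `west,20`.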