[ 
https://issues.apache.org/jira/browse/HIVE-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elliot West updated HIVE-12860:
-------------------------------
    Target Version/s: 2.2.0  (was: 1.3.0)

> Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
> ----------------------------------------------------
>
>                 Key: HIVE-12860
>                 URL: https://issues.apache.org/jira/browse/HIVE-12860
>             Project: Hive
>          Issue Type: New Feature
>          Components: Hive
>            Reporter: Elliot West
>            Assignee: Elliot West
>
> _As a Hive user_
> _I'd like the option to seamlessly write out a header row to 
> file-system-based result sets_
> _So that I can generate reports with a specification that mandates a header 
> row._
> h3. Motivations
> There is a significant use-case where Hive is used to construct a scheduled 
> data processing pipeline that generates a report in HDFS for consumption by 
> some third party (internal or external). This report may then be transferred 
> out of the system for consumption by other tools or processes. It is not 
> uncommon for the third party to specify that the report includes a header row 
> at the start of the file. The current options for adding headers are 
> difficult to use effectively and elegantly.
> h3. Acceptance criteria
> * {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to 
> include a header row at the start of the result set file.
> * The header row will contain the column names derived from the accompanying 
> {{SELECT}} query.
> * Multiple tasks will likely write the final file of the query result set. 
> In that case, only the task writing the first chunk of the file should emit 
> the header row.
> h3. Proposed HQL changes
> {code}
> 1.  INSERT OVERWRITE [LOCAL] DIRECTORY directory1
> 2.    [ROW FORMAT row_format] [STORED AS file_format]
> 3.    [WITH HEADER]
> 4.    SELECT ... FROM ...
> {code}
> It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to 
> enable this feature.
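> For illustration only, a statement using the proposed syntax might look like 
> the sketch below. The {{WITH HEADER}} clause is the proposal itself and is 
> not implemented in Hive; the output path, the {{sales}} table and its columns 
> are hypothetical.
> {code}
> -- Proposed syntax only: WITH HEADER does not exist in Hive today.
> -- The path, table and column names are hypothetical.
> INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>   WITH HEADER
>   SELECT region, total FROM sales;
> {code}
> Under the acceptance criteria above, the first line of the output would be 
> expected to carry the column names {{region}} and {{total}}, followed by the 
> data rows.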
> h3. Current workarounds
> * It is usually suggested that users set the CLI option 
> {{hive.cli.print.header=true}} and capture the result set from standard out. 
> However, this does not work well in scheduled, headless environments such as 
> the Oozie Hive action. This can also push the file handling into shell 
> scripts and complicate the process of getting the report into HDFS.
> * To keep report processing entirely within the domain of Hive, some users 
> {{UNION}} the result of their query with a tiny table of a single row 
> containing the header names. A synthesised rank column is used with an 
> {{ORDER BY}} to ensure that the header is written to the very start of the 
> file. See [this example on Stack 
> Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].
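> A minimal sketch of the {{UNION}} workaround, assuming a hypothetical 
> {{sales}} table with a string column {{region}} and a numeric column 
> {{total}}. The output path and the {{sort_rank}} column name are also 
> illustrative, and the {{FROM}}-less header branch stands in for the one-row 
> header table on Hive versions that allow a {{SELECT}} without {{FROM}}:
> {code}
> -- Hypothetical table, columns and path; adapt to the real report query.
> INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>   SELECT t.region, t.total
>   FROM (
>     -- Header row: literal column names, ranked first.
>     SELECT 0 AS sort_rank, 'region' AS region, 'total' AS total
>     UNION ALL
>     -- Data rows: cast to STRING so both branches have matching types.
>     SELECT 1 AS sort_rank, s.region, CAST(s.total AS STRING) AS total
>     FROM sales s
>   ) t
>   -- The synthesised rank sorts the header ahead of every data row.
>   ORDER BY t.sort_rank;
> {code}
> This shows why the workaround is awkward: every column has to be cast to a 
> string to match the header row, and the global {{ORDER BY}} funnels the whole 
> result through a single reducer. Some Hive versions may also reject ordering 
> by a column that is not in the select list, in which case the query needs 
> further restructuring.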
> h3. References
> * HIVE-138: Original request for header functionality.
> * [Hive Wiki: writing data into the file system from 
> queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
