Elliot West created HIVE-12860:
----------------------------------
Summary: Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
Key: HIVE-12860
URL: https://issues.apache.org/jira/browse/HIVE-12860
Project: Hive
Issue Type: New Feature
Components: Hive
Reporter: Elliot West
Assignee: Elliot West
_As a Hive user_
_I'd like the option to seamlessly write out a header row to file-system-based
result sets_
_So that I can generate reports whose specification mandates a header row._
h4. Motivations
There is a significant use-case where Hive is used to construct a scheduled
data processing pipeline that generates a report in HDFS for consumption by
some third party (internal or external). This report may then be transferred
out of the system for consumption by other tools or processes. It is not
uncommon for the third party to specify that the report includes a header row
at the start of the file. The current options for adding headers are difficult
to use effectively and elegantly.
h4. Acceptance criteria
* {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to
include a header row at the start of the result set file.
* The header row will contain the column names derived from the accompanying
{{SELECT}} query.
* Multiple tasks will likely write the final file of the query result set. In
that case, only the task writing the first chunk of the file should emit the
header row.
h4. Proposed HQL changes
{code}
1. INSERT OVERWRITE [LOCAL] DIRECTORY directory1
2. [ROW FORMAT row_format] [STORED AS file_format]
3. [WITH HEADER]
4. SELECT ... FROM ...
{code}
It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to
enable this feature.
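As an illustration only, a report query using the proposed stanza might look like the following. The path {{/reports/daily_sales}}, the table {{sales}}, and its columns are hypothetical, and {{WITH HEADER}} is the syntax proposed in this issue, not something Hive currently accepts.
{code}
-- Hypothetical use of the proposed WITH HEADER stanza.
-- This will not parse in Hive unless the feature is implemented.
INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
WITH HEADER
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
{code}
Under the acceptance criteria above, the first line of the result file would presumably read {{region,total_amount}}, followed by the data rows.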
h4. Current workarounds
* It is usually suggested that users set the CLI option
{{hive.cli.print.header=true}} and capture the result set from standard out.
However, this does not work well in scheduled, headless environments such as
the Oozie Hive action. This can also push the file handling into shell scripts
and complicate the process of getting the report into HDFS.
* To keep report processing entirely within the domain of Hive, some users
{{UNION}} the result of their query with a tiny single-row table
containing the header names. A synthesised rank column is used with an
{{ORDER BY}} to ensure that the header is written to the very start of the file. See
[this example on Stack
Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].
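A minimal sketch of the {{UNION}} workaround described above is shown below. It is not taken from the linked answer; the tables {{report_header}} and {{report_source}}, the column names, and the output path are all hypothetical. It also assumes a Hive version that allows {{ORDER BY}} on a column that is absent from the outer {{SELECT}} list, which should be verified before relying on it.
{code}
-- Hypothetical tables:
--   report_header(col1 STRING, col2 STRING)  one row holding the header names
--   report_source(name STRING, total BIGINT) the actual report data
INSERT OVERWRITE DIRECTORY '/tmp/report_with_header'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT unioned.col1, unioned.col2
FROM (
  -- sort_rank 0 marks the header row so that it sorts first
  SELECT 0 AS sort_rank, h.col1 AS col1, h.col2 AS col2
  FROM report_header h
  UNION ALL
  -- data values are cast to STRING to match the header column types
  SELECT 1 AS sort_rank, r.name AS col1, CAST(r.total AS STRING) AS col2
  FROM report_source r
) unioned
ORDER BY unioned.sort_rank;
{code}
The sketch shows why this approach is awkward: every data column has to be cast to a string to line up with the header row, and the global {{ORDER BY}} adds a sort whose only purpose is to position one row.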
h4. References
* HIVE-138: Original request for header functionality.
* [Hive Wiki: writing data into the file system from
queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].