remove name node calls in hive by creating temporary directories
----------------------------------------------------------------
Key: HIVE-2201
URL: https://issues.apache.org/jira/browse/HIVE-2201
Project: Hive
Issue Type: Improvement
Reporter: Namit Jain
Currently, in Hive, when a file gets written by a FileSinkOperator,
the sequence of operations is as follows:
1. In tmp directory /tmp1, create a tmp file _tmp_1
2. At the end of the operator, move
/tmp1/_tmp_1 to /tmp1/1
3. Move directory /tmp1 to /tmp2
4. For all files in /tmp2, remove all files whose names start with
_tmp, as well as duplicate files.
Due to speculative execution, many temporary files are created
in /tmp1 (or /tmp2). This leads to many NameNode calls,
especially for large queries.
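The current sequence can be sketched on a local filesystem. This is only an
illustration, not Hive code: the <task>_<attempt> file naming, the helper
names, and the rule of keeping the first file seen per task are all
assumptions made for the example. A killed speculative attempt leaves its
_tmp file in /tmp1, so it travels to /tmp2 and has to be deleted there.

```python
import shutil
import tempfile
from pathlib import Path


def run_attempt(tmp1, task, attempt, finished):
    """Steps 1-2: write _tmp_<task>_<attempt>; rename it in place if the attempt finishes."""
    tmp_file = tmp1 / f"_tmp_{task}_{attempt}"
    tmp_file.write_text(f"rows from task {task}\n")
    if finished:
        tmp_file.rename(tmp1 / f"{task}_{attempt}")


def cleanup(tmp2):
    """Step 4: delete leftover _tmp files and duplicate outputs; return the number of deletions."""
    deleted = 0
    seen_tasks = set()
    for path in sorted(tmp2.iterdir()):
        name = path.name
        task = name.split("_")[0]
        if name.startswith("_tmp") or task in seen_tasks:
            path.unlink()
            deleted += 1
        else:
            # Assumption: keep the first output seen for each task.
            seen_tasks.add(task)
    return deleted


def simulate_current_protocol():
    with tempfile.TemporaryDirectory() as root:
        tmp1 = Path(root) / "tmp1"
        tmp2 = Path(root) / "tmp2"
        tmp1.mkdir()
        run_attempt(tmp1, task=1, attempt=0, finished=True)
        run_attempt(tmp1, task=1, attempt=1, finished=True)   # speculative duplicate
        run_attempt(tmp1, task=2, attempt=0, finished=True)
        run_attempt(tmp1, task=2, attempt=1, finished=False)  # killed speculative attempt
        shutil.move(str(tmp1), str(tmp2))                     # step 3
        deleted = cleanup(tmp2)
        return deleted, sorted(p.name for p in tmp2.iterdir())
```

In this sketch the cleanup has to inspect every file in /tmp2 and delete both
the leftover _tmp file and the duplicate output.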
The protocol above can be modified slightly:
1. In tmp directory /tmp1, create a tmp file _tmp_1
2. At the end of the operator, move
/tmp1/_tmp_1 to /tmp2/1
3. Move directory /tmp2 to /tmp3
4. For all files in /tmp3, remove all duplicate files.
This should reduce the number of tmp files left for the final cleanup,
since unfinished _tmp files stay behind in /tmp1 and never reach /tmp3.
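A minimal sketch of the modified sequence, again on a local filesystem and
not taken from Hive itself. The <task>_<attempt> naming, the helper names,
and the choice of which duplicate to keep are assumptions for illustration.
What eventually happens to /tmp1 is not described above, so the sketch only
reports what is left there.

```python
import shutil
import tempfile
from pathlib import Path


def run_attempt(tmp1, tmp2, task, attempt, finished):
    """Steps 1-2: write the tmp file in tmp1; move it to tmp2 only if the attempt finishes."""
    tmp_file = tmp1 / f"_tmp_{task}_{attempt}"
    tmp_file.write_text(f"rows from task {task}\n")
    if finished:
        tmp_file.rename(tmp2 / f"{task}_{attempt}")


def remove_duplicates(tmp3):
    """Step 4: only duplicates need removing, since no _tmp file reaches tmp3."""
    deleted = 0
    seen_tasks = set()
    for path in sorted(tmp3.iterdir()):
        task = path.name.split("_")[0]
        if task in seen_tasks:
            path.unlink()
            deleted += 1
        else:
            # Assumption: keep the first output seen for each task.
            seen_tasks.add(task)
    return deleted


def simulate_modified_protocol():
    with tempfile.TemporaryDirectory() as root:
        tmp1, tmp2, tmp3 = (Path(root) / n for n in ("tmp1", "tmp2", "tmp3"))
        tmp1.mkdir()
        tmp2.mkdir()
        run_attempt(tmp1, tmp2, task=1, attempt=0, finished=True)
        run_attempt(tmp1, tmp2, task=1, attempt=1, finished=True)   # speculative duplicate
        run_attempt(tmp1, tmp2, task=2, attempt=0, finished=True)
        run_attempt(tmp1, tmp2, task=2, attempt=1, finished=False)  # killed speculative attempt
        shutil.move(str(tmp2), str(tmp3))                           # step 3
        deleted = remove_duplicates(tmp3)
        final_files = sorted(p.name for p in tmp3.iterdir())
        leftovers = sorted(p.name for p in tmp1.iterdir())
        return deleted, final_files, leftovers
```

Here the unfinished _tmp file never leaves /tmp1, so the pass over /tmp3 only
has to deal with the duplicate output.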