Roshan Naik created HIVE-4196:
---------------------------------

             Summary: Support for Streaming Partitions in Hive
                 Key: HIVE-4196
                 URL: https://issues.apache.org/jira/browse/HIVE-4196
             Project: Hive
          Issue Type: New Feature
          Components: Database/Schema, HCatalog
    Affects Versions: 0.10.1
            Reporter: Roshan Naik


Motivation: Allow Hive users to immediately query data streaming in through 
clients such as Flume.


Currently Hive partitions must be created after all the data for the partition 
is available. Thereafter, data in the partitions is considered immutable. 

This proposal introduces the notion of a streaming partition into which new 
files can be committed periodically and made available for queries before the 
partition is closed and converted into a standard partition.

The admin enables streaming partitions on a table using DDL, providing the 
following pieces of information:
- Name of the partition in the table on which streaming is enabled
- Frequency at which the streaming partition should be closed and converted 
into a standard partition.

Tables with streaming partitions enabled will be partitioned by one and only 
one column. It is assumed that this column will contain a timestamp.

Closing the current streaming partition converts it into a standard partition. 
Based on the specified frequency, the current streaming partition is closed 
and a new one is created for future writes. This is referred to as 'rolling 
the partition'.


A streaming partition's life cycle is as follows:

 - A new streaming partition is instantiated for writes

 - Streaming clients request (via webhcat) an HDFS file name into which they 
can write a chunk of records for a specific table.

 - Streaming clients write a chunk (via webhdfs) to that file and commit it 
(via webhcat). Committing merely indicates that the chunk has been written 
completely and is ready for serving queries.

 - When the partition is rolled, all committed chunks are swept into a single 
directory and a standard partition pointing to that directory is created. The 
streaming partition is closed and a new streaming partition is created. 
Rolling the partition is atomic. Streaming clients are agnostic of partition 
rolling.

 - Hive queries will be able to query the partition that is currently open for 
streaming. Only committed chunks will be visible. Read consistency will be 
ensured so that repeated reads of the same partition return the same data for 
the lifespan of the query.
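
The life cycle above can be sketched as a small in-memory model. This is only 
an illustration of the intended semantics, not the proposed implementation: 
the class, method names, and file-name layout are hypothetical, the WebHCat 
and WebHDFS calls are replaced by plain method calls, and carrying uncommitted 
chunks over to the next streaming partition is an assumption, since the 
proposal does not say what happens to them at roll time.

```python
import itertools


class StreamingPartition:
    """Toy in-memory model of one streaming partition (all names hypothetical)."""

    def __init__(self, table, value):
        self.table = table
        self.value = value      # value of the single (timestamp) partition column
        self.pending = {}       # chunk file name -> records, not yet visible
        self.committed = {}     # chunk file name -> records, visible to queries
        self._ids = itertools.count(1)

    def request_chunk_file(self):
        """Stand-in for the webhcat call that hands out an HDFS file name."""
        name = f"/streaming/{self.table}/{self.value}/chunk-{next(self._ids):05d}"
        self.pending[name] = []
        return name

    def write(self, name, records):
        """Stand-in for the webhdfs write; the data stays invisible to queries."""
        self.pending[name].extend(records)

    def commit(self, name):
        """Mark a completely written chunk as ready for serving queries."""
        self.committed[name] = self.pending.pop(name)

    def snapshot(self):
        """Freeze the set of committed chunks, e.g. when a query starts."""
        return tuple(sorted(self.committed))

    def read(self, snapshot):
        """Read only the chunks named in the snapshot, so reads are repeatable."""
        return [rec for name in snapshot for rec in self.committed[name]]


def roll(partition, next_value):
    """Sweep committed chunks into a standard partition and open a new streaming one."""
    standard = {
        "table": partition.table,
        "value": partition.value,
        "records": partition.read(partition.snapshot()),
    }
    fresh = StreamingPartition(partition.table, next_value)
    # Assumption: chunks still being written simply move to the new partition.
    fresh.pending = partition.pending
    return standard, fresh
```

A query that takes a snapshot before a later chunk is committed keeps seeing 
the same records on every read, which is the read-consistency property 
described in the last step.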



Partition rolling requires an active agent/thread running to check when it is 
time to roll and trigger the roll. This could be achieved either by using an 
external agent such as Oozie (preferably) or by an internal agent.
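
As a rough sketch of what such an agent has to do, the check can be reduced to 
comparing the current time against the open time of the streaming partition 
plus the configured frequency. The RollAgent class and its roll_fn callback 
are hypothetical names for illustration; an Oozie-based setup would express 
the same schedule through a coordinator rather than code like this.

```python
from datetime import datetime, timedelta


class RollAgent:
    """Decides when the current streaming partition is due to be rolled."""

    def __init__(self, opened_at, frequency, roll_fn):
        self.opened_at = opened_at    # when the current streaming partition opened
        self.frequency = frequency    # roll frequency given by the admin
        self.roll_fn = roll_fn        # hypothetical callback that performs the roll

    def is_due(self, now):
        return now >= self.opened_at + self.frequency

    def tick(self, now):
        """Call periodically; rolls at most once per call. Returns True if rolled."""
        if not self.is_due(now):
            return False
        self.roll_fn(self.opened_at)
        # Advance by the frequency rather than to `now`, so a late check
        # does not shift every later partition boundary.
        self.opened_at += self.frequency
        return True


# Example: hourly rolling, recording which partitions were rolled.
rolled = []
agent = RollAgent(datetime(2013, 3, 18, 10, 0), timedelta(hours=1), rolled.append)
agent.tick(datetime(2013, 3, 18, 10, 59))   # not due yet
agent.tick(datetime(2013, 3, 18, 11, 0))    # rolls the 10:00 partition
```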

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
