[
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776952#comment-13776952
]
Roshan Naik commented on HIVE-4196:
-----------------------------------
Thanks Ashutosh. Since your recommendations apply to subtask HIVE-5138, I have
copied your comments over to it. I will address them there.
> Support for Streaming Partitions in Hive
> ----------------------------------------
>
> Key: HIVE-4196
> URL: https://issues.apache.org/jira/browse/HIVE-4196
> Project: Hive
> Issue Type: New Feature
> Components: Database/Schema, HCatalog
> Affects Versions: 0.10.1
> Reporter: Roshan Naik
> Assignee: Roshan Naik
> Attachments: HCatalogStreamingIngestFunctionalSpecificationandDesign-
> apr 29- patch1.docx, HCatalogStreamingIngestFunctionalSpecificationandDesign-
> apr 29- patch1.pdf, HIVE-4196.v1.patch
>
>
> Motivation: Allow Hive users to immediately query data streaming in through
> clients such as Flume.
> Currently Hive partitions must be created after all the data for the
> partition is available. Thereafter, data in the partitions is considered
> immutable.
> This proposal introduces the notion of a streaming partition into which new
> files can be committed periodically and made available for queries before the
> partition is closed and converted into a standard partition.
> The admin enables a streaming partition on a table using DDL and provides the
> following pieces of information:
> - Name of the partition in the table on which streaming is enabled
> - Frequency at which the streaming partition should be closed and converted
> into a standard partition.
> Tables with streaming partitions enabled will be partitioned by one and only
> one column. It is assumed that this column will contain a timestamp.
> Closing the current streaming partition converts it into a standard
> partition. Based on the specified frequency, the current streaming partition
> is closed and a new one created for future writes. This is referred to as
> 'rolling the partition'.
> A streaming partition's life cycle is as follows:
> - A new streaming partition is instantiated for writes
> - Streaming clients request (via WebHCat) an HDFS file name into which
> they can write a chunk of records for a specific table.
> - Streaming clients write a chunk (via WebHDFS) to that file and commit
> it (via WebHCat). Committing merely indicates that the chunk has been written
> completely and is ready for serving queries.
> - When the partition is rolled, all committed chunks are swept into a single
> directory and a standard partition pointing to that directory is created. The
> streaming partition is closed and a new streaming partition is created. Rolling
> the partition is atomic. Streaming clients are agnostic of partition rolling.
>
> - Hive queries will be able to query the partition that is currently open
> for streaming. Only committed chunks will be visible. Read consistency will
> be ensured so that repeated reads of the same partition return the same
> results for the lifespan of the query.
> Partition rolling requires an active agent/thread running to check when it is
> time to roll and to trigger the roll. This could be achieved by using either
> an external agent such as Oozie (preferably) or an internal agent.
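
For readers following the life cycle described above, here is a minimal in-memory sketch of it. It is not Hive, WebHCat, or WebHDFS code: the class `StreamingTable` and its methods are hypothetical names made up for illustration, and the choice to carry uncommitted chunks over into the next streaming partition on a roll is an assumption (the description only says clients are agnostic of rolling).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StreamingTable:
    """Toy model of a table with one open streaming partition (hypothetical)."""
    open_partition: str
    # chunk file name -> True once the client has committed it
    chunks: Dict[str, bool] = field(default_factory=dict)
    # closed partition name -> chunk files swept into it
    standard_partitions: Dict[str, List[str]] = field(default_factory=dict)
    _next_id: int = 0

    def request_chunk_file(self) -> str:
        """Hand out a fresh file name for a client to write a chunk into."""
        name = f"chunk-{self._next_id}"
        self._next_id += 1
        self.chunks[name] = False
        return name

    def commit(self, name: str) -> None:
        """Mark a chunk as completely written and ready for queries."""
        if name not in self.chunks:
            raise KeyError(name)
        self.chunks[name] = True

    def visible_chunks(self) -> List[str]:
        """Snapshot of committed chunks; a query would read only these."""
        return sorted(n for n, done in self.chunks.items() if done)

    def roll(self, new_partition: str) -> None:
        """Close the open partition and start a new one.

        Committed chunks become a standard partition. Uncommitted chunks
        are kept for the new streaming partition (an assumption).
        """
        self.standard_partitions[self.open_partition] = self.visible_chunks()
        self.chunks = {n: d for n, d in self.chunks.items() if not d}
        self.open_partition = new_partition


table = StreamingTable(open_partition="hour-10")
first = table.request_chunk_file()
second = table.request_chunk_file()
table.commit(first)           # only `first` is now visible to queries
table.roll("hour-11")         # `first` moves to standard partition "hour-10"
table.commit(second)          # the client finishes without noticing the roll
```

A real implementation would also have to make the roll atomic and keep a query's view of the partition stable while chunks are being committed; the snapshot list returned by `visible_chunks` only hints at that.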