Traditionally, data in Hive was write once (insert), read many. You could append to tables and partitions, add new partitions, and so on, and you could remove data by dropping tables or partitions. But there were no updates of data and no deletes of particular rows. This is what was meant by immutable. Hive was originally designed this way because it was built on MapReduce and HDFS, and these were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way.

We added transactions, updates, and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case.

Alan.

Ashok Kumar <mailto:ashok34...@yahoo.com>
December 29, 2015 at 14:59
Hi,

Can someone please clarify what "immutable data" in Hive means?

I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data?

thanks and greetings



