Traditionally, data in Hive was write once (insert), read many. You could
append to tables and partitions, add new partitions, etc. You could
remove data by dropping tables or partitions. But there were no updates
of data and no deletes of particular rows. This is what was meant by
immutable. Hive was originally designed this way because it was built on
MapReduce and HDFS, and these were the natural semantics given those
underlying systems.
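
To make the write-once semantics above concrete, here is a toy Python model. It is only a sketch, not Hive code: the class `WriteOnceTable`, the function `fix_row`, and the partition names are all invented for illustration. It assumes a table is just a set of named partitions that can be appended to, dropped, or replaced wholesale, so changing a single row means rewriting the whole partition that contains it.

```python
class WriteOnceTable:
    """Toy stand-in (hypothetical) for a classic, non-transactional partitioned table."""

    def __init__(self):
        # partition name -> list of (key, value) row tuples
        self.partitions = {}

    def append(self, partition, rows):
        """Add rows to a partition, creating the partition if it does not exist."""
        self.partitions.setdefault(partition, []).extend(tuple(r) for r in rows)

    def drop_partition(self, partition):
        """Remove a whole partition; there is no way to remove a single row."""
        self.partitions.pop(partition, None)

    def overwrite_partition(self, partition, rows):
        """Replace a partition wholesale."""
        self.partitions[partition] = [tuple(r) for r in rows]


def fix_row(table, partition, key, new_value):
    """Emulate a row-level update by rewriting every row in the partition."""
    old_rows = table.partitions.get(partition, [])
    rewritten = [(k, new_value if k == key else v) for k, v in old_rows]
    table.overwrite_partition(partition, rewritten)


table = WriteOnceTable()
table.append("ds=2015-12-28", [("a", 1), ("b", 2)])
table.append("ds=2015-12-29", [("c", 3)])

# Correcting one wrong value costs a rewrite of the entire partition.
fix_row(table, "ds=2015-12-28", "b", 20)

# Deleting data is only possible at partition (or table) granularity.
table.drop_partition("ds=2015-12-29")
```

In this model the cost of fixing one row grows with the size of its partition, which is roughly why row-level changes were awkward before transactional tables.
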
For many use cases (e.g. ETL) this is sufficient, and the vast majority
of people still run Hive this way.

We added transactions, updates, and deletes to Hive because some use
cases require these features. Hive is being used more and more as a
data warehouse, and while updates and deletes are less common there, they
are still required (slowly changing dimensions, fixing wrong data,
deleting records for compliance, etc.). Also, streaming data into
warehouses from transactional systems is a common use case.
Alan.
Ashok Kumar <mailto:ashok34...@yahoo.com>
December 29, 2015 at 14:59
Hi,
Can someone please clarify what "immutable data" in Hive means?
I have been told that data in Hive is/should be immutable, but in that
case why do we need transactional tables in Hive that allow updates to data?
thanks and greetings