Eugene Koifman created HIVE-13479:
-------------------------------------
Summary: Relax sorting requirement in ACID tables
Key: HIVE-13479
URL: https://issues.apache.org/jira/browse/HIVE-13479
Project: Hive
Issue Type: New Feature
Components: Transactions
Affects Versions: 1.2.0
Reporter: Eugene Koifman
Assignee: Eugene Koifman
Currently ACID tables require data to be sorted according to the internal
primary key. This is so that base + delta files can be efficiently sort/merged
to produce the snapshot for the current transaction.
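To illustrate why the sort order matters today, here is a minimal Python sketch of a sort/merge over a base file and delta files. It is a simplified model, not Hive's actual reader: the event layout (key, txn_id, op, value) and the function name sort_merge_snapshot are assumptions made for illustration only.

```python
import heapq
from itertools import groupby


def sort_merge_snapshot(base, deltas):
    """Merge a base file with delta files that are all sorted by key.

    Hypothetical event layout: (key, txn_id, op, value), where op is
    "insert", "update" or "delete". Each input must already be sorted by
    (key, txn_id); that precondition is the sorting requirement.
    """
    # Streaming k-way merge: only one event per input is held in memory.
    merged = heapq.merge(base, *deltas, key=lambda e: (e[0], e[1]))
    snapshot = []
    for key, events in groupby(merged, key=lambda e: e[0]):
        # Events for one key arrive in txn order, so the last one wins.
        *_, last = events
        _, _, op, value = last
        if op != "delete":
            snapshot.append((key, value))
    return snapshot


base = [(1, 1, "insert", "a"), (2, 1, "insert", "b"), (3, 1, "insert", "c")]
delta_update = [(2, 2, "update", "B")]
delta_delete = [(3, 3, "delete", None)]
sorted_result = sort_merge_snapshot(base, [delta_update, delta_delete])
```

In this model the merge never buffers more than one event per file, which is only possible because every file shares the same sort order.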
This prevents the user from sorting the table by any other criteria, which
can be useful. One example is dynamic partition insert (which also occurs
for update/delete SQL). This may create lots of writers
(buckets*partitions) and tax cluster resources.
The usual solution is hive.optimize.sort.dynamic.partition=true, which is not
honored for ACID tables.
We could rely on a hash-table-based algorithm to merge delta files and then
not require any particular sort order on ACID tables. One way to do that is to
treat each update event as an insert (new internal PK) + delete (old PK).
Delete events are very small since they just need to contain PKs, so the hash
table would only need to hold delete events and would be reasonably memory
efficient.
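A rough Python sketch of the proposed hash-based merge, under the same simplifying assumptions: events are either ("insert", pk, value) or ("delete", pk), and the names write_update and hash_merge_snapshot are hypothetical, chosen only for this example. It is meant to show the idea, not a concrete design for Hive.

```python
def write_update(old_pk, new_pk, new_value):
    """Model an update as a delete of the old PK plus an insert with a new PK."""
    return [("delete", old_pk), ("insert", new_pk, new_value)]


def hash_merge_snapshot(files):
    """Build a snapshot from base + delta files stored in arbitrary order.

    Hypothetical event layout: ("insert", pk, value) or ("delete", pk).
    """
    files = list(files)  # the files are scanned twice
    # Pass 1: collect only the PKs of delete events into a hash set.
    deleted = set()
    for events in files:
        for event in events:
            if event[0] == "delete":
                deleted.add(event[1])
    # Pass 2: stream the insert events, dropping rows whose PK was deleted.
    snapshot = []
    for events in files:
        for event in events:
            if event[0] == "insert" and event[1] not in deleted:
                snapshot.append((event[1], event[2]))
    return snapshot


# The base file is deliberately unsorted: no sort order is assumed.
base = [("insert", 3, "c"), ("insert", 1, "a"), ("insert", 2, "b")]
delta_update = write_update(old_pk=2, new_pk=4, new_value="B")
delta_delete = [("delete", 3)]
hash_result = hash_merge_snapshot([base, delta_update, delta_delete])
```

The only state held in memory is the set of deleted PKs, so the cost grows with the number of delete events rather than with the size of the table.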
This is a significant amount of work but worth doing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)