> If I have a series of entries that look like
...
> { "update", {"baz" : "bar" }}
Due to the way the split distribution works, you need a global ordering
key for each operation.
0, "ADD", "baz", ""
1, "SET", "baz", "bar"
2, "DEL", "baz", null
If no two updates ever arrive within the same second, you could use a
timestamp as the ordering key.
Then you can write a windowing function for Hive (a custom aggregate,
called flatten_txns here) to order and merge them.
select flatten_txns(op, key, value) over (partition by key order by ts)
from txns;
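
To make the merge semantics concrete, here is a rough sketch in Python of
what a function like flatten_txns might compute. It is not Hive code, and the
tuple layout (seq, op, key, value) and the last-write-wins rule are my
assumptions based on the sample rows above: replay the operations in ordering-key
order, treat ADD and SET as unconditional overwrites, and drop the key on DEL.

```python
from typing import Dict, Iterable, Optional, Tuple

# One operation: (seq, op, key, value). "seq" is the global ordering key.
# This layout is an assumption modelled on the sample rows in this thread.
Op = Tuple[int, str, str, Optional[str]]


def flatten_txns(ops: Iterable[Op]) -> Dict[str, Optional[str]]:
    """Replay operations in seq order; the last operation per key wins."""
    state: Dict[str, Optional[str]] = {}
    for _seq, op, key, value in sorted(ops, key=lambda o: o[0]):
        if op in ("ADD", "SET"):
            state[key] = value    # unconditional overwrite, no existence check
        elif op == "DEL":
            state.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {op!r}")
    return state


if __name__ == "__main__":
    # The three sample operations, deliberately out of order.
    ops = [
        (1, "SET", "baz", "bar"),
        (2, "DEL", "baz", None),
        (0, "ADD", "baz", ""),
    ]
    print(flatten_txns(ops))
```

Because every ADD/SET simply overwrites, the result depends only on the
ordering key, which is why that key has to be global.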
At this point, you're nearly reinventing what Hive's own
insert/update/delete statements do.
The difference is that these updates are faster, since each one is really
an unconditional SET.
Cheers,
Gopal