Hello, HCat-Dev.

I'm working on modifying the HCat messages (sent over JMS/ActiveMQ, for 
partition-add/delete) so that clients (such as Oozie) would have an easier 
time consuming them.
Here are some limitations of what's available currently:
1. The present implementation in HCatalog (branch-0.4/) seems to send the 
entire Partition (Java) instance in serialized fashion. Since the 
partition-parameters, hdfs-location etc. are all serialized, the messages are 
rather, emm, garrulous.
2. There doesn't seem to be any support for versioning either. So when new 
fields are added, older clients won't work at all without an update.

Could we consider transmitting only that info which identifies the partitions 
that pertain to the operation (e.g. partition keys), and drop any information 
that might be gathered from querying the metadata (e.g. storage location, 
partition-parameters, etc.)?

We're also considering that the initial implementation encode the ActiveMQ 
payload in JSON.  Here's an example of the proposed message format for an 
"add_partition" operation:

"add_partition": {
  "hcat_server" : "thrift://my.hcat.server:9080",
  "hcat_service_principal" : "hcat/[email protected]",
  "db": "default",
  "table": "starling_jobs",
  "partitions":
    [
      {"grid": "AxoniteBlue", "dt": "2012_10_25"}, // Sets of partition-keys.
      {"grid": "AxoniteBlue", "dt": "2012_10_26"},
      {"grid": "AxoniteBlue", "dt": "2012_10_27"},
      {"grid": "AxoniteBlue", "dt": "2012_10_28"}
    ],
  "timestamp": "1351534729" // In this case, interpreted as creation-time.
}
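
To make the shape concrete, here is a rough Python sketch that builds the same 
body. Python is used only for brevity (the real producer would be Java code in 
HCatalog); the function name build_add_partition_message and the principal 
string are made up for illustration. The "//" annotations in the example above 
would not appear in a real payload, since JSON has no comment syntax.

```python
import json

def build_add_partition_message(server, principal, db, table, partitions, timestamp):
    """Return the JSON text for an "add_partition" message (illustrative only)."""
    body = {
        "hcat_server": server,
        "hcat_service_principal": principal,
        "db": db,
        "table": table,
        # Each entry is one set of partition-keys (key name -> string value).
        "partitions": [dict(p) for p in partitions],
        # Kept as a string, as in the example; here it means creation-time.
        "timestamp": str(timestamp),
    }
    # The operation name is the single top-level key.
    return json.dumps({"add_partition": body})

message = build_add_partition_message(
    "thrift://my.hcat.server:9080",
    "hcat/placeholder-principal",  # hypothetical value, not a real principal
    "default",
    "starling_jobs",
    [{"grid": "AxoniteBlue", "dt": "2012_10_25"},
     {"grid": "AxoniteBlue", "dt": "2012_10_26"}],
    1351534729,
)
```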

If we continue to use JMS MapMessages, we could consider having 3 keys in the 
map:
1. version = "1" (for the first implementation. Increment as we go.)
2. format = "json" (We could consider adding different formats if we choose.)
3. message = <the json message body, as above.>

The version and format help a factory choose the right implementation to 
deserialize the message. (A client-side library we supply to Oozie should hide 
this and provide POJOs.)
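
A minimal sketch of that factory idea, in Python for brevity. A plain dict 
stands in for the JMS MapMessage, and the names DESERIALIZERS and deserialize 
are hypothetical; the actual client library would be Java and would return 
POJOs rather than dicts.

```python
import json

def _decode_json_v1(message_text):
    """Version 1, JSON format: the message is the JSON document itself."""
    return json.loads(message_text)

# Registry keyed by (version, format). New versions or formats add entries here.
DESERIALIZERS = {
    ("1", "json"): _decode_json_v1,
}

def deserialize(map_message):
    """Pick a deserializer from the 'version' and 'format' keys of the map."""
    key = (map_message["version"], map_message["format"])
    try:
        decoder = DESERIALIZERS[key]
    except KeyError:
        raise ValueError("unsupported message version/format: %r" % (key,))
    return decoder(map_message["message"])

# A plain dict standing in for the three-key MapMessage.
example = {
    "version": "1",
    "format": "json",
    "message": '{"add_partition": {"db": "default", "table": "starling_jobs"}}',
}
decoded = deserialize(example)
```

An older client that receives a version it doesn't know fails with a clear 
error here, instead of failing somewhere inside deserialization.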

Since the "partitions" field is an array, and since the values corresponding to 
partition-keys are all strings, we'd be able to accommodate partial 
partition-specs, or even wild-cards. This might help us add support for 
"mark-set-done" later on.
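
For illustration, this is roughly how a consumer might match a partial or 
wild-carded spec against a concrete partition. The wild-card syntax is an 
assumption on my part (shell-style "*", via Python's fnmatch); the proposal 
doesn't define one yet, and spec_matches is a made-up name.

```python
from fnmatch import fnmatchcase

def spec_matches(spec, partition):
    """True if every key in `spec` matches the partition's value.

    Keys absent from `spec` are unconstrained (a partial spec). Values may
    contain shell-style wildcards such as "*" (assumed syntax).
    """
    return all(
        key in partition and fnmatchcase(partition[key], pattern)
        for key, pattern in spec.items()
    )

partition = {"grid": "AxoniteBlue", "dt": "2012_10_25"}

full_match = spec_matches({"grid": "AxoniteBlue", "dt": "2012_10_25"}, partition)
partial_match = spec_matches({"grid": "AxoniteBlue"}, partition)      # no "dt" constraint
wildcard_match = spec_matches({"dt": "2012_10_*"}, partition)         # any day in 2012_10
no_match = spec_matches({"grid": "AxoniteBlue", "dt": "2012_11_*"}, partition)
```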

The first key ("add_partition", "drop_partition" or "alter_partition") 
indicates the operation, and the value indicates the record-body. (At first 
glance, the record-body doesn't change for these operations. But that might 
change, so we'll keep them distinct.)
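
A small sketch of how a consumer could pull the operation and record-body 
apart. It assumes the payload is a JSON object with exactly one top-level key 
(i.e. the example above wrapped in braces); split_operation and 
KNOWN_OPERATIONS are illustrative names only.

```python
import json

KNOWN_OPERATIONS = ("add_partition", "drop_partition", "alter_partition")

def split_operation(message_text):
    """Return (operation, record_body) from a single-key JSON message."""
    doc = json.loads(message_text)
    if len(doc) != 1:
        raise ValueError("expected exactly one top-level key, got %d" % len(doc))
    # Unpack the only (key, value) pair in the object.
    (operation, body), = doc.items()
    if operation not in KNOWN_OPERATIONS:
        raise ValueError("unknown operation: %r" % operation)
    return operation, body

operation, body = split_operation(
    '{"drop_partition": {"db": "default", "table": "starling_jobs", '
    '"partitions": [{"grid": "AxoniteBlue", "dt": "2012_10_25"}]}}'
)
```

Keeping the operation as the key means each operation's record-body can 
diverge later without touching the others.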

Also note that HiveMetaStore::add_partitions_core() currently doesn't send a 
single message for the entire set of partitions being added. Instead we get one 
message per partition. This could be verbose and sub-optimal. We'll tackle this 
sort of thing after we've nailed the format down.

I'm toying with the idea of adding an "other" property, an array of key-values 
to accommodate stuff we hadn't considered, at "run-time" (like if we want to 
introduce a hack). The need for such a property is contingent on the behaviour 
of Jackson w.r.t. newly added properties in the record-body. (I'll run 
experiments and keep you posted.)
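
As far as I know, Jackson's default is to fail on unrecognized properties 
(the FAIL_ON_UNKNOWN_PROPERTIES feature) unless that is switched off or the 
POJO is annotated to ignore unknowns, so this is worth confirming in the 
experiments. Independent of Jackson, the behaviour we'd want from a tolerant 
reader looks roughly like the Python sketch below, where unrecognized 
properties are set aside rather than rejected. KNOWN_FIELDS and 
read_record_body are made-up names.

```python
KNOWN_FIELDS = {
    "hcat_server", "hcat_service_principal", "db", "table",
    "partitions", "timestamp",
}

def read_record_body(body):
    """Split a record-body into (known, other).

    `known` holds the fields this client version understands; `other` holds
    anything added later, so an older client keeps working.
    """
    known = {k: v for k, v in body.items() if k in KNOWN_FIELDS}
    other = {k: v for k, v in body.items() if k not in KNOWN_FIELDS}
    return known, other

known, other = read_record_body({
    "db": "default",
    "table": "starling_jobs",
    "partitions": [{"grid": "AxoniteBlue", "dt": "2012_10_25"}],
    "some_future_field": "value",  # hypothetical field a newer server might add
})
```

If the reader can be made tolerant like this, an explicit "other" property 
may turn out to be unnecessary.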

What do you think?

Mithun
