Hi all,
I have a table with a lot of logging / metrics data that looks like this:

  id: STRING
  host_details: ARRAY<STRUCT<key: STRING, value: STRUCT>>
  metrics: ARRAY<STRUCT<key: STRING, value: STRUCT>>

I would like to be able to extract known keys and their values from each of the host_details and metrics arrays-of-structs. Yes, this format sucks and could be a map, but technically the keys aren't unique, although I am happy to make that compromise at this point. In reality the table has more fields in the struct.

In Spark or MR, I could extract all the fields I need and project them to columns with only a single pass over the table. I'm struggling to achieve the same in Hive, which may be down to either my SQL ability or the lack of a UDF that meets my needs.

The most efficient approach I have found so far uses the Brickhouse collect() UDF and two lateral views generated from the inline() UDTF:

  SELECT t.id, collect(t.key, t.value) AS map_h_m
  FROM (
    SELECT id, hds.key AS key, hds.value AS value
    FROM metrics
    LATERAL VIEW inline(metrics.host_details) hds
    UNION ALL
    SELECT id, ms.key AS key, ms.value AS value
    FROM metrics
    LATERAL VIEW inline(metrics.metrics) ms
  ) t
  GROUP BY t.id;

I can create this as a VIEW and then extract columns from the map using the standard map_h_m['error_count'] notation.

Using collect() / GROUP BY seems more efficient than doing a join by id (I think it cuts out a shuffle), but it still has problems:
- it still scans the metrics table multiple times
- it assumes that keys are unique across both host_details and metrics

Does anyone have any thoughts before I attempt to write my own UDF(s) to assist?

Many thanks,
Tom
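
P.S. To make the "single pass" goal concrete, this is roughly the shape I'm hoping to express. It's only a sketch, and it assumes the Brickhouse combine() UDF will merge the two arrays of structs (I haven't verified it accepts this element type):

  -- Sketch only: merge both arrays, explode once, collect into one map.
  -- Assumes combine() accepts two ARRAY<STRUCT<key, value>> arguments.
  SELECT id,
         collect(kv.key, kv.value) AS map_h_m
  FROM metrics
  LATERAL VIEW inline(combine(metrics.host_details, metrics.metrics)) kv
  GROUP BY id;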
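
And if the key-uniqueness compromise turns out to bite, I imagine I could tag each key with its source array at collect time (the 'hd:' / 'm:' prefixes below are just an illustration), although that still leaves the multiple-scan problem:

  -- Prefix keys by source array so e.g. 'error_count' from host_details
  -- and from metrics can no longer collide in the collected map.
  SELECT t.id, collect(t.key, t.value) AS map_h_m
  FROM (
    SELECT id, concat('hd:', hds.key) AS key, hds.value AS value
    FROM metrics
    LATERAL VIEW inline(metrics.host_details) hds
    UNION ALL
    SELECT id, concat('m:', ms.key) AS key, ms.value AS value
    FROM metrics
    LATERAL VIEW inline(metrics.metrics) ms
  ) t
  GROUP BY t.id;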
