Hi all,

I have a table with a lot of logging / metrics data that looks like this:


id: STRING
host_details: ARRAY<STRUCT<key: STRING, value: STRUCT>>
metrics: ARRAY<STRUCT<key: STRING, value: STRUCT>>


I would like to be able to extract known keys and their values from each of
the host_details and metrics arrays-of-struct. Yes, this format sucks and
could be a map, but technically the keys aren't unique, although I'm happy
to make that compromise at this point. In reality the structs have more
fields than shown here.
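
For reference, the table DDL is roughly as follows (a simplified sketch; the
v field inside the value struct is just a placeholder for the extra fields):

CREATE TABLE metrics (
  id           STRING,
  host_details ARRAY<STRUCT<key: STRING, value: STRUCT<v: STRING>>>,
  metrics      ARRAY<STRUCT<key: STRING, value: STRUCT<v: STRING>>>
);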


In Spark or MR I could extract all the fields I need and project them to
columns with only a single pass over the table. I'm struggling to achieve
the same in Hive, which may be down to my SQL ability, or it may be that
there isn't a UDF available to meet my needs.


The most efficient approach I have found so far uses the Brickhouse
collect() UDF and two lateral views generated from the inline() UDTF:


SELECT
  id,
  collect(t.key, t.value) AS map_h_m
FROM
  (SELECT id, hds.key AS key, hds.value AS value
   FROM metrics
   LATERAL VIEW inline(metrics.host_details) hds AS key, value
   UNION ALL
   SELECT id, ms.key AS key, ms.value AS value
   FROM metrics
   LATERAL VIEW inline(metrics.metrics) ms AS key, value) t
GROUP BY t.id


I can create this as a VIEW and then extract columns from the map using the
standard map_h_m['error_count'] notation.
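
For example, assuming the query above is saved as a view (the view name
metrics_kv and the 'hostname' key are made up for illustration):

SELECT
  id,
  map_h_m['error_count'] AS error_count,
  map_h_m['hostname']    AS hostname
FROM metrics_kv;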


Using collect() / GROUP BY seems more efficient than doing a join by id (I
think it cuts out a shuffle), but it still has problems:

   - it's still scanning the metrics table twice, once per branch of the
   UNION ALL (one possible way around that is sketched below)
   - it assumes that keys are unique across both host_details and metrics
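
The idea I've been toying with for the double scan (untested, and it assumes
the Brickhouse combine() UDF will concatenate two arrays of the same struct
type; it does nothing for the key-collision problem) is roughly:

SELECT
  id,
  collect(kv.key, kv.value) AS map_h_m
FROM metrics
LATERAL VIEW inline(combine(metrics.host_details, metrics.metrics)) kv AS key, value
GROUP BY id

For the collisions I could presumably prefix the keys in each branch of the
UNION ALL version (e.g. concat('hd_', key) vs concat('m_', key)), but that
keeps the double scan.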


Does anyone have any thoughts before I attempt to write my own UDF(s) to
assist?


Many thanks,

Tom
