Creating the table
CREATE TABLE foo
(tags MAP<String, String>, size int);
INSERT INTO foo VALUES
(map("country", "USA"), 10),
(map("country", "USA"), 20),
(map("country", "UK"), 30);
SELECT * FROM foo
+--------------------+-----------+
| foo.tags | foo.size |
+--------------------+-----------+
| {"country":"USA"} | 10 |
| {"country":"USA"} | 20 |
| {"country":"UK"} | 30 |
+--------------------+-----------+
Aggregating the Table
SELECT DISTINCT tags["country"] from foo;
+-------+
| _c0 |
+-------+
| USA |
| NULL |
+-------+
SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"];
+-------+------+
| _c0 | _c1 |
+-------+------+
| USA | 10 |
| NULL | 50 |
+-------+------+
And even more strangely, with a subquery:
SELECT flattened.country, sum(flattened.size)
FROM (
SELECT tags["country"] as country, size
FROM foo
WHERE tags["country"] IS NOT NULL
) as flattened
GROUP BY flattened.country;
+--------------------+------+
| flattened.country | _c1 |
+--------------------+------+
| USA | 10 |
+--------------------+------+
---------------
Is there any way to stop this from happening and get the correct aggregation
behaviour? The only method I've found is to create a new table using the query,
write that to disk and then do the aggregation on that.