I tested the example on Hive 2.3.6, and it returned correct results. Hive
3.1.2 and 4.0.0-SNAPSHOT also returned correct results. So, I guess, if
this is a bug, it was introduced somewhere around Hive 3.0 and fixed in
3.1.2.
On Hive 2.3.6, I used these commands instead:
create table dummy(a string); insert into table dummy values ('a');
insert into foo select map("country", "USA"), 10 from dummy;
insert into foo select map("country", "USA"), 20 from dummy;
insert into foo select map("country", "UK"), 30 from dummy;
Cheers,
--- Sungwoo
On Wed, Feb 12, 2020 at 10:44 PM Nakul Khanna (BLOOMBERG/ LONDON) <
[email protected]> wrote:
> Hey Zoltan,
>
> Thanks for the response. When I call "select version()" I get:
>
> 3.1.0.3.1.4.0-315 re8d79f440455fa4400daf79974666b3055f1730f
>
> So a couple of patch versions old - any idea if this was a known bug
> before?
>
> Regards,
> Nakul
>
> From: [email protected] At: 02/12/20 12:31:33
> To: Nakul Khanna (BLOOMBERG/ LONDON ) <[email protected]>,
> [email protected]
> Cc: Jacky Lee (BLOOMBERG/ PRINCETON ) <[email protected]>, He Chen
> (BLOOMBERG/ PRINCETON ) <[email protected]>, Peter Babinski
> (BLOOMBERG/ PRINCETON ) <[email protected]>, Bernat Gabor
> (BLOOMBERG/ LONDON ) <[email protected]>, Shashank Singh (BLOOMBERG/
> PRINCETON ) <[email protected]>
> Subject: Re: Issues with aggregating on map values
>
> Hey Nakul!
>
> It's not clear which version you are using;
> I've checked this issue on apache/master and the 3.1.2 release - and both
> of
> them returned accurate results.
> You could execute: 'select version()' ; or run 'hive --version' in a
> commandline
>
> cheers,
> Zoltan
>
> On 2/11/20 11:38 AM, Nakul Khanna (BLOOMBERG/ LONDON) wrote:
> > Creating the table
> >
> > CREATE TABLE foo
> > (tags MAP<String, String>, size int);
> > INSERT INTO foo VALUES
> > (map("country", "USA"), 10),
> > (map("country", "USA"), 20),
> > (map("country", "UK"), 30);
> >
> > SELECT * FROM foo
> >
> > +--------------------+-----------+
> > | foo.tags | foo.size |
> > +--------------------+-----------+
> > | {"country":"USA"} | 10 |
> > | {"country":"USA"} | 20 |
> > | {"country":"UK"} | 30 |
> > +--------------------+-----------+
> >
> > Aggregating the Table
> >
> > SELECT DISTINCT tags["country"] from foo;
> >
> > +-------+
> > | _c0 |
> > +-------+
> > | USA |
> > | NULL |
> > +-------+
> >
> > SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"];
> >
> > +-------+------+
> > | _c0 | _c1 |
> > +-------+------+
> > | USA | 10 |
> > | NULL | 50 |
> > +-------+------+
> >
> > And even more strangely, with a subquery:
> >
> > SELECT flattened.country, sum(flattened.size)
> > FROM (
> > SELECT tags["country"] as country, size
> > FROM foo
> > WHERE tags["country"] IS NOT NULL
> > ) as flattened
> > GROUP BY flattened.country;
> >
> > +--------------------+------+
> > | flattened.country | _c1 |
> > +--------------------+------+
> > | USA | 10 |
> > +--------------------+------+
> >
> > ---------------
> >
> > Is there any way to stop this from happening and get the correct
> aggregation
> behaviour? The only method I've found is to create a new table using the
> query,
> write that to
> > disk and then do the aggregation on that.
>
>
>