Hive Query how to : group by and UDTF on the resulting records

vivek thakre Sun, 10 Mar 2013 19:49:24 -0700

Hello,

I have a table with userid, movieId and some more columns say c1, c2, c3....
I want to group the records by userId and then do some processing on those
records (for each user) and
output less number of records (or same number of records) based on some
logic.


The processing involves considering all the records at once for that user
say n, do some comparison, filtering, composition, add some more columns
etc and output m records.
m <= n

I can achieve this first creating a view by using a group by clause
(concatenating all the columns) and using collect_set.
This way for each user I would get an array containing all the rows and all
the columns.

Using that view I can then apply UDTF to do the processing on each row and
output more rows

Here is the example

uid1 m1 c1r1 c2r1 c3r1
uid1 m2 c1r2 c2r2 c3r2
uid1 m1c1r3 c2r3 c3r3
uid2 m4 c1r4 c2r4 c3r4
uid2 m1 c1r5 c2r5 c3r5
uid2 m2 c1r6 c2r6 c3r6
uid2 m6 c1r7 c2r7 c3r7

CREATE VIEW groupByUserId_view (all_user_col) as select
collect_set(concat_ws('\t',userid,m,c1r,c2r,c3r)) from userMovie group by
userid;

(the view will have only two rows for uid1 and uid2)

SELECT processing_UDTF(all_user_col) as final_result from groupByUserId_view

The final output is

uid1 m1 c1r1 c2r1 c3r1 c4r1
uid1 m2 c1r2 c2r2 c3r2 c4r2
uid2 m1 c1r4 c2r4 c3r4 c4r3
uid2 m2 c1r7 c2r7 c3r7 c4r4

(based on the processing logic, both uid1 and uid2 has now only 2 rows and
and extra column).

I want to ask , is there a better way to achieve this?

Thank you

Hive Query how to : group by and UDTF on the resulting records

Reply via email to