I've used a rank udf for this previously, distribute and sort by the column 
then select all rows where rank=1. That should work with a join but I never 
tried it. It'd be an issue if the join outputs a lot of records that then are 
all dropped since that'd slow down the query.

I've actually forked Hive internally and added a distinct join based on the, 
now deprecated I guess, unique join code. It's ugly in terms of syntax and I 
haven't had a chance to open source it but it allows a good amount of control 
over what is joined to what (ie: select the row in table A whose column x is 
closets to column y in table B, for example request time). I really wish Hive 
had better support for such "non-SQL" types of queries which are common in a 
world of unstructured and un-clean data.

-Marcin

From: Sunita Arvind [mailto:sunitarv...@gmail.com]
Sent: Tuesday, July 30, 2013 11:00 AM
To: user@hive.apache.org
Subject: Hive Join with distinct rows

Hi Praveen / All,

I also have a requirement similar to the one explained (by Praveen) below:
distinct rows on a single column with corresponding data from other columns.

http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3ccahmb8ta+r0h5a+armutookhkp8fxctown68qoz6lkfmwbrk...@mail.gmail.com%3E
This email thread dates back to Nov 2012 and is a very common use case.I just 
wanted to check if there is a solution already or we still need to write a UDF.
regards
Sunita

Reply via email to