Re: Remove duplicate records in Hive

2014-09-10 Thread vivek thakre
Considering that the records differ by only one column, i.e. the first two columns are unique (distinct), you can simply use group by with max as the aggregation function to eliminate duplicates, i.e. select cno, sqno, max(date) from table group by cno, sqno. If the above assumption is not true i
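The group-by-with-max idea from this reply can be tried on a tiny in-memory table. This is only a sketch: SQLite stands in for Hive, and the table name `records`, the column name `dt` (used in place of `date`), and the sample rows are assumptions made for illustration, not part of the original thread.

```python
import sqlite3


def dedupe_latest(rows):
    """Keep one row per (cno, sqno) pair, carrying the largest dt value."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only the three columns named in the reply.
    conn.execute("CREATE TABLE records (cno INTEGER, sqno INTEGER, dt TEXT)")
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT cno, sqno, MAX(dt) FROM records "
        "GROUP BY cno, sqno ORDER BY cno, sqno"
    ).fetchall()
    conn.close()
    return result


# Two rows share (1, 10) and differ only in the date column.
sample = [(1, 10, "2014-09-01"), (1, 10, "2014-09-05"), (2, 20, "2014-09-03")]
deduped = dedupe_latest(sample)
```

This only works when the rows in a group differ in the aggregated column alone, which is the assumption the reply states up front.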

Setting env variables through Hive

2013-10-07 Thread vivek thakre
Hello, I am using some legacy binaries for streaming in Hive. These binaries depend on libraries that are installed on all nodes of the cluster under /user/project_name/lib. The env variable I want to set is LD_LIBRARY_PATH, something like LD_LIBRARY_PATH=/user/project_name/lib. I tried
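One possible workaround, sketched here under assumptions and not taken from the thread, is to stream through a small wrapper script that sets LD_LIBRARY_PATH itself before launching the legacy binary. The function names and the idea of a wrapper are hypothetical; only the library directory comes from the message.

```python
import os
import subprocess

LIB_DIR = "/user/project_name/lib"  # directory named in the message


def with_library_path(env, lib_dir=LIB_DIR):
    """Return a copy of env with lib_dir prepended to LD_LIBRARY_PATH."""
    updated = dict(env)
    existing = updated.get("LD_LIBRARY_PATH", "")
    updated["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return updated


def run_legacy(binary_path, args=()):
    """Run a (hypothetical) legacy binary with the adjusted environment.

    stdin and stdout are inherited, so rows streamed in by Hive pass
    straight through to the binary and back.
    """
    return subprocess.call([binary_path, *args], env=with_library_path(os.environ))
```

The wrapper leaves the rest of the environment untouched and keeps any LD_LIBRARY_PATH entries that were already set.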

Re: Lookup objects in distributed cache

2013-04-04 Thread vivek thakre
the overhead is relatively small (reading 15MB per mapper is negligible compared to several GB of processed data). Best regards, Jan. On Wed, Apr 3, 2013 at 10:35 PM, vivek thakre wrote: Hello, I want to write a functionality using UDT

Lookup objects in distributed cache

2013-04-03 Thread vivek thakre
Hello, I want to implement some functionality using a UDTF. It involves reading 7 different text files and creating lookup structures such as Map, Set, List, Map of String and List, etc., to be used in the logic. These files are small, about 15 MB each on average. I can add these files in distributed ca
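The lookup structures described in this message could be built once per task from the cached files. The sketch below uses Python (as in a Hive streaming script) instead of a Java UDTF, and the tab-separated key/value file layout is an assumption, since the message does not describe the file formats.

```python
from collections import defaultdict


def load_set(lines):
    """Build a Set from one value per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def load_map(lines, sep="\t"):
    """Build a Map from 'key<sep>value' lines; later keys overwrite earlier ones."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split(sep, 1)
        mapping[key] = value
    return mapping


def load_map_of_lists(lines, sep="\t"):
    """Build a Map of String to List, collecting every value seen per key."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split(sep, 1)
        grouped[key].append(value)
    return dict(grouped)
```

Each loader accepts any iterable of lines, so an open file handle works as well as a list. Loading once at start-up and reusing the structures for every row keeps the per-row cost to a dictionary or set lookup.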

Hive Query how to : group by and UDTF on the resulting records

2013-03-10 Thread vivek thakre
Hello, I have a table with userid, movieId, and some more columns, say c1, c2, c3. I want to group the records by userId, then do some processing on those records (for each user) and output fewer records (or the same number) based on some logic. The processing involves conside
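The group-then-process shape described here can be sketched as a function over rows that arrive already clustered by userid, for example from a Hive query that uses CLUSTER BY and streams rows through a TRANSFORM script. The per-user rule below (keep the first row for each distinct movieId) is purely hypothetical, since the message is cut off before the actual logic is described.

```python
from itertools import groupby


def process_user(userid, rows):
    """Hypothetical per-user rule: keep the first row seen for each movieId."""
    seen = set()
    kept = []
    for row in rows:
        movie_id = row[1]
        if movie_id not in seen:
            seen.add(movie_id)
            kept.append(row)
    return kept


def process_grouped(rows):
    """Apply process_user to each run of consecutive rows sharing a userid.

    Rows must already be grouped by userid (column 0); groupby only merges
    adjacent rows with equal keys.
    """
    output = []
    for userid, group in groupby(rows, key=lambda row: row[0]):
        output.extend(process_user(userid, list(group)))
    return output
```

Each user's rows are handled independently, so the output holds the same number of rows as the input or fewer, matching the requirement in the message.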