You might also look at
http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive
for a way to utilize secondary sort for analytic windowing functions.
RANK() OVER(...) will require grouping and sorting. While it can be done
in the mapper or reducer stage, it is better to utilize Hadoop's shuffle
properties to accomplish both of them. The disadvantage may be that you
can compute only one RANK() in a MapReduce job.
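To make the shuffle-based idea concrete, here is a minimal sketch in plain Python rather than Hadoop code. It only imitates the three stages in memory. The record layout (dept, salary, name) and all function names are assumptions invented for this illustration. The mapper emits a composite key of (partition column, order column), the sort stands in for the shuffle, and the reducer groups on the first key component only and assigns ranks in a single pass.

```python
from itertools import groupby
from operator import itemgetter

_UNSET = object()  # sentinel that never equals a real ordering value


def map_phase(records):
    """Emit ((partition_key, order_key), record) pairs, as a mapper would."""
    for dept, salary, name in records:
        yield (dept, salary), (dept, salary, name)


def shuffle(pairs):
    """Stand-in for the shuffle: sort by the full composite key.

    In a real job a partitioner and grouping comparator would look only at
    the first key component; sorting tuples gives the same row order here.
    """
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Group on the partition key only and rank rows in arrival order."""
    for _dept, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        rank, prev, seen = 0, _UNSET, 0
        for (_, salary), record in group:
            seen += 1
            if salary != prev:  # ties share a rank, as with SQL RANK()
                rank, prev = seen, salary
            yield record + (rank,)


def rank_over(records):
    """RANK() OVER (PARTITION BY dept ORDER BY salary), ascending."""
    return list(reduce_phase(shuffle(map_phase(records))))


if __name__ == "__main__":
    rows = [("b", 10, "x"), ("a", 5, "y"), ("a", 5, "z"), ("a", 7, "w")]
    for row in rank_over(rows):
        print(row)
```

Because the job's sort order is tied to one composite key, a second RANK() with a different partitioning or ordering would need its own shuffle, which is the limitation mentioned above.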
--
Alex K
On Fri, Apr 20, 2012 at 10:54 AM, Philip Tromans philip.j.trom...@gmail.com
wrote:
Have a read of the thread "Lag function in Hive", linked from:
http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/thread
There's an example of how to force a function to run reduce-side. I've
written a UDF which replicates RANK () OVER (...), but it requires the
syntactic sugar given in the thread. I'd like to make changes to the
Hive query planner at some point, so that you can annotate a UDF with
a "run on reducer" hint, and after that I'd happily open source
everything. If you want more details of how to implement your own
partitionedRowNumber() UDF then I'd be happy to elaborate.
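The UDF itself is not shown in this thread, so the following is only a guess at the general shape of such a function, written as a small Python class for illustration (a real Hive UDF would be written in Java, and the class and method names here are assumptions). The idea is a counter that restarts whenever the partition key changes, which only gives meaningful numbers if the rows reach it already distributed and sorted by that key, i.e. reduce-side.

```python
class PartitionedRowNumber:
    """Hypothetical stateful row counter that resets on each new key."""

    def __init__(self):
        self._last_key = object()  # sentinel: differs from any real key
        self._count = 0

    def evaluate(self, key):
        """Return 1 for the first row of a key, 2 for the next, and so on."""
        if key != self._last_key:
            self._last_key = key
            self._count = 0
        self._count += 1
        return self._count


if __name__ == "__main__":
    # Rows sorted by key: numbering restarts cleanly per key.
    sorted_fn = PartitionedRowNumber()
    print([sorted_fn.evaluate(k) for k in ["a", "a", "a", "b", "b"]])

    # Unsorted rows (what a map-side call may see): "a" restarts twice.
    unsorted_fn = PartitionedRowNumber()
    print([unsorted_fn.evaluate(k) for k in ["a", "b", "a", "a"]])
```

The second call shows why the function has to be forced to run after the sort: on unsorted input the counter resets every time the key flips, so the numbers are not a per-partition row number.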
Cheers,
Phil.
On 20 April 2012 18:35, Mark Grover mgro...@oanda.com wrote:
Hi Ranjan and Justin,
As per my understanding, the scope of a UDF is only one row of data at a
time. Therefore, it can be done all map side without the need for the
reducer being involved. Now, depending on where you are storing the result
of the query, your query may have reducers that do something.
A simple query like the one Ranjan mentioned
select MyUDF(field1,field2) from table;
should have the UDF's evaluate() method called in the map phase.
Now to Justin's question,
rank function (
http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx)
seems to have a syntax like:
RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )
The rank function works on a collection of rows (distributed by some
column - the same one you would use in your partition_by_clause in MS SQL).
You can accomplish that using a UDAF (read more about them at
https://cwiki.apache.org/Hive/genericudafcasestudy.html) or by writing a
custom reducer (read about that at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
).
I don't think rank can be done using a UDF.
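For the custom reducer route, a streaming script is one option. Below is a minimal sketch, assuming tab-separated rows in which column 0 is the partition column and column 1 is the ordering column, and assuming the rows arrive already grouped by column 0 and sorted by column 1 (for example via DISTRIBUTE BY and SORT BY in the subquery feeding TRANSFORM). The function name and the column layout are made up for this example.

```python
def rank_lines(lines):
    """Append a RANK()-style column to tab-separated rows.

    Assumes rows are grouped by column 0 and sorted by column 1 on arrival.
    """
    current_part = None
    prev_value = None
    position = 0
    rank = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        part, value = fields[0], fields[1]
        if part != current_part:
            # New partition: restart the numbering.
            current_part, prev_value, position = part, None, 0
        position += 1
        if value != prev_value:
            # A new ordering value takes its position as rank; ties keep
            # the earlier rank, which leaves gaps the way RANK() does.
            rank, prev_value = position, value
        yield "\t".join(fields + [str(rank)])


if __name__ == "__main__":
    # In-memory sample; a real TRANSFORM script would iterate over
    # sys.stdin instead and print each result line to stdout.
    sample = ["a\t5\n", "a\t5\n", "a\t7\n", "b\t1\n"]
    for out in rank_lines(sample):
        print(out)
```

Since the script sees whole groups of rows in order, it can keep state across rows, which a one-row-at-a-time UDF on the map side cannot rely on.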
Good luck!
Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation
----- Original Message -----
From: Justin Coffey jqcof...@gmail.com
To: user@hive.apache.org
Sent: Thursday, April 19, 2012 10:29:11 AM
Subject: Re: Lifecycle and Configuration of a hive UDF
Hello All,
I second this question. I have an MS SQL rank function which I would
like to run; the results it gives appear to suggest it is executed mapper
side as opposed to reducer side, even when run with cluster by
constraints.
-Justin
On Thu, Apr 19, 2012 at 1:21 AM, Ranjan Bagchi ran...@powerreviews.com
wrote:
Hi,
What's the lifecycle of a Hive UDF? If I call
select MyUDF(field1,field2) from table;
is MyUDF then instantiated once per mapper, and within each mapper is
evaluate(field1, field2) called for each row? I hope this is the
case, but I can't find anything about this in the documentation.
So I'd like to have some run-time configuration of my UDF: I'm curious
how people do this. Is there a way I can send it a value or have it access
a file, etc? How about performing a query against the hive store?
Thanks,
Ranjan
--
jqcof...@gmail.com