Re: Lifecycle and Configuration of a hive UDF

2012-04-24 Thread Justin Coffey
Hi Mark,
 Looks great to me!  Thanks for adding it.

-Justin


Re: Lifecycle and Configuration of a hive UDF

2012-04-23 Thread Mark Grover
Added a tiny blurb here: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-UDFinternals
Comments/suggestions welcome!

Thanks for bringing it up, Justin.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


Re: Lifecycle and Configuration of a hive UDF

2012-04-23 Thread Justin Coffey
Hello All,
Thank you very much for the responses.  I can confirm that the lag function
implementation works in my case:

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
from (
  select session_id, hit_datetime_gmt
  from omni2
  where visit_day = '2012-01-12' and session_id is not null
  distribute by session_id
  sort by session_id, hit_datetime_gmt
) X
distribute by session_id
limit 1000;

For the rank it looks like:

create temporary function rank as 'com.example.hadoop.hive.udf.Rank';
select user_id, time, rank(user_id) as rank
from (
  select user_id, time
  from log
  where day = '2012-04-01' and hour = 7
  distribute by user_id
  sort by user_id, time
) X
distribute by user_id
limit 2000;

As mentioned by others, this appears to force the UDF to be executed
reduce-side.  At least, I can't figure out how it works otherwise, because
only one MapReduce job is created (with multiple reducers).
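
To illustrate why the ordering matters, here is a minimal Python sketch of a
stateful lag function. It is not the Lag UDF from the thread (whose source is
not shown here); the class, the run_reducer helper, and the sample rows are
all hypothetical. The assumption is that the function only remembers the
previous row it was handed, so it gives useful answers only when rows of one
session arrive together and in time order.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[str, str]  # (session_id, hit_datetime_gmt)


class Lag:
    """Hypothetical stand-in for a stateful lag function: returns the previous
    value seen for the same key, or None on the first row of a key."""

    def __init__(self) -> None:
        self._last_key: Optional[str] = None
        self._last_value: Optional[str] = None

    def evaluate(self, value: str, key: str) -> Optional[str]:
        # Only the immediately preceding row is remembered.
        previous = self._last_value if key == self._last_key else None
        self._last_key, self._last_value = key, value
        return previous


def run_reducer(rows: Iterable[Row]) -> List[Tuple[str, str, Optional[str]]]:
    """Feed rows to one Lag instance in arrival order, as one reducer would."""
    lag = Lag()
    return [(sid, ts, lag.evaluate(ts, sid)) for sid, ts in rows]


rows = [("s2", "10:05"), ("s1", "10:00"), ("s2", "10:01"), ("s1", "10:03")]

# Emulates "distribute by session_id sort by session_id, hit_datetime_gmt"
# for a single reducer: each session's rows arrive together, in time order.
ordered_result = run_reducer(sorted(rows))

# Without that ordering, consecutive rows belong to different sessions,
# so the function never sees a usable "previous" row.
unordered_result = run_reducer(rows)
```

With the sorted input each session's second hit gets the first hit's timestamp
as its lag value; with the unsorted input every lag value comes back empty.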

As a note to the documentation maintainers, it might be nice to have the
procedural workflow of UDFs, UDTFs, and UDAFs documented in the wiki.  I know
it is logical that an aggregation function happens reducer-side, but I think
there is sufficient complexity in a SQL-to-MapReduce translator that it is
worth the effort to explicitly document it and the other functions (or please
just bludgeon me over the head if I happened to miss it).

Not to be pedantic, but for example, the UDAF case study doc does not even
mention the word "reduce":
https://cwiki.apache.org/Hive/genericudafcasestudy.html

Thanks again for all the pointers!

-Justin


Re: Lifecycle and Configuration of a hive UDF

2012-04-20 Thread Alex Kozlov
You might also look at
http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive
for a way to utilize secondary sort for analytic windowing functions.

RANK() OVER(...) will require grouping and sorting.  While this can be done
in the mapper or reducer stage, it is better to let Hadoop's shuffle handle
both the grouping and the sorting.  The disadvantage may be that you can
compute only one RANK() per MapReduce job.
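
As a rough illustration of what the shuffle contributes, the Python sketch
below simulates partitioning rows by key and sorting each reducer's input by
the composite (user_id, time) key. The function names and the CRC32-based
partitioner are assumptions made for the example; this is not Hadoop's actual
partitioner or sort implementation.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (user_id, time)


def partition(user_id: str, num_reducers: int) -> int:
    """Deterministic stand-in for a hash partitioner (not Hadoop's own hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_reducers


def shuffle(rows: List[Row], num_reducers: int) -> Dict[int, List[Row]]:
    """Route each row by user_id, then sort each reducer's input by
    (user_id, time), which is the effect a secondary sort aims for."""
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        buckets[partition(row[0], num_reducers)].append(row)
    return {reducer: sorted(bucket) for reducer, bucket in buckets.items()}


rows = [("u1", 3), ("u2", 1), ("u1", 1), ("u3", 2), ("u2", 5)]
reducer_inputs = shuffle(rows, num_reducers=2)
```

After the simulated shuffle, every row of a given user sits in exactly one
reducer's input, already in time order, so a ranking step needs only a single
pass over that input.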

--
Alex K



Re: Lifecycle and Configuration of a hive UDF

2012-04-20 Thread Philip Tromans
Have a read of the thread "Lag function in Hive", linked from:

http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/thread

There's an example of how to force a function to run reduce-side. I've
written a UDF which replicates RANK () OVER (...), but it requires the
syntactic sugar given in the thread. I'd like to make changes to the
Hive query planner at some point, so that you can annotate a UDF with
a "run on reducer" hint, and after that I'd happily open source
everything. If you want more details of how to implement your own
partitionedRowNumber() UDF then I'd be happy to elaborate.
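
For readers wondering what such a function might look like, here is a minimal
Python sketch of the idea. It is not Phil's implementation, which was not
posted; the class name and its evaluate method are hypothetical. It assumes
the rows reach the function already grouped by the partition key.

```python
from typing import Optional


class PartitionedRowNumber:
    """Hypothetical sketch: numbers rows 1, 2, 3, ... and restarts at 1
    whenever the partition key differs from the previous row's key."""

    def __init__(self) -> None:
        self._key: Optional[str] = None
        self._count = 0

    def evaluate(self, key: str) -> int:
        if key != self._key:
            # New partition: remember its key and restart the counter.
            self._key = key
            self._count = 0
        self._count += 1
        return self._count


row_number = PartitionedRowNumber()
numbers = [row_number.evaluate(k) for k in ["u1", "u1", "u1", "u2", "u2"]]
# numbers == [1, 2, 3, 1, 2]
```

The counter is only meaningful if the function runs where the grouped rows
are, which is why forcing it reduce-side matters.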

Cheers,

Phil.



Re: Lifecycle and Configuration of a hive UDF

2012-04-20 Thread Mark Grover
Hi Ranjan and Justin,

As I understand it, a UDF operates on only one row of data at a time. 
Therefore, it can run entirely map-side without a reducer being involved. 
Depending on where you are storing the result of the query, your query may 
still have reducers that do something.

A simple query like the one Ranjan mentioned
select MyUDF(field1,field2) from table; 

should have the UDF's evaluate() method (execute() in Ranjan's question) 
called in the map phase.


Now to Justin's question: the rank function 
(http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx)
seems to have a syntax like:
RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )

The rank function works on a collection of rows (distributed by some column, 
the same one you would use in your partition_by_clause in MS SQL).
You can accomplish that using a UDAF (read more about them at 
https://cwiki.apache.org/Hive/genericudafcasestudy.html) or by writing a custom 
reducer (read about that at 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).
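
As a sketch of the custom-reducer route, the Python function below appends a
RANK()-style column to rows supplied as tab-separated text, which is how Hive's
TRANSFORM hands rows to a script by default. The function name and the
two-column layout (user_id, time) are assumptions for illustration, and the
input is assumed to arrive grouped by user_id and sorted by time.

```python
from typing import Iterable, Iterator


def rank_stream(lines: Iterable[str]) -> Iterator[str]:
    """Append a RANK()-style column to "user_id<TAB>time" lines.

    Rows that tie on time share a rank, and the next distinct time skips
    ahead, as SQL RANK() does.
    """
    current_user = None
    previous_time = None
    seen = 0  # rows seen so far for current_user
    rank = 0  # rank assigned to the most recent row
    for line in lines:
        user_id, time = line.rstrip("\n").split("\t")
        if user_id != current_user:
            # First row of a new user: restart the counters.
            current_user, previous_time, seen, rank = user_id, None, 0, 0
        seen += 1
        if time != previous_time:
            rank = seen
            previous_time = time
        yield f"{user_id}\t{time}\t{rank}"


sample = ["u1\t09:00\n", "u1\t09:00\n", "u1\t09:05\n", "u2\t08:00\n"]
ranked = list(rank_stream(sample))
```

In an actual streaming script the same function could be fed sys.stdin and
each yielded line printed; the query would still need distribute by and sort
by so that the script sees its rows grouped and ordered.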

I don't think rank can be done using a UDF.

Good luck!

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 




Re: Lifecycle and Configuration of a hive UDF

2012-04-19 Thread Justin Coffey
Hello All,
   I second this question.  I have an MS SQL "rank" function which I would
like to run; the results it gives appear to suggest it is executed mapper-side
as opposed to reducer-side, even when run with "cluster by" constraints.

-Justin



-- 
jqcof...@gmail.com
-


Lifecycle and Configuration of a hive UDF

2012-04-18 Thread Ranjan Bagchi
Hi,

What's the lifecycle of a Hive UDF?  If I call 

select MyUDF(field1,field2) from table;

is MyUDF then instantiated once per mapper, and within each mapper is 
execute(field1, field2) called for each reducer?  I hope this is the case, 
but I can't find anything about this in the documentation.

I'd also like to have some run-time configuration of my UDF, and I'm curious 
how people do this.  Is there a way I can send it a value or have it access a 
file, etc.?  How about performing a query against the Hive store?

Thanks,

Ranjan