Re: Add external jars automatically

2013-03-13 Thread Alex Kozlov
If you look into the ${HIVE_HOME}/bin/hive script there are multiple ways to
add the jar.  One of my favorites, besides the .hiverc file, has been to put
the jar into the ${HIVE_HOME}/auxlib dir.  There is always the
HIVE_AUX_JARS_PATH environment variable (but that introduces a dependency
on the environment).
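
For example, a .hiverc picked up by the CLI (typically $HOME/.hiverc or
${HIVE_HOME}/bin/.hiverc) could simply contain the ADD JAR statement so that it
runs at startup; this is only a sketch of that option:

    -- .hiverc: executed by the Hive CLI at startup; dropping the jar into
    -- ${HIVE_HOME}/auxlib instead achieves the same without any statement.
    ADD JAR /usr/lib/hive/lib/hive-json-serde-0.2.jar;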

On Wed, Mar 13, 2013 at 10:26 AM, Krishna Rao krishnanj...@gmail.com wrote:

 Hi all,

 I'm using the Hive JSON SerDe and need to run ADD JAR
 /usr/lib/hive/lib/hive-json-serde-0.2.jar; before I can use tables that
 require it.

 Is it possible to have this jar available automatically?

 I could do it by adding the statement to a .hiverc file, but I was
 wondering if there is some better way...

 Cheers,

 Krishna



Re: Lifecycle and Configuration of a hive UDF

2012-04-20 Thread Alex Kozlov
You might also look at
http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive
for a way to utilize secondary sort for analytic windowing functions.

RANK() OVER(...) will require grouping and sorting.  While both can be done
in the mapper or reducer stage, it is better to utilize Hadoop's shuffle
to accomplish them.  The disadvantage may be that you can compute only one
RANK() per MapReduce job.
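
As a rough sketch of that pattern (the table, columns, and p_rank UDF below are
all hypothetical; p_rank stands for a custom stateful UDF that increments a
counter and resets it whenever its argument changes):

    SELECT category, item, score, p_rank(category) AS rnk
    FROM (
      SELECT category, item, score
      FROM sales
      DISTRIBUTE BY category        -- all rows for a category reach one reducer
      SORT BY category, score DESC  -- rows arrive at that reducer in order
    ) ordered_rows;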

--
Alex K

On Fri, Apr 20, 2012 at 10:54 AM, Philip Tromans philip.j.trom...@gmail.com wrote:

 Have a read of the thread "Lag function in Hive", linked from:

 http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/thread

 There's an example of how to force a function to run reduce-side. I've
 written a UDF which replicates RANK() OVER (...), but it requires the
 syntactic sugar given in the thread. I'd like to make changes to the
 Hive query planner at some point, so that you can annotate a UDF with
 a "run on reducer" hint, and after that I'd happily open source
 everything. If you want more details of how to implement your own
 partitionedRowNumber() UDF then I'd be happy to elaborate.

 Cheers,

 Phil.

 On 20 April 2012 18:35, Mark Grover mgro...@oanda.com wrote:
  Hi Rajan and Justin,
 
  As per my understanding, the scope of a UDF is only one row of data at a
 time. Therefore, it can be evaluated entirely map-side, without the need for
 a reducer to be involved. Now, depending on where you are storing the result
 of the query, your query may have reducers that do something.
 
  A simple query like Rajan mentioned
  select MyUDF(field1,field2) from table;
 
  should have the UDF's execute() called in the map phase.
 
 
  Now to Justin's question: the rank function (
 http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx)
  seems to have a syntax like:
  RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )
 
  The rank function works on a collection of rows (distributed by some
 column - the same one you would use in your partition_by_clause in MS SQL).
  You can accomplish that using a UDAF (read more about them at
 https://cwiki.apache.org/Hive/genericudafcasestudy.html) or by writing a
 custom reducer (read about that at
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).
 
  I don't think rank can be done using a UDF.
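
  As a minimal sketch of the custom-reducer route (the streaming script, table,
 and columns below are illustrative only): the DISTRIBUTE BY / SORT BY subquery
 makes the shuffle do the grouping and ordering, and the script just emits a
 running counter per key:

      ADD FILE rank_reducer.py;                 -- hypothetical streaming script
      SELECT TRANSFORM(category, item, score)
        USING 'python rank_reducer.py'          -- script prints each row plus a
        AS (category, item, score, rnk)         -- per-category running rank
      FROM (
        SELECT category, item, score
        FROM sales
        DISTRIBUTE BY category                  -- group by key via the shuffle
        SORT BY category, score DESC            -- order within each group
      ) src;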
 
  Good luck!
 
  Mark
 
  Mark Grover, Business Intelligence Analyst
  OANDA Corporation
 
  www: oanda.com www: fxtrade.com
 
 
 
  - Original Message -
  From: Justin Coffey jqcof...@gmail.com
  To: user@hive.apache.org
  Sent: Thursday, April 19, 2012 10:29:11 AM
  Subject: Re: Lifecycle and Configuration of a hive UDF
 
  Hello All,
  I second this question. I have an MS SQL-style rank function which I would
 like to run; the results it gives appear to suggest it is executed mapper-side
 as opposed to reducer-side, even when run with CLUSTER BY constraints.
 
 
  -Justin
 
 
  On Thu, Apr 19, 2012 at 1:21 AM, Ranjan Bagchi  ran...@powerreviews.com 
  wrote:
 
 
  Hi,
 
  What's the lifecycle of a Hive UDF? If I call
 
  select MyUDF(field1,field2) from table;
 
  Then MyUDF is instantiated once per mapper, and within each mapper
 execute(field1, field2) is called for each row? I hope this is the
 case, but I can't find anything about this in the documentation.
 
  So I'd like to have some run-time configuration of my UDF: I'm curious
 how people do this. Is there a way I can send it a value or have it access
 a file, etc.? How about performing a query against the Hive store?
 
  Thanks,
 
  Ranjan
 
 
 
 
 
  --
  jqcof...@gmail.com



Re: Hive equivalent of row_number()

2012-04-12 Thread Alex Kozlov
http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive
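
Applied to the question below, a minimal sketch along those lines (the table
name and the p_rank UDF are hypothetical; p_rank stands for a custom counter
UDF that resets whenever A changes, not a built-in):

    SELECT A, B, Score
    FROM (
      SELECT A, B, Score, p_rank(A) AS rnk   -- hypothetical ranking UDF
      FROM (
        SELECT A, B, Score
        FROM affinity                        -- illustrative table name
        DISTRIBUTE BY A                      -- all rows for one A on one reducer
        SORT BY A, Score DESC                -- highest scores first within A
      ) sorted
    ) ranked
    WHERE rnk <= 5;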

--
Alex K
http://www.cloudera.com/company/press-center/hadoop-world-nyc/

On Thu, Apr 12, 2012 at 1:43 PM, Saurabh S saurab...@live.com wrote:


 I have a table with three columns, A, B, and Score, where A and B are some
 items, and Score is some kind of affinity between A and B. There are N
 items each of A and B, so the total number of rows in the
 table is N^2.

 Is there a way to fetch the top 5 items in B for each item in A? So, for
 each distinct item in A, I want to look up the 5 items in B which have the
 highest value in Score.

 If this were to be done in DB2, I would probably use some kind of
 windowing function using row_number().



Re: Hive assert()?

2011-05-26 Thread Alex Kozlov
1) Would `select count(1) from (query)` do the same thing (see the sketch
below)?  I am a bit confused about the semantics of assert here: is it just
"no rows returned" or some kind of syntax error check?
2) Hive is not an OLTP system and is not optimized for single-row inserts (or
updates, for that matter).  In a trivial implementation one would just do
copy-on-write, i.e. overwrite the whole data file, or add a small file
containing one or a few rows.  I do not think indices have been implemented
yet either.
3) You can probably do what you want with HBase (if you explain what you
want to do in more detail).
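
For 1), a minimal sketch of the count-based check (the table name and predicate
are only illustrative):

    -- The inner query selects rows that should never exist; a wrapper script
    -- (or a manual check) then verifies that the single value returned is 0.
    SELECT count(1) AS bad_rows
    FROM (
      SELECT 1
      FROM orders
      WHERE amount < 0          -- the invariant being asserted
    ) t;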


On Thu, May 26, 2011 at 2:00 PM, Igor Tatarinov i...@decide.com wrote:

 I would like to implement some kind of assert functionality in HiveQL.

 Here is how I do it in MySQL. I can assert that a given query returns no
 (bad) rows by creating a table with one row containing '1' and a unique
 index. Then I try to insert `select 1 from (query)` into that table. If the
 query returns something, I have an assert failure. I've found such asserts
 very helpful in catching data bugs early.

 Can I do something similar in Hive? I suppose I could implement a UDF that
 fails if it gets executed. That way I can ensure there are no records
 matching a given query.
 That doesn't sound too bad, but perhaps there is a cleaner solution.

 Thanks!