Rank(): here's how I did it, for better or worse...

Keith Wiley Tue, 02 Apr 2013 13:54:39 -0700

I agree, it's probably best to use a better engineered approach such as 
Edward's.  In the meantime, if anyone would benefit from a walk-through of my 
direct approach, here it is.  It combines Ritesh's direct ultra-simplistic 
method with Edward's correct HiveQL syntax.  As would be expected, it is 
sensitive to shell vagaries that would be better managed by a combo system like 
git and maven...but it works.


========================================

Create Rank.java:
-----
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
 
public final class Rank extends UDF{
    private int counter;
    private String last_key;
    public int evaluate(final String key){
      if ( !key.equalsIgnoreCase(this.last_key) ) {
         this.counter = 0;
         this.last_key = key;
      }
      return this.counter++;
    }
}

========================================

Compile Rank.java to Rank.class.  Then bundle to Rank.jar.  Observe that the 
jar command is highly sensitive to the relative path to the .class file when 
naming the package inside the resulting .jar file:
-----
$ mkdir ./RankTempDir
$ javac -classpath 
$HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar
 -d ./RankTempDir Rank.java
$ cd RankTempDir;
$ jar -cf ../Rank.jar ./com
$ cd ..

You will have to verify (via echo) HADOOP_HOME and HIVE_HOME and then will have 
to verify the name of the serde, exec, and core files in each directory.  The 
exact filenames are probably version specific.

Verify the package path in Rank.jar:
-----
$ jar -tvf Rank.jar

You should see 'com/example/hive/udf/Rank.class'.  If you see a different path, 
the package has not been properly represented in the jar w.r.t. its designation 
in the .java file.

========================================

Run hive and prepare the session to use the UDF:

$ hive
hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

You must either run hive from a directory containing Rank.jar or specify an 
alternate path to it in the "add" command.  Note that the Rank class's full 
package is specified in the "create" command and therefore must, logically, 
match the package in both the .java and the .jar files.

========================================

Consider a table named 'test' consisting of columns 'user', 'category', and 
'value', containing the following data:

hive> select * from test;
user1   catA    1
user1   catB    11
user1   catC    111
user2   catA    222
user2   catB    22
user2   catC    2
user3   catA    3
user3   catB    5
user3   catC    4

So the top category for user1 is catC, for user2 is catA and for user3 is catB. 
 Say we want the top N valued categories for each user.  In the example below, 
N is 2 (it is indicated in the final WHERE clause).  Here is the format of the 
corresponding ranked query, and its result:

hive> SELECT user, category, value, ranked_col
FROM (
    SELECT user, category, value, rank(user) ranked_col
        FROM (
            SELECT user, category, value
            FROM test
        DISTRIBUTE BY user
        SORT BY user, value desc
    ) a
) b
WHERE ranked_col < 2
ORDER BY user, ranked_col;
...
[wait for Hive query and MapReduce job(s) to finish]
...
user1   catC    111     0
user1   catB    11      1
user2   catA    222     0
user2   catB    22      1
user3   catB    5       0
user3   catC    4       1

Note that ranks are 0-indexed (of course I suppose that's a property of the 
specific .java we wrote above, and therefore is easily amenable to 1-indexing, 
which would more canonically connote the notion of a "rank").

Anyway, that's what I came up with.  I don't by any means claim it's the best 
approach.  Edward is surely right that the best method would be to use the 
powerful tools made available by the large developer community such as git and 
maven.

Cheers!

________________________________________________________________________________
Keith Wiley     [email protected]     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
________________________________________________________________________________

Rank(): here's how I did it, for better or worse...

Reply via email to