I agree, it's probably best to use a better engineered approach such as
Edward's. In the meantime, if anyone would benefit from a walk-through of my
direct approach, here it is. It combines Ritesh's direct ultra-simplistic
method with Edward's correct HiveQL syntax. As would be expected, it is
sensitive to shell vagaries that would be better managed by a combo system like
git and maven...but it works.
========================================
Create Rank.java:
-----
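package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    private int counter;
    private String last_key;

    // Returns the 0-based position of a row within a run of equal keys.
    // Rows must arrive grouped and sorted for this to act as a rank.
    public int evaluate(final String key) {
        // Null-safe comparison: a NULL key would otherwise throw.
        final boolean sameKey = (key == null)
            ? this.last_key == null
            : key.equalsIgnoreCase(this.last_key);
        if (!sameKey) {
            this.counter = 0;
            this.last_key = key;
        }
        return this.counter++;
    }
}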
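The counter only yields a meaningful rank when rows with the same key reach evaluate() back to back. As a rough illustration, here is a stand-alone sketch of the same counter logic with the Hive dependency removed, so it runs under plain java. The class name RankSketch and the helper rankAll are hypothetical, made up for this example; they are not part of Hive.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone copy of the counter logic in Rank.evaluate(), with no Hive
// dependency. RankSketch and rankAll are hypothetical names used only here.
final class RankSketch {
    private int counter;
    private String lastKey;

    // Restart at 0 whenever the key differs from the previous row's key.
    int evaluate(String key) {
        boolean sameKey = (key == null)
            ? lastKey == null
            : key.equalsIgnoreCase(lastKey);
        if (!sameKey) {
            counter = 0;
            lastKey = key;
        }
        return counter++;
    }

    // Feeds the keys through a fresh instance, in the order given.
    static List<Integer> rankAll(String... keys) {
        RankSketch r = new RankSketch();
        List<Integer> out = new ArrayList<>();
        for (String k : keys) {
            out.add(r.evaluate(k));
        }
        return out;
    }

    public static void main(String[] args) {
        // Keys grouped together: the counter climbs within each group.
        System.out.println(rankAll("user1", "user1", "user1", "user2", "user2"));
        // Keys interleaved: the counter resets on every row.
        System.out.println(rankAll("user1", "user2", "user1", "user2", "user1"));
    }
}
```

Run as written, this should print [0, 1, 2, 0, 1] for the grouped keys and [0, 0, 0, 0, 0] for the interleaved ones. That is why the query further down distributes and sorts by user before calling rank().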
========================================
Compile Rank.java to Rank.class. Then bundle to Rank.jar. Observe that the
jar command is highly sensitive to the relative path to the .class file when
naming the package inside the resulting .jar file:
-----
$ mkdir ./RankTempDir
$ javac -classpath \
    $HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar \
    -d ./RankTempDir Rank.java
$ cd RankTempDir;
$ jar -cf ../Rank.jar ./com
$ cd ..
-----
Before compiling, check (via echo) that HADOOP_HOME and HIVE_HOME are set, then
confirm the names of the serde, exec, and core jars in each directory. The
exact filenames depend on the installed versions.
Verify the package path in Rank.jar:
-----
$ jar -tvf Rank.jar
You should see 'com/example/hive/udf/Rank.class'. If you see a different path,
the package has not been properly represented in the jar w.r.t. its designation
in the .java file.
========================================
Run hive and prepare the session to use the UDF:
$ hive
hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';
You must either run hive from a directory containing Rank.jar or specify an
alternate path to it in the "add" command. Note that the Rank class's full
package is specified in the "create" command and therefore must, logically,
match the package in both the .java and the .jar files.
========================================
Consider a table named 'test' consisting of columns 'user', 'category', and
'value', containing the following data:
hive> select * from test;
user1 catA 1
user1 catB 11
user1 catC 111
user2 catA 222
user2 catB 22
user2 catC 2
user3 catA 3
user3 catB 5
user3 catC 4
So the top category for user1 is catC, for user2 is catA and for user3 is catB.
Say we want the top N valued categories for each user. In the example below,
N is 2 (it is indicated in the final WHERE clause). Here is the format of the
corresponding ranked query, and its result:
hive> SELECT user, category, value, ranked_col
      FROM (
          SELECT user, category, value, rank(user) ranked_col
          FROM (
              SELECT user, category, value
              FROM test
              DISTRIBUTE BY user
              SORT BY user, value DESC
          ) a
      ) b
      WHERE ranked_col < 2
      ORDER BY user, ranked_col;
...
[wait for Hive query and MapReduce job(s) to finish]
...
user1 catC 111 0
user1 catB 11 1
user2 catA 222 0
user2 catB 22 1
user3 catB 5 0
user3 catC 4 1
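For anyone who finds the nested query hard to follow, the same steps can be sketched in plain Java: sort by user and descending value, number the rows within each user, and keep those numbered below N. This is only an in-memory imitation of what the query computes, not how Hive executes it. TopNSketch, Row, sampleRows, and topN are hypothetical names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory imitation of the ranked query; all names here are hypothetical.
final class TopNSketch {
    // One row of the example 'test' table.
    static final class Row {
        final String user;
        final String category;
        final int value;

        Row(String user, String category, int value) {
            this.user = user;
            this.category = category;
            this.value = value;
        }
    }

    // The nine rows shown in the 'select * from test' output.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("user1", "catA", 1), new Row("user1", "catB", 11),
            new Row("user1", "catC", 111), new Row("user2", "catA", 222),
            new Row("user2", "catB", 22), new Row("user2", "catC", 2),
            new Row("user3", "catA", 3), new Row("user3", "catB", 5),
            new Row("user3", "catC", 4));
    }

    static List<String> topN(List<Row> rows, int n) {
        // Stands in for: DISTRIBUTE BY user SORT BY user, value DESC
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> {
            int byUser = a.user.compareTo(b.user);
            return byUser != 0 ? byUser : Integer.compare(b.value, a.value);
        });

        List<String> out = new ArrayList<>();
        String lastUser = null;
        int counter = 0;
        for (Row r : sorted) {
            // Stands in for: rank(user), restarting at 0 for each new user
            if (!r.user.equalsIgnoreCase(lastUser)) {
                counter = 0;
                lastUser = r.user;
            }
            int rank = counter++;
            // Stands in for: WHERE ranked_col < n
            if (rank < n) {
                out.add(r.user + " " + r.category + " " + r.value + " " + rank);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : topN(sampleRows(), 2)) {
            System.out.println(line);
        }
    }
}
```

With N set to 2 this should print the same six rows as the Hive output above.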
Note that the ranks are 0-indexed. That is a property of the Rank.java written
above, not of Hive, and it is easily changed to 1-indexing, which would match
the usual notion of a "rank" more closely.
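If 1-indexed ranks are preferred, the only change needed is to increment before returning rather than after. A minimal stand-alone sketch of that variant follows, again without the Hive dependency so it can be tried in isolation; OneBasedRankSketch is a hypothetical name. In the real UDF, the equivalent change would be replacing "return this.counter++;" with "return ++this.counter;".

```java
// 1-indexed variant of the counter logic; the class name is hypothetical.
final class OneBasedRankSketch {
    private int counter;
    private String lastKey;

    int evaluate(String key) {
        boolean sameKey = (key == null)
            ? lastKey == null
            : key.equalsIgnoreCase(lastKey);
        if (!sameKey) {
            counter = 0;
            lastKey = key;
        }
        // Pre-increment: the first row of each key gets 1 instead of 0.
        return ++counter;
    }

    public static void main(String[] args) {
        OneBasedRankSketch r = new OneBasedRankSketch();
        System.out.println(r.evaluate("user1")); // first row for user1
        System.out.println(r.evaluate("user1")); // second row for user1
        System.out.println(r.evaluate("user2")); // first row for user2
    }
}
```

With this variant the final filter in the query would need to become WHERE ranked_col <= 2 to keep the same two rows per user.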
Anyway, that's what I came up with. I don't by any means claim it's the best
approach. Edward is surely right that the best method would be to use the
powerful tools made available by the large developer community such as git and
maven.
Cheers!
________________________________________________________________________________
Keith Wiley keithwiley.com music.keithwiley.com
"And what if we picked the wrong religion? Every week, we're just making God
madder and madder!"
-- Homer Simpson
________________________________________________________________________________