Assume that I would like to write this Pig script:
REGISTER myudfs.jar;
A = LOAD 'hist_data' AS (id: chararray, word: chararray, count: float);
B = GROUP A BY id;
C = CROSS B, B;
D = FOREACH C GENERATE $0, $2, myudfs.HIST($1, $3);
F = ORDER D BY $2 DESC;
DUMP F;
I take (id, histogram) pairs and would like to perform an all-to-all
comparison.
The CROSS operation is overkill because my measure myudfs.HIST($1,$3) is
symmetric (so I could cut the number of comparisons in half), but it will do.
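To make the "half the comparisons" remark concrete: with a symmetric measure only the pairs with i < j are needed, which is n(n-1)/2 pairs instead of the n*n rows that CROSS produces. A minimal plain-Java sketch of that enumeration follows; the class name UnorderedPairs is hypothetical and nothing here uses Pig.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: enumerates each unordered pair of indices exactly once. */
final class UnorderedPairs {

    /** Returns every index pair (i, j) with 0 <= i < j < n. */
    static List<int[]> of(int n) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Starting j at i + 1 skips both (i, i) and the mirrored pair (j, i).
            for (int j = i + 1; j < n; j++) {
                pairs.add(new int[] {i, j});
            }
        }
        return pairs;
    }
}
```

For 4 ids this yields 6 pairs rather than 16. In the script itself the analogous step would presumably be a filter on the two id columns after the CROSS, but I have not tried that.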
My real question is:
Where can I find a template for writing this myudfs.HIST($1,$3) UDF?
For example:
package myudfs;
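import java.io.IOException;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.util.WrappedIOException;
import Anomaly; // provides dist(...), used below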
public class HIST extends EvalFunc<?> implements Algebraic   // (?) which type parameter?
{
    public String getInitial() {return Initial.class.getName();}
    public String getIntermed() {return Intermed.class.getName();}
    public String getFinal() {return Final.class.getName();}

    // count(input) and sum(input) are not defined in this class yet.
    static public class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(count(input));
        }
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(sum(input));
        }
    }
    static public class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException {return sum(input);}
    }

    public Float exec(?) throws IOException {   // (?) which parameter list?
        if (input == null || input.size() == 0)
            return null;
        try {
            String[] words1 = new String[];   // need to retrieve the words from $1 in the script above
            double[] counts1 = new double[];  // need to retrieve the counts from $1 in the script above
            String[] words2 = new String[];   // from $3
            double[] counts2 = new double[];  // from $3
            return Anomaly.dist(words1, counts1, words2, counts2);
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
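For reference, the part marked "need to retrieve" boils down to turning each bag of (word, count) rows into two parallel arrays and passing both histograms to the distance. Below is a minimal plain-Java sketch of that step, written without any Pig classes so that it runs on its own. HistSketch, Row and the L1-style dist are hypothetical stand-ins: the real measure is Anomaly.dist, which is not shown here, and inside a real UDF the rows would presumably come from iterating the bags passed in the input tuple rather than from a List.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java stand-in for the body of HIST.exec; no Pig classes are used. */
final class HistSketch {

    /** One (word, count) row, playing the role of a tuple inside a grouped bag. */
    static final class Row {
        final String word;
        final double count;
        Row(String word, double count) { this.word = word; this.count = count; }
    }

    /** Copies the words of a bag-like list into an array, preserving order. */
    static String[] words(List<Row> bag) {
        String[] out = new String[bag.size()];
        for (int i = 0; i < out.length; i++) out[i] = bag.get(i).word;
        return out;
    }

    /** Copies the counts of a bag-like list into an array, in the same order as words(). */
    static double[] counts(List<Row> bag) {
        double[] out = new double[bag.size()];
        for (int i = 0; i < out.length; i++) out[i] = bag.get(i).count;
        return out;
    }

    /** Assumed placeholder for Anomaly.dist: sum of absolute count differences over all words. */
    static double dist(String[] w1, double[] c1, String[] w2, double[] c2) {
        Map<String, Double> h1 = toMap(w1, c1);
        Map<String, Double> h2 = toMap(w2, c2);
        Set<String> all = new HashSet<>(h1.keySet());
        all.addAll(h2.keySet());
        double sum = 0.0;
        for (String w : all) {
            // A word missing from one histogram counts as 0 there.
            sum += Math.abs(h1.getOrDefault(w, 0.0) - h2.getOrDefault(w, 0.0));
        }
        return sum;
    }

    /** Builds a word -> count map, adding up counts if a word repeats. */
    private static Map<String, Double> toMap(String[] w, double[] c) {
        Map<String, Double> m = new HashMap<>();
        for (int i = 0; i < w.length; i++) m.merge(w[i], c[i], Double::sum);
        return m;
    }
}
```

The placeholder distance is symmetric, like the measure described above, so swapping the two histograms gives the same value.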