Hi All,
I have a use case where I need to get N distinct URLs, chosen by their
number of hits and latest timestamp. Please find below the snippet of the
Pig script that I have written to do this.

prunedUrlData = FOREACH urlPatternData GENERATE
    (url_pattern is null ? url : url_pattern) AS url,
    domid, urlkey, urllen, puid, nwid, lmd, rc, punam, nwnam, ispub,
    com.xxx.GetDomainStorageLimit(nwid) AS domainlimit;

group_by_Domain_Url = GROUP prunedUrlData BY domid;

rankedUrlByDomain = FOREACH group_by_Domain_Url {
    distinct_url = DISTINCT prunedUrlData;
    url_rank_dom = ORDER distinct_url BY lmd DESC, rc DESC;
    url_domain_limit = LIMIT url_rank_dom domainlimit;
    GENERATE FLATTEN(url_domain_limit);
};

The only problem I have now is with the domainlimit field that I'm
passing to the LIMIT statement at runtime. I'm getting the following
exception:

java.lang.RuntimeException: Unable to evaluate Limit expression: NULL
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:97)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:432)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:583)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

If I use a constant for the LIMIT, it works fine. I printed
"group_by_Domain_Url" to check whether domainlimit is populated, and I
can see a value in it.
But when I pass it to LIMIT, it says "Unable to evaluate Limit
expression: NULL". Where am I going wrong?
Regards,
Skanda