Doing a search on past issues, I saw this
https://issues.apache.org/jira/browse/PIG-761
which looks related, but it was patched a while ago...?

2010/12/15 Jonathan Coveney <jcove...@gmail.com>

> I am using 0.7.0...perhaps when I can get 0.8.0 it will fix the issue. The
> script is not the script I run -- I just re-aliased the actual script, so
> the issues you raise, while valid, come from that and are not an issue in
> the actual script.
>
> Logic wise, however, should it work?
>
> 2010/12/15 Daniel Dai <jiany...@yahoo-inc.com>
>
>> Which version of Pig are you using? I find some syntax errors in your
>> script. Is this the script you actually run?
>>
>> Here are the syntax errors I find:
>> 1. What is ahh, ooh?
>> 2. Alias cannot be "group", it is a keyword
>> 3. "sort = ORDER counts BY cnt DESC; ". Do you mean
>>    "sort = ORDER count BY cnt DESC;"?
>>
>> Daniel
>>
>>
>> Jonathan Coveney wrote:
>>
>>> I am getting an error I have not seen before and would love some help.
>>> I did a DESCRIBE and it parses fine, but when you actually try and
>>> execute, that is when it blows up.
>>>
>>> Here is the error:
>>>
>>> 2010-12-15 16:25:33,084 [main] WARN
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - Encountered Warning DID_NOT_FIND_LOAD_ONLY_MAP_PLAN 2 time(s).
>>> 2010-12-15 16:25:33,087 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>>> - Choosing to move algebraic foreach to combiner
>>> 2010-12-15 16:25:33,090 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>>> - Choosing to move algebraic foreach to combiner
>>> 2010-12-15 16:25:33,091 [main] WARN
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - Encountered Warning MULTI_LEAF_MAP 1 time(s).
>>> 2010-12-15 16:25:33,098 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 2086: Unexpected problem during optimization. Could not find all
>>> LocalRearrange operators.
>>>
>>> Here is my code (I feel like there should be a better way to do this
>>> but can't think of it within the pig framework)
>>>
>>> big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);
>>>
>>> /*
>>> This cuts down the big database by the first field. It then groups by
>>> that field and the id, and counts how many times that occurs. This is
>>> a very expensive operation...I'd love to avoid this.
>>> */
>>> filter_1 = filter big_database by (field1 == ahh);
>>>
>>> relevant = FOREACH filter_1 GENERATE id, field1;
>>> group = GROUP relevant BY (id,field1);
>>> count = FOREACH group GENERATE FLATTEN(group), COUNT(relevant) as cnt;
>>>
>>> /*
>>> This cuts the big database down by another, much more specific field,
>>> and then finds the id's, associated with that field, that came up the
>>> most. The goal is to be able to see ALL of the occurrences in filter_1
>>> for anything
>>> */
>>>
>>> filter_2 = filter filter_1 by (field2 == ooh);
>>>
>>> get = FOREACH filter_2 GENERATE id;
>>> group = GROUP get BY id;
>>> count = FOREACH group GENERATE group as grp, COUNT(get) as cnt;
>>> sort = ORDER counts BY cnt DESC;
>>> crop = LIMIT sort 1000;
>>>
>>> join_the_two = JOIN crop BY grp, count BY group::id;
>>>
>>> renamed = FOREACH join_the_two GENERATE crop::grp as uid, crop::cnt as
>>>     main_count, count::group::field1 as lat_val, count::cnt as lat_val_count;
>>>
>>> STORE renamed INTO '/wam/bam';
>>>
>>>
>>> So really the overall idea is to filter by field1, field2, get the most
>>> common occurrences of the identifier, then see all of the activity in
>>> the original table. The only way I thought to do this was to get all of
>>> the activity indexed by identifier, and then join to it, basically
>>> cutting out the uid's we do not care about.
>>>
>>> I feel like you could do this as a COGROUP as well, but that would be
>>> quite expensive overall, and once you have all of the fields we
>>> eventually want to group over into a bag, I have no idea how we would
>>> do sorting on it. I guess a UDF?
>>>
>>> If my explanation was lacking I can go into more depth or try and post
>>> some dummy data that would go along with it.
>>>
>>> Thanks for your help.
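
For reference, here is a sketch of the quoted script with the three points Daniel raised addressed: no alias is named "group", the ORDER refers to an alias that exists, and ahh/ooh are written as quoted string constants. That last part is an assumption, since the thread never says what they stand for. The two counting relations also get distinct names, because the quoted script defines "count" twice. The new alias names (relevant_grp, id_field_counts, ids, ids_grp, id_counts, sorted_counts) are made up for illustration. This is untested, and nothing here is claimed to avoid ERROR 2086.

```pig
-- Sketch only; aliases renamed for illustration, logic unchanged.
-- 'ahh' and 'ooh' are assumed to be string constants (placeholder values).
big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);

-- Count rows per (id, field1) among rows matching the first filter.
filter_1 = FILTER big_database BY (field1 == 'ahh');
relevant = FOREACH filter_1 GENERATE id, field1;
relevant_grp = GROUP relevant BY (id, field1);
id_field_counts = FOREACH relevant_grp GENERATE
    FLATTEN(group) AS (id, field1), COUNT(relevant) AS cnt;

-- Among rows that also match the second filter, keep the 1000 most common ids.
filter_2 = FILTER filter_1 BY (field2 == 'ooh');
ids = FOREACH filter_2 GENERATE id;
ids_grp = GROUP ids BY id;
id_counts = FOREACH ids_grp GENERATE group AS grp, COUNT(ids) AS cnt;
sorted_counts = ORDER id_counts BY cnt DESC;
crop = LIMIT sorted_counts 1000;

-- Join the top ids back to the per-(id, field1) counts.
join_the_two = JOIN crop BY grp, id_field_counts BY id;
renamed = FOREACH join_the_two GENERATE
    crop::grp AS uid, crop::cnt AS main_count,
    id_field_counts::field1 AS lat_val, id_field_counts::cnt AS lat_val_count;

STORE renamed INTO '/wam/bam';
```

Giving FLATTEN(group) an explicit AS (id, field1) schema means the join key can be written as plain id rather than group::id.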