I am getting an error I have not seen before and would love some help. A
DESCRIBE on the script parses fine, but it blows up as soon as I actually
try to execute it.

Here is the error:

2010-12-15 16:25:33,084 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning DID_NOT_FIND_LOAD_ONLY_MAP_PLAN 2 time(s).
2010-12-15 16:25:33,087 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2010-12-15 16:25:33,090 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2010-12-15 16:25:33,091 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning MULTI_LEAF_MAP 1 time(s).
2010-12-15 16:25:33,098 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.

Here is my code (I feel like there should be a better way to do this, but I
can't think of one within the Pig framework):

big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);

/*
This cuts down the big database by the first field. It then groups by that
field and the id, and counts how many times that occurs. This is a very
expensive operation...I'd love to avoid this.
*/
filter_1 = FILTER big_database BY (field1 == 'ahh');

relevant = FOREACH filter_1 GENERATE id, field1;
rel_grouped = GROUP relevant BY (id, field1);
count = FOREACH rel_grouped GENERATE FLATTEN(group), COUNT(relevant) as cnt;

/*
This cuts the big database down by another, much more specific field, and
then finds the ids associated with that field that came up the most. The
goal is to be able to see ALL of the occurrences in filter_1 for any of
those top ids.
*/

filter_2 = FILTER filter_1 BY (field2 == 'ooh');

get = FOREACH filter_2 GENERATE id;
-- Distinct aliases here so the first count relation stays available for the JOIN.
get_grouped = GROUP get BY id;
get_counts = FOREACH get_grouped GENERATE group as grp, COUNT(get) as cnt;
sort = ORDER get_counts BY cnt DESC;
crop = LIMIT sort 1000;

join_the_two = JOIN crop BY grp, count BY group::id;

renamed = FOREACH join_the_two GENERATE
    crop::grp as uid,
    crop::cnt as main_count,
    count::group::field1 as lat_val,
    count::cnt as lat_val_count;

STORE renamed INTO '/wam/bam';


So really the overall idea is to filter by field1 and then field2, get the
most common occurrences of the identifier, and then see all of the activity
for those identifiers in the original table. The only way I thought to do
this was to get all of the activity indexed by identifier, and then join to
it, basically cutting out the uids we do not care about.

I feel like you could do this as a COGROUP as well, but that would be quite
expensive overall, and once you have all of the fields we eventually want to
group over into a bag, I have no idea how we would do sorting on it. I guess
a UDF?
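
For what it's worth, here is roughly how I picture the COGROUP version. This
is an untested sketch, not something I have run: the aliases by_id, sized,
nonzero, sorted, top_ids, flat, regrouped and result are made up for
illustration, and it reuses relevant and get from the script above. Sorting
the cogrouped rows only needs a count column, so this part would not need a
UDF, but the second GROUP means it may well be no cheaper than the JOIN.

```pig
-- Hypothetical sketch: one COGROUP instead of two GROUPs plus a JOIN.
-- Assumes relevant (id, field1) and get (id) are defined as in the script above.
by_id = COGROUP relevant BY id, get BY id;

-- Keep the key, the number of filter_2 hits, and the bag of filter_1 rows.
sized = FOREACH by_id GENERATE group AS uid, COUNT(get) AS main_count, relevant;

-- Drop ids that never matched filter_2, then keep the 1000 most frequent.
nonzero = FILTER sized BY main_count > 0;
sorted = ORDER nonzero BY main_count DESC;
top_ids = LIMIT sorted 1000;

-- Un-nest the filter_1 rows for the surviving ids and count per field1 value.
flat = FOREACH top_ids GENERATE uid, main_count, FLATTEN(relevant.field1) AS lat_val;
regrouped = GROUP flat BY (uid, main_count, lat_val);
result = FOREACH regrouped GENERATE FLATTEN(group), COUNT(flat) AS lat_val_count;
```

The open question for me is whether carrying the relevant bag through the
ORDER and LIMIT costs more than the JOIN it replaces.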

If my explanation was lacking I can go into more depth or try and post some
dummy data that would go along with it.

Thanks for your help.
