I am using 0.7.0; perhaps upgrading to 0.8.0 will fix the issue. The script I
posted is not the exact script I run. I just re-aliased the actual script, so
the issues you raise, while valid, come from the re-aliasing and are not
present in the actual script.

Logic-wise, however, should it work?
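
For reference, the intended logic can be restated with distinct aliases, which
sidesteps the re-aliasing problems. This is only a sketch under assumptions:
'ahh' and 'ooh' are taken to be placeholder string literals, and the alias
names (grp_all, count_all, ids, grp_ids, count_ids, sorted, top_ids, joined)
are hypothetical renames, not the names in the real script. It is not claimed
to avoid ERROR 2086.

```pig
-- Sketch only: 'ahh' and 'ooh' are assumed placeholder string values.
big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);

-- Count occurrences of every (id, field1) pair after the first filter.
filter_1  = FILTER big_database BY field1 == 'ahh';
relevant  = FOREACH filter_1 GENERATE id, field1;
grp_all   = GROUP relevant BY (id, field1);
count_all = FOREACH grp_all GENERATE FLATTEN(group), COUNT(relevant) AS cnt;

-- Find the 1000 most frequent ids under the narrower second filter.
filter_2  = FILTER filter_1 BY field2 == 'ooh';
ids       = FOREACH filter_2 GENERATE id;
grp_ids   = GROUP ids BY id;
count_ids = FOREACH grp_ids GENERATE group AS grp, COUNT(ids) AS cnt;
sorted    = ORDER count_ids BY cnt DESC;
top_ids   = LIMIT sorted 1000;

-- Keep only the per-(id, field1) counts that belong to the top ids.
joined  = JOIN top_ids BY grp, count_all BY group::id;
renamed = FOREACH joined GENERATE top_ids::grp AS uid,
              top_ids::cnt AS main_count,
              count_all::group::field1 AS lat_val,
              count_all::cnt AS lat_val_count;

STORE renamed INTO '/wam/bam';
```

The distinct names matter because the posted version assigns "group" and
"count" twice, so the final JOIN would see only the second definition of each.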

2010/12/15 Daniel Dai <jiany...@yahoo-inc.com>

> Which version of Pig are you using? I found some syntax errors in your
> script. Is this the script you actually run?
>
> Here are the syntax errors I found:
> 1. What is ahh, ooh?
> 2. Alias cannot be "group", it is a keyword
> 3. "sort = ORDER counts BY cnt DESC;". Do you mean
> "sort = ORDER count BY cnt DESC;"?
>
> Daniel
>
>
> Jonathan Coveney wrote:
>
>> I am getting an error I have not seen before and would love some help. I
>> ran a DESCRIBE and the script parses fine, but it blows up when I actually
>> try to execute it.
>>
>> Here is the error:
>>
>> 2010-12-15 16:25:33,084 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning DID_NOT_FIND_LOAD_ONLY_MAP_PLAN 2 time(s).
>> 2010-12-15 16:25:33,087 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
>> 2010-12-15 16:25:33,090 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
>> 2010-12-15 16:25:33,091 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning MULTI_LEAF_MAP 1 time(s).
>> 2010-12-15 16:25:33,098 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.
>>
>> Here is my code (I feel like there should be a better way to do this but
>> can't think of it within the pig framework)
>>
>> big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);
>>
>> /*
>> This cuts down the big database by the first field. It then groups by that
>> field and the id, and counts how many times that occurs. This is a very
>> expensive operation...I'd love to avoid this.
>> */
>> filter_1 = filter big_database by (field1 == ahh);
>>
>> relevant = FOREACH filter_1 GENERATE id, field1;
>> group = GROUP relevant BY (id,field1);
>> count = FOREACH group GENERATE FLATTEN(group), COUNT(relevant) as cnt;
>>
>> /*
>> This cuts the big database down by another, much more specific field, and
>> then finds the ids, associated with that field, that came up the most.
>> The goal is to be able to see ALL of the occurrences in filter_1 for anything
>> */
>>
>> filter_2 = filter filter_1 by (field2 == ooh);
>>
>> get = FOREACH filter_2 GENERATE id;
>> group = GROUP get BY id;
>> count = FOREACH group GENERATE group as grp, COUNT(get) as cnt;
>> sort = ORDER counts BY cnt DESC;
>> crop = LIMIT sort 1000;
>>
>> join_the_two = JOIN crop BY grp, count BY group::id;
>>
>> renamed = FOREACH join_the_two GENERATE crop::grp as uid, crop::cnt as
>> main_count, count::group::field1 as lat_val, count::cnt as lat_val_count;
>>
>> STORE renamed INTO '/wam/bam';
>>
>>
>> So really the overall idea is to filter by field1, field2, get the most
>> common occurrences of the identifier, then see all of the activity in the
>> original table. The only way I thought to do this was to get all of the
>> activity indexed by identifier, and then join to it, basically cutting out
>> the uid's we do not care about.
>>
>> I feel like you could do this as a COGROUP as well, but that would be
>> quite expensive overall, and once you have all of the fields we eventually
>> want to group over into a bag, I have no idea how we would do sorting on
>> it. I guess a UDF?
>>
>> If my explanation was lacking I can go into more depth or try and post
>> some dummy data that would go along with it.
>>
>> Thanks for your help.
>>
>>
>
>
