Re: Help debugging an "unexpected problem during optimization"?

Jonathan Coveney Thu, 16 Dec 2010 07:23:47 -0800

While searching past issues, I found this:

https://issues.apache.org/jira/browse/PIG-761

It looks related, but it was patched a while ago...?
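
For reference, here is a rough sketch of the quoted script with the three syntax points from Daniel's reply addressed. It rests on assumptions that are not in the original: that ahh and ooh stand for string constants (quoted here as 'ahh' and 'ooh'), and that the aliases can be renamed freely (grouped_all, counts_all, ids, grouped_ids, counts_ids, sorted, top_ids and joined are made-up names). It only removes the parse-level problems; it is not known whether it avoids ERROR 2086 on 0.7.0.

```
big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);

-- Count occurrences of every (id, field1) pair that passes the first filter.
-- 'ahh' is an assumed string constant standing in for the real value.
filter_1    = FILTER big_database BY (field1 == 'ahh');
relevant    = FOREACH filter_1 GENERATE id, field1;
grouped_all = GROUP relevant BY (id, field1);   -- alias renamed: "group" is a keyword
counts_all  = FOREACH grouped_all GENERATE FLATTEN(group) AS (id, field1),
                                           COUNT(relevant) AS cnt;

-- Find the 1000 most frequent ids under the narrower filter.
-- 'ooh' is likewise an assumed string constant.
filter_2    = FILTER filter_1 BY (field2 == 'ooh');
ids         = FOREACH filter_2 GENERATE id;
grouped_ids = GROUP ids BY id;
counts_ids  = FOREACH grouped_ids GENERATE group AS grp, COUNT(ids) AS cnt;
sorted      = ORDER counts_ids BY cnt DESC;     -- refers to an alias that exists
top_ids     = LIMIT sorted 1000;

-- Keep only the per-(id, field1) counts that belong to the top ids.
joined  = JOIN top_ids BY grp, counts_all BY id;
renamed = FOREACH joined GENERATE top_ids::grp       AS uid,
                                  top_ids::cnt       AS main_count,
                                  counts_all::field1 AS lat_val,
                                  counts_all::cnt    AS lat_val_count;

STORE renamed INTO '/wam/bam';
```

Giving each GROUP and each count its own alias also keeps the two branches distinct at the JOIN, where the original reuses "count" for both.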

2010/12/15 Jonathan Coveney <jcove...@gmail.com>

> I am using 0.7.0...perhaps when I can get 0.8.0 it will fix the issue. The
> script is not the script I run -- I just re-aliased the actual script, so the
> issues you raise, while valid, come from that and are not an issue in the
> actual script.
>
> Logic-wise, however, should it work?
>
> 2010/12/15 Daniel Dai <jiany...@yahoo-inc.com>
>
>> Which version of Pig are you using? I found some syntax errors in your
>> script. Is this the script you actually run?
>>
>> Here are the syntax errors I found:
>> 1. What are ahh and ooh?
>> 2. An alias cannot be "group"; it is a keyword.
>> 3. "sort = ORDER counts BY cnt DESC; ". Do you mean "sort = ORDER count BY
>> cnt DESC;"?
>>
>> Daniel
>>
>>
>> Jonathan Coveney wrote:
>>
>>> I am getting an error I have not seen before and would love some help. I did
>>> a DESCRIBE and it parses fine, but when you actually try and execute, that
>>> is when it blows up.
>>>
>>> Here is the error:
>>>
>>> 2010-12-15 16:25:33,084 [main] WARN
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - Encountered Warning DID_NOT_FIND_LOAD_ONLY_MAP_PLAN 2 time(s).
>>> 2010-12-15 16:25:33,087 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>>> - Choosing to move algebraic foreach to combiner
>>> 2010-12-15 16:25:33,090 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>>> - Choosing to move algebraic foreach to combiner
>>> 2010-12-15 16:25:33,091 [main] WARN
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - Encountered Warning MULTI_LEAF_MAP 1 time(s).
>>> 2010-12-15 16:25:33,098 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 2086: Unexpected problem during optimization. Could not find all
>>> LocalRearrange operators.
>>>
>>> Here is my code (I feel like there should be a better way to do this but
>>> can't think of it within the pig framework)
>>>
>>> big_database = LOAD '/va/voom' AS (field1, field2, id, stuff);
>>>
>>> /*
>>> This cuts down the big database by the first field. It then groups by that
>>> field and the id, and counts how many times that occurs. This is a very
>>> expensive operation...I'd love to avoid this.
>>> */
>>> filter_1 = filter big_database by (field1 == ahh);
>>>
>>> relevant = FOREACH filter_1 GENERATE id, field1;
>>> group = GROUP relevant BY (id,field1);
>>> count = FOREACH group GENERATE FLATTEN(group), COUNT(relevant) as cnt;
>>>
>>> /*
>>> This cuts the big database down by another, much more specific field, and
>>> then finds the id's, associated with that field, that came up the most. The
>>> goal is to be able to see ALL of the occurrences in filter_1 for anything
>>> */
>>>
>>> filter_2 = filter filter_1 by (field2 == ooh);
>>>
>>> get = FOREACH filter_2 GENERATE id;
>>> group = GROUP get BY id;
>>> count = FOREACH group GENERATE group as grp, COUNT(get) as cnt;
>>> sort = ORDER counts BY cnt DESC;
>>> crop = LIMIT sort 1000;
>>>
>>> join_the_two = JOIN crop BY grp, count BY group::id;
>>>
>>> renamed = FOREACH join_the_two GENERATE crop::grp as uid, crop::cnt as
>>> main_count, count::group::field1 as lat_val, count::cnt as lat_val_count;
>>>
>>> STORE renamed INTO '/wam/bam';
>>>
>>>
>>> So really the overall idea is to filter by field1, field2, get the most
>>> common occurrences of the identifier, then see all of the activity in the
>>> original table. The only way I thought to do this was to get all of the
>>> activity indexed by identifier, and then join to it, basically cutting out
>>> the uid's we do not care about.
>>>
>>> I feel like you could do this as a COGROUP as well, but that would be quite
>>> expensive overall, and once you have all of the fields we eventually want to
>>> group over into a bag, I have no idea how we would do sorting on it. I guess
>>> a UDF?
>>>
>>> If my explanation was lacking I can go into more depth or try and post some
>>> dummy data that would go along with it.
>>>
>>> Thanks for your help.
