Dmitriy -- my requirements have changed slightly in this particular
instance: I now need to order by several columns, so I think I have
to use an inner order-by rather than TOP.
Thankfully the bags are small.
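
A rough Python sketch (not Pig) of the trade-off, for illustration only.
The rows, the column layout and all function names here are made up; it
only shows that "best row by several columns" can be found either by
sorting the whole bag or by a selection that never sorts it, and that the
selection can be combined from partial results, which is roughly what
"algebraic" means for TOP.

```python
import heapq

# Hypothetical rows for one group: (albumid, score, timestamp).
bag = [
    (101, 7, 1311200000),
    (102, 9, 1311100000),
    (103, 9, 1311300000),
    (104, 3, 1311400000),
]

def sort_key(row):
    # Compare by score first, then by timestamp as a tie-breaker.
    _albumid, score, timestamp = row
    return (score, timestamp)

def top_by_full_sort(rows, n=1):
    # Analogue of an inner "order ... desc" followed by "limit n":
    # the whole bag is sorted before the first n rows are taken.
    return sorted(rows, key=sort_key, reverse=True)[:n]

def top_by_selection(rows, n=1):
    # Analogue of a TOP-style selection: only n candidates are kept
    # at any time, so no full sort of the bag is needed.
    return heapq.nlargest(n, rows, key=sort_key)

def top_of_partials(chunks, n=1):
    # Partial results combine: take the top n of each chunk,
    # then the top n of those partial winners.
    partial = [row for chunk in chunks for row in top_by_selection(chunk, n)]
    return top_by_selection(partial, n)
```

With small bags the full sort costs little, so the more readable
order-by form is a reasonable choice; the difference matters mainly
for large bags.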

Daniel -- I'm working on extracting a small test case that
demonstrates it, as this was deep within quite a big script.

I've hit another weird snag in the process of doing this -- I'll start
another thread as I think it's a distinct problem.

On 22 July 2011 13:56, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> On the subject of TOP -- the reason you would use it instead of an inner
> order + limit is that it's much more efficient for large bags.
> It is algebraic, so the computation can be well optimized. On top of that,
> it does not require a full sort of the bag.
>
> -D
>
> On Thu, Jul 21, 2011 at 9:41 PM, Daniel Dai <da...@hortonworks.com> wrote:
>
>> The syntax looks legal. Can you do an explain?
>>
>> Daniel
>>
>> On Thu, Jul 21, 2011 at 5:15 AM, Andrew Clegg <andrew.clegg+mah...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I have some code that looks like this:
>> >
>> > top_hits = foreach regrouped {
>> >    result = TOP(1, 6, projected_joined_albums); -- field 6 = score
>> >    generate flatten(result);
>> > };
>> >
>> > I'm not too keen on the TOP syntax because it's opaque and you need
>> > the comment there to explain what's going on.
>> >
>> > I've seen the same thing achieved like so, in a more transparent way,
>> > and in fact I've used this in other cases myself:
>> >
>> > top_hits = foreach regrouped {
>> >    sorted = order projected_joined_albums by score desc;
>> >    result = limit sorted 1;
>> >    generate flatten(result);
>> > };
>> >
>> > However, although the first form works for me, the second dies with
>> > the following error:
>> >
>> > java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
>> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
>> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
>> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:291)
>> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
>> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
>> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:433)
>> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
>> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
>> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:251)
>> >        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>> > (etc.)
>> >
>> > Is there a reason why it would fail in this case? I can't
>> > understand the meaning of the error; it'd be nice if it reported
>> > *which* Tuple was failing a cast.
>> >
>> > regrouped has the following schema:
>> >
>> > {group: (artistid: int,country: int,week:
>> > chararray),projected_joined_albums:
>> > {joined_albums_2::joined_albums_1::flattened_albums::key: (artistid:
>> > int,country: int,week:
>> > chararray),joined_albums_2::joined_albums_1::flattened_albums::timestamp:
>> > long,joined_albums_2::joined_albums_1::flattened_albums::albumid:
>> > int,track_counts::numtracks: long,joined_albums_2::reach::reach:
>> > int,joined_albums_2::joined_albums_1::album_titles::title_len:
>> > long,score: long}}
>> >
>> > That's a bit complex so I extracted the individual fields with a
>> > foreach .. generate beforehand:
>> >
>> > {group: (artistid: int,country: int,week:
>> > chararray),projected_joined_albums: {key: (artistid: int,country:
>> > int,week: chararray),timestamp: long,albumid: int,numtracks:
>> > long,reach: int,title_len: long,score: long}}
>> >
>> > It didn't affect the error, though.
>> >
>> > Thanks for any suggestions,
>> >
>> > Andrew.
>> >
>> > --
>> >
>> > http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>> >
>>
>



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
