Dmitriy -- my requirements have changed slightly in this particular instance: I now need to order by several columns, so I think that means I have to use a nested order-by rather than TOP. Thankfully the bags are small.
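For reference, a nested multi-column sort would look something like this (a sketch only, using the aliases from the script quoted below; the timestamp tiebreaker is an illustrative second key, not necessarily the real one):

    top_hits = foreach regrouped {
        -- sort on score first, breaking ties on timestamp (illustrative second key)
        sorted = order projected_joined_albums by score desc, timestamp desc;
        result = limit sorted 1;
        generate flatten(result);
    };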
Daniel -- I'm working on extracting a small test case that demonstrates it, as this was deep within quite a big script. I've hit another weird snag in the process of doing this -- I'll start another thread, as I think it's a distinct problem.

On 22 July 2011 13:56, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> On the subject of TOP -- the reason you would use it instead of an inner
> order + limit is that it's much more efficient for large bags.
> It is algebraic, so the computation can be well optimized. On top of that,
> it does not require a full sort of the bag.
>
> -D
>
> On Thu, Jul 21, 2011 at 9:41 PM, Daniel Dai <da...@hortonworks.com> wrote:
>
>> The syntax looks legal. Can you do an explain?
>>
>> Daniel
>>
>> On Thu, Jul 21, 2011 at 5:15 AM, Andrew Clegg <andrew.clegg+mah...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have some code that looks like this:
>>>
>>> top_hits = foreach regrouped {
>>>     result = TOP(1, 6, projected_joined_albums); -- field 6 = score
>>>     generate flatten(result);
>>> };
>>>
>>> I'm not too keen on the TOP syntax, because it's opaque and you need
>>> the comment there to explain what's going on.
>>> I've seen the same thing achieved like so, in a more transparent way,
>>> and in fact I've used this in other cases myself:
>>>
>>> top_hits = foreach regrouped {
>>>     sorted = order projected_joined_albums by score desc;
>>>     result = limit sorted 1;
>>>     generate flatten(result);
>>> };
>>>
>>> However, although the first form works for me, the second dies with
>>> the following error:
>>>
>>> java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
>>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
>>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
>>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:291)
>>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
>>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:433)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:251)
>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>> (etc.)
>>>
>>> Is there a reason why it would fail in this case?
>>> I can't work out the meaning of the error -- it would be nice if it
>>> reported *which* Tuple was failing the cast.
>>>
>>> regrouped has the following schema:
>>>
>>> {group: (artistid: int, country: int, week: chararray),
>>>  projected_joined_albums: {
>>>    joined_albums_2::joined_albums_1::flattened_albums::key: (artistid: int, country: int, week: chararray),
>>>    joined_albums_2::joined_albums_1::flattened_albums::timestamp: long,
>>>    joined_albums_2::joined_albums_1::flattened_albums::albumid: int,
>>>    track_counts::numtracks: long,
>>>    joined_albums_2::reach::reach: int,
>>>    joined_albums_2::joined_albums_1::album_titles::title_len: long,
>>>    score: long}}
>>>
>>> That's a bit complex, so I extracted the individual fields with a
>>> foreach .. generate beforehand:
>>>
>>> {group: (artistid: int, country: int, week: chararray),
>>>  projected_joined_albums: {
>>>    key: (artistid: int, country: int, week: chararray),
>>>    timestamp: long, albumid: int, numtracks: long,
>>>    reach: int, title_len: long, score: long}}
>>>
>>> It didn't affect the error, though.
>>>
>>> Thanks for any suggestions,
>>>
>>> Andrew.
>>>
>>> --
>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg