I guess I want to order the groups. The grouping is actually irrelevant in this case; it is only used for the sake of specifying a custom partitioner in the PARTITION BY clause.
I guess what would really solve the problem is a custom partitioner in
ORDER BY, so using GROUP would just be a hack.

On Mon, Jan 24, 2011 at 1:28 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Do you want to order the groups or just within the groups? If you want
> to order within the groups you can do that in Pig in a single job.
>
> Alan.
>
> On Jan 24, 2011, at 1:20 PM, Dmitriy Lyubimov wrote:
>
>> Thanks.
>>
>> So I take it there's no way in Pig to specify a custom partitioner and
>> the ordering in one MR step?
>>
>> I don't think prebuilding HFiles is the best strategy in my case, because
>> my job is incremental (i.e. I am not replacing 100% of the data).
>> However, it is big enough that I don't want to create random writes.
>>
>> But using a custom partitioner in the GROUP statement along with
>> PARALLEL, and somehow specifying ordering as well, would probably be
>> ideal.
>>
>> I wonder if a sequential spec of GROUP and ORDER BY could translate
>> into a single MR job? I guess not, would it?
>>
>> -d
>>
>> On Mon, Jan 24, 2011 at 1:12 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>>
>>> Pushing this logic into the storefunc would force an MR boundary
>>> before the store (unless the StoreFunc passed, I suppose) which can
>>> make things overly complex.
>>>
>>> I think for the purposes of bulk-loading into HBase, a better approach
>>> might be to use the native map-reduce functionality and feed results
>>> you want to store into a map-reduce job created as per
>>>
>>> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
>>> (the bulk loading section).
>>>
>>> D
>>>
>>> On Mon, Jan 24, 2011 at 11:51 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>
>>>> Better yet, it would seem logical if partitioning and advice on
>>>> partition #s were somehow tailored to a storefunc. It would stand to
>>>> reason that for as long as we are not storing to HDFS, the store func
>>>> is in the best position to determine optimal save parameters such as
>>>> order, partitioning and parallelism.
>>>>
>>>> On Mon, Jan 24, 2011 at 11:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> So it seems to be more efficient if storing to HBase partitions by
>>>>> regions and orders by HBase keys.
>>>>>
>>>>> I see that Pig 0.8 (PIG-282) added a custom partitioner in a group,
>>>>> but I am not sure if order is enforced there.
>>>>>
>>>>> Is there a way to run a single MR job that orders and partitions
>>>>> data as per above and uses an explicitly specified store func in
>>>>> the reducers?
>>>>>
>>>>> Thank you.
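
For illustration only, the behaviour asked for in the original message (partition rows by HBase region, then order by row key inside each partition) can be sketched in plain Python. This is not Pig or Hadoop code: the function names, the region start keys and the sample rows are all made up, and it assumes the first region starts at the empty key so every row key falls into some region.

```python
from bisect import bisect_right


def region_for(row_key, region_start_keys):
    """Return the index of the region whose key range contains row_key.

    Assumes region_start_keys is sorted and begins with the empty key "",
    so that every row key belongs to exactly one region.
    """
    return bisect_right(region_start_keys, row_key) - 1


def partition_and_sort(rows, region_start_keys):
    """Group (key, value) rows by region, then sort each group by key.

    This mimics what a single MR step would need to do: the partitioner
    picks the reducer by region, and each reducer sees its keys in order.
    """
    partitions = [[] for _ in region_start_keys]
    for key, value in rows:
        partitions[region_for(key, region_start_keys)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions


# Hypothetical region boundaries and rows, for illustration only.
region_start_keys = ["", "g", "p"]
rows = [("q1", "a"), ("b2", "b"), ("h7", "c"), ("a9", "d"), ("g0", "e")]
partitions = partition_and_sort(rows, region_start_keys)
```

Each inner list then corresponds to one reducer writing sequentially into one region, which is the property that avoids random writes.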