Since Pig uses the partitioner to provide a total order (by which I
mean an order across part files), we don't allow users to override the
partitioner in that case. But I think what you want to do would be
achievable if you have a UDF that maps the key to the region server
you want it in and a custom partitioner that partitions based on the
region server id generated by the udf:
...
C = foreach B generate *, key_to_region_mapper(key) as region;
-- PIG-282 syntax: PARTITION BY names the partitioner class
D = group C by region partition by region_partitioner;
E = foreach D {
    E1 = order C by key;
    generate flatten(E1);
}
-- 'tablename' is a placeholder for the target table
store E into 'tablename' using HBaseStorage();
This will group by the region and partition by it (so each reducer can
get one part file to turn into one hfile for hbase) and order the keys
within that region's part file. The ordering will be done as a
secondary sort in MR.
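To make the partitioning step concrete, here is a self-contained sketch of the logic region_partitioner would implement. In an actual Pig 0.8 deployment this computation would live in a class extending org.apache.hadoop.mapreduce.Partitioner<PigNullableWritable, Writable> and be named in the PARTITION BY clause; the class and method names below are illustrative, not from this thread.

```java
// Sketch of the region -> reducer routing a custom partitioner would do.
// Assumes key_to_region_mapper() produces dense integer region ids 0..n-1.
public class RegionPartitionerSketch {

    // With numPartitions equal to the number of regions, this gives a
    // one-to-one mapping, so each reducer receives exactly one region's
    // rows and writes one part file for that region.
    static int getPartition(int regionId, int numPartitions) {
        return regionId % numPartitions;
    }

    public static void main(String[] args) {
        // Four regions, four reducers: a one-to-one mapping.
        for (int region = 0; region < 4; region++) {
            System.out.println("region " + region
                    + " -> reducer " + getPartition(region, 4));
        }
    }
}
```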
The only issue I see here is that Pig isn't smart enough to realize that you don't need to pull the entire bag into memory in order to flatten it. Ideally it would realize this and just stream from the reduce iterator to the collect, but it won't. It will read everything off of the reduce iterator into memory (spilling if there is more than can fit) and then store it all to hbase.
Alan.
On Jan 24, 2011, at 2:06 PM, Dmitriy Lyubimov wrote:
I guess I want to order the groups. The grouping is actually irrelevant in this case; it is only used for the sake of specifying a custom partitioner in the PARTITION BY clause.
I guess what would really solve the problem is a custom partitioner in the ORDER BY, so using GROUP would just be a hack.
On Mon, Jan 24, 2011 at 1:28 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
Do you want to order the groups or just within the groups? If you want to order within the groups you can do that in Pig in a single job.
Alan.
On Jan 24, 2011, at 1:20 PM, Dmitriy Lyubimov wrote:
Thanks.
So I take it there's no way in Pig to specify a custom partitioner and the ordering in one MR step?
I don't think prebuilding HFiles is the best strategy in my case, since my job is incremental (i.e. I am not replacing 100% of the data). However, it is big enough that I don't want to create random writes.
But using a custom partitioner in the GROUP statement, along with PARALLEL and somehow specifying the ordering as well, would probably be ideal.
I wonder if a sequential spec of GROUP and ORDER BY could translate into a single MR job? I guess not, would it?
-d
On Mon, Jan 24, 2011 at 1:12 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
Pushing this logic into the storefunc would force an MR boundary before the store (unless the StoreFunc passed, I suppose), which can make things overly complex.
I think for the purposes of bulk-loading into HBase, a better approach might be to use the native map-reduce functionality and feed the results you want to store into a map-reduce job created as per
http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
(the bulk loading section).
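Roughly, the native bulk-load job configuration looks like the sketch below. This is a sketch only: the table and job names are made up, and configureIncrementalLoad may not be present in 0.20.6 (it appears in later releases), in which case the TotalOrderPartitioner wiring described in the linked package summary has to be done by hand; check the javadoc for your HBase version.

```java
// Configuration sketch for an HBase bulk-load MR job (hypothetical names;
// verify the HFileOutputFormat API against your HBase version's javadoc).
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "pig-output-to-hfiles");
        job.setJarByClass(BulkLoadJob.class);
        // The mapper (not shown) reads the tuples Pig stored to HDFS and
        // emits ImmutableBytesWritable row keys with KeyValue cells.
        // configureIncrementalLoad wires in TotalOrderPartitioner and the
        // reducer so each reducer writes one sorted HFile per region.
        HTable table = new HTable(job.getConfiguration(), "tablename");
        HFileOutputFormat.configureIncrementalLoad(job, table);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```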
D
On Mon, Jan 24, 2011 at 11:51 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
Better yet, it would seem logical if partitioning and advice on partition #s were somehow tailored to a storefunc. It would stand to reason that as long as we are not storing to HDFS, the store func is in the best position to determine optimal save parameters such as order, partitioning, and parallelism.
On Mon, Jan 24, 2011 at 11:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
Hi,
So it seems a store to HBase would be more efficient if it partitions by region and orders by HBase keys.
I see that Pig 0.8 (PIG-282) added a custom partitioner in GROUP, but I am not sure if order is enforced there.
Is there a way to run a single MR job that orders and partitions data as per the above and uses an explicitly specified store func in the reducers?
Thank you.