I guess I want to order the groups. The grouping itself is actually
irrelevant in this case; it is only used so that I can specify a custom
partitioner in the PARTITION BY clause.

I guess what would really solve the problem is a custom partitioner in
ORDER BY, so using GROUP would just be a hack.
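
To make concrete what I mean by "partition by region, order by key", here
is a rough sketch in plain Python (not Pig or Hadoop code). The names
region_for and partition_and_sort and the sample split keys are made up for
illustration, and I use strings where HBase would really have byte[] row
keys. It assumes each region covers [start key, next start key).

```python
from bisect import bisect_right

def region_for(key, split_keys):
    """Return the index of the region that owns `key`.

    `split_keys` is the sorted list of start keys of regions 1..n-1;
    region 0 is assumed to start at the empty key. A key equal to a
    split key belongs to the region that starts there.
    """
    return bisect_right(split_keys, key)

def partition_and_sort(rows, split_keys):
    """Group (key, value) rows by region, then sort each group by key.

    This mimics what a range partitioner plus the reduce-side sort
    would do in a single MR job: one reducer per region, keys in order.
    """
    partitions = [[] for _ in range(len(split_keys) + 1)]
    for key, value in rows:
        partitions[region_for(key, split_keys)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions

if __name__ == "__main__":
    rows = [("m", 1), ("a", 2), ("z", 3), ("g", 4), ("b", 5)]
    for i, part in enumerate(partition_and_sort(rows, ["g", "p"])):
        print(i, part)
```

Because the partitioner is range-based, reading the partitions in order
gives a globally sorted stream, and each reducer only ever writes to one
region. That is the behaviour I would want from a partitioner-aware
ORDER BY.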

On Mon, Jan 24, 2011 at 1:28 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Do you want to order the groups or just within the groups?  If you want to
> order within the groups you can do that in Pig in a single job.
>
> Alan.
>
>
> On Jan 24, 2011, at 1:20 PM, Dmitriy Lyubimov wrote:
>
>> Thanks.
>>
>> So I take it there's no way in Pig to specify a custom partitioner and the
>> ordering in one MR step?
>>
>> I don't think prebuilding HFiles is the best strategy in my case, because
>> my job is incremental (i.e. I am not replacing 100% of the data). However,
>> it is big enough that I don't want to create random writes.
>>
>> Using a custom partitioner in the GROUP statement along with PARALLEL and
>> somehow specifying ordering as well would probably be ideal.
>>
>> I wonder if a sequential spec of GROUP and ORDER BY could translate into a
>> single MR job? I guess not, would it?
>>
>>
>>
>> -d
>>
>> On Mon, Jan 24, 2011 at 1:12 PM, Dmitriy Ryaboy <dvrya...@gmail.com>
>> wrote:
>>
>>> Pushing this logic into the storefunc would force an MR boundary before
>>> the store (unless the StoreFunc passed, I suppose) which can make things
>>> overly complex.
>>>
>>> I think for the purposes of bulk-loading into HBase, a better approach
>>> might be to use the native map-reduce functionality and feed results you
>>> want to store into a map-reduce job created as per
>>>
>>> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
>>>
>>> (the bulk loading section).
>>>
>>> D
>>>
>>> On Mon, Jan 24, 2011 at 11:51 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>>
>>>> Better yet, it would seem logical if partitioning and advice on
>>>> partition #s were somehow tailored to a storefunc. It would stand to
>>>> reason that for as long as we are not storing to HDFS, the store func is
>>>> in the best position to determine optimal save parameters such as order,
>>>> partitioning and parallelism.
>>>>
>>>> On Mon, Jan 24, 2011 at 11:47 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> So it seems to be more efficient if storing to HBase partitions by
>>>>> regions and orders by HBase keys.
>>>>>
>>>>> I see that Pig 0.8 (PIG-282) added a custom partitioner in GROUP, but I
>>>>> am not sure if order is enforced there.
>>>>>
>>>>> Is there a way to run a single MR job that orders and partitions data as
>>>>> per above and uses an explicitly specified store func in reducers?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>
>>>
>
