Pig can handle multiple group-bys on the same input, often in a single MR
job.  For example:

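A = load 'file';
-- first group-by: count records per value of the first field
B = group A by $0;
C = foreach B generate group, COUNT(A);
store C into 'output1';
-- second group-by on the same input: count records per value of the second field
D = group A by $1;
E = foreach D generate group, COUNT(A);
-- store the aggregated relation E, not the raw grouping D
store E into 'output2';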

This can be done in a single MR job.  Is that what you're looking for?
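
If the goal is instead to emit several keyed records for each input row and
group them all in one pass, something like the following might work.  This is
only a sketch: the two-column schema, the field names f0 and f1, the 'k0'/'k1'
tags, and the output path are assumptions made for illustration.

```pig
-- Assumed schema: two chararray columns (hypothetical names f0 and f1).
A = load 'file' as (f0:chararray, f1:chararray);
-- Each input row is read once and turned into one record per key:
-- TOBAG collects the (tag, value) tuples, FLATTEN un-nests them into rows.
B = foreach A generate FLATTEN(TOBAG(TOTUPLE('k0', f0), TOTUPLE('k1', f1))) as (keytype, key);
-- A single group-by over the tagged keys covers both groupings.
C = group B by (keytype, key);
D = foreach C generate FLATTEN(group), COUNT(B);
store D into 'output';
```

Running EXPLAIN on D should show how many MR jobs Pig plans for the script.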

Alan.

On Oct 15, 2013, at 12:12 PM, ey-chih chow wrote:

> What I really want to know is, in Pig, how can I read an input data set only
> once, generate multiple instances with distinct keys for each data point,
> and do a group-by?
> 
> Best regards,
> 
> Ey-Chih Chow
> 
> 
> On Tue, Oct 15, 2013 at 10:16 AM, Pradeep Gollakota
> <pradeep...@gmail.com> wrote:
> 
>> I'm not aware of any way to do that. I think you're also missing the spirit
>> of Pig. Pig is meant to be a data workflow language. Describe a workflow
>> for your data using PigLatin and Pig will then compile your script to
>> MapReduce jobs. The number of MapReduce jobs that it generates is the
>> smallest number of jobs (based on the optimizers) that Pig thinks it needs
>> to complete the workflow.
>> 
>> Why do you want to control the number of MR jobs?
>> 
>> 
>> On Tue, Oct 15, 2013 at 10:07 AM, ey-chih chow <eyc...@gmail.com> wrote:
>> 
>>> Thanks everybody.  Is there any way we can programmatically control the
>>> number of M-R jobs that a Pig script will generate, similar to writing M-R
>>> jobs in Java?
>>> 
>>> Best regards,
>>> 
>>> Ey-Chih Chow
>>> 
>>> 
>>> On Tue, Oct 15, 2013 at 6:14 AM, Shahab Yunus <shahab.yu...@gmail.com>
>>> wrote:
>>> 
>>>> And Geert's comment about using an external-to-Pig approach reminds me
>>>> that you also have Netflix's PigLipstick. It is a nice visual tool for
>>>> the actual execution, and it stores job history as well.
>>>> 
>>>> Regards,
>>>> Shahab
>>>> 
>>>> 
>>>> On Tue, Oct 15, 2013 at 8:51 AM, Geert Van Landeghem
>>>> <g...@foundation.be> wrote:
>>>> 
>>>>> You can also use ambrose to monitor execution of your pig script at
>>>>> runtime. Remark: from pig-0.11 on.
>>>>> 
>>>>> It shows you the DAG of MR jobs and which ones are currently being
>>>>> executed. As long as pig-ambrose is connected to the execution of your
>>>>> script (workflow), you can replay the workflow.
>>>>> 
>>>>> --
>>>>> kind regards,
>>>>> Geert
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 15 Oct 2013, at 14:43, Shahab Yunus <shahab.yu...@gmail.com> wrote:
>>>>> 
>>>>>> Have you tried using the ILLUSTRATE and EXPLAIN commands? As far as I
>>>>>> know, they don't give you the exact number, as it depends on the actual
>>>>>> data, but I believe you can interpret or extrapolate it from the
>>>>>> information provided by these commands.
>>>>>> 
>>>>>> Regards,
>>>>>> Shahab
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 15, 2013 at 3:57 AM, ey-chih chow <eyc...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I have a Pig script that has two group-by statements on the input
>>>>>>> data set.  Does anybody know how many M-R jobs the script will
>>>>>>> generate?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> 
>>>>>>> Ey-Chih Chow
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 

