Can you describe what your input data looks like and what you want your
output data to look like?

I don’t understand your question. A group-by is really straightforward to
do on a dataset.

A = LOAD 'mydata' using MyStorage();
B = GROUP A BY group_key;
dump B;

Is that what you’re looking for?
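
If the goal is instead to read the input once, emit several keyed copies of
each record, and then group them, something along these lines might work.
This is only a sketch: the field names (id, key1, key2, value), the
tab-delimited PigStorage loader, and the aggregates are assumptions made up
for illustration, not taken from your data.

```pig
-- Hypothetical schema: each record carries two candidate keys and a value.
A = LOAD 'mydata' USING PigStorage('\t')
    AS (id:chararray, key1:chararray, key2:chararray, value:long);

-- Emit one copy of each record per key by flattening a bag of its keys.
B = FOREACH A GENERATE FLATTEN(TOBAG(key1, key2)) AS group_key, id, value;

-- A single GROUP over the expanded relation, followed by example aggregates.
C = GROUP B BY group_key;
D = FOREACH C GENERATE group AS group_key, COUNT(B) AS n, SUM(B.value) AS total;

DUMP D;
```

Since the FOREACH runs before the shuffle and there is only one GROUP, I
would expect this to compile to a single MapReduce job, but running
EXPLAIN D; should show the actual plan. If values of key1 and key2 can
collide, prefix them (for example with CONCAT) so the keys stay distinct.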


On Tue, Oct 15, 2013 at 12:12 PM, ey-chih chow <eyc...@gmail.com> wrote:

> What I really want to know is, in Pig, how can I read an input data set only
> once and generate multiple instances with distinct keys for each data point
> and do a group-by?
>
> Best regards,
>
> Ey-Chih Chow
>
>
> On Tue, Oct 15, 2013 at 10:16 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>
> > I'm not aware of any way to do that. I think you're also missing the
> > spirit of Pig. Pig is meant to be a data workflow language. Describe a
> > workflow for your data using Pig Latin and Pig will then compile your
> > script to MapReduce jobs. The number of MapReduce jobs that it generates
> > is the smallest number of jobs (based on the optimizers) that Pig thinks
> > it needs to complete the workflow.
> >
> > Why do you want to control the number of MR jobs?
> >
> >
> > On Tue, Oct 15, 2013 at 10:07 AM, ey-chih chow <eyc...@gmail.com> wrote:
> >
> > > Thanks everybody.  Is there any way we can programmatically control the
> > > number of M-R jobs that a Pig script will generate, similar to writing
> > > M-R jobs in Java?
> > >
> > > Best regards,
> > >
> > > Ey-Chih Chow
> > >
> > >
> > > On Tue, Oct 15, 2013 at 6:14 AM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> > >
> > > > Geert's comment about using an external-to-Pig approach reminds me
> > > > that you also have Netflix's PigLipstick. It is a nice visual tool
> > > > for the actual execution and stores job history as well.
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > >
> > > > On Tue, Oct 15, 2013 at 8:51 AM, Geert Van Landeghem <g...@foundation.be> wrote:
> > > >
> > > > > You can also use ambrose to monitor execution of your Pig script at
> > > > > runtime. Remark: from pig-0.11 on.
> > > > >
> > > > > It shows you the DAG of MR jobs and which ones are currently being
> > > > > executed. As long as pig-ambrose is connected to the execution of
> > > > > your script (workflow), you can replay the workflow.
> > > > >
> > > > > --
> > > > > kind regards,
> > > > >  Geert
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 15 Oct 2013, at 14:43, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> > > > >
> > > > > > Have you tried using the ILLUSTRATE and EXPLAIN commands? As far as
> > > > > > I know, they don't give you the exact number, as it depends on the
> > > > > > actual data, but I believe you can infer or extrapolate it from the
> > > > > > information provided by these commands.
> > > > > >
> > > > > > Regards,
> > > > > > Shahab
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 15, 2013 at 3:57 AM, ey-chih chow <eyc...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > > >> I have a Pig script that has two group-by statements on the input
> > > > > > >> data set.  Does anybody know how many M-R jobs the script will
> > > > > > >> generate?
> > > > > > >> Thanks.
> > > > > >>
> > > > > >> Best regards,
> > > > > >>
> > > > > >> Ey-Chih Chow
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
