Can you describe what your input data looks like and what you want your output data to look like?
I don’t understand your question. A group by is really straightforward to do on a dataset.

A = LOAD 'mydata' USING MyStorage();
B = GROUP A BY group_key;
DUMP B;

Is that what you’re looking for?

On Tue, Oct 15, 2013 at 12:12 PM, ey-chih chow <eyc...@gmail.com> wrote:

> What I really want to know is, in Pig, how can I read an input data set only
> once and generate multiple instances with distinct keys for each data point
> and do a group-by?
>
> Best regards,
>
> Ey-Chih Chow
>
>
> On Tue, Oct 15, 2013 at 10:16 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>
> > I'm not aware of any way to do that. I think you're also missing the spirit
> > of Pig. Pig is meant to be a data workflow language. Describe a workflow
> > for your data using PigLatin and Pig will then compile your script to
> > MapReduce jobs. The number of MapReduce jobs that it generates is the
> > smallest number of jobs (based on the optimizers) that Pig thinks it needs
> > to complete the workflow.
> >
> > Why do you want to control the number of MR jobs?
> >
> >
> > On Tue, Oct 15, 2013 at 10:07 AM, ey-chih chow <eyc...@gmail.com> wrote:
> >
> > > Thanks everybody. Is there any way we can programmatically control the
> > > number of M-R jobs that a Pig script will generate, similar to writing
> > > M-R jobs in Java?
> > >
> > > Best regards,
> > >
> > > Ey-Chih Chow
> > >
> > >
> > > On Tue, Oct 15, 2013 at 6:14 AM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> > >
> > > > And Geert's comment about using an external-to-Pig approach reminds me
> > > > that you have Netflix's PigLipstick too. Nice visual tool for actual
> > > > execution, and it stores job history as well.
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > >
> > > > On Tue, Oct 15, 2013 at 8:51 AM, Geert Van Landeghem <g...@foundation.be> wrote:
> > > >
> > > > > You can also use ambrose to monitor execution of your pig script at
> > > > > runtime. Remark: from pig-0.11 on.
> > > > >
> > > > > It shows you the DAG of MR jobs and which are currently being executed.
> > > > > As long as pig-ambrose is connected to the execution of your script
> > > > > (workflow) you can replay the workflow.
> > > > >
> > > > > --
> > > > > kind regards,
> > > > > Geert
> > > > >
> > > > >
> > > > > On 15 Oct 2013, at 14:43, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> > > > >
> > > > > > Have you tried using the ILLUSTRATE and EXPLAIN commands? As far as I
> > > > > > know, I don't think they give you the exact number as it depends on
> > > > > > the actual data, but I believe you can interpret it/extrapolate it
> > > > > > from the information provided by these commands.
> > > > > >
> > > > > > Regards,
> > > > > > Shahab
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 15, 2013 at 3:57 AM, ey-chih chow <eyc...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > >> I have a Pig script that has two group-by statements on the input
> > > > > >> data set. Does anybody know how many M-R jobs the script will
> > > > > >> generate? Thanks.
> > > > > >>
> > > > > >> Best regards,
> > > > > >>
> > > > > >> Ey-Chih Chow
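
For the original question at the bottom of the thread (two group-by statements over the same input), here is a minimal sketch of what such a script might look like. The input layout, the use of PigStorage, the field names (user, country, val) and the output paths are all assumptions made up for illustration; none of them come from the thread. With Pig's multi-query execution the two group-bys may be combined so the input is only read once, but that depends on the Pig version and the optimizer, so it is worth checking the plan with EXPLAIN as suggested above rather than relying on it.

```pig
-- Hypothetical input: tab-separated records (schema is an assumption).
A = LOAD 'mydata' USING PigStorage('\t')
    AS (user:chararray, country:chararray, val:long);

-- First group-by on the shared input: count records per user.
B = GROUP A BY user;
C = FOREACH B GENERATE group AS user, COUNT(A) AS n;

-- Second group-by on the same input: sum val per country.
D = GROUP A BY country;
E = FOREACH D GENERATE group AS country, SUM(A.val) AS total;

-- Both STOREs live in one script, so Pig can plan them together.
STORE C INTO 'out_by_user';
STORE E INTO 'out_by_country';
```

If the script is saved as, say, twogroups.pig (a hypothetical file name), running `pig -x local -e 'explain -script twogroups.pig'` prints the logical, physical and MapReduce plans for the whole script, and the number of MapReduce jobs Pig intends to launch can be read from the last of these.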