Thanks, Pi. Yes, I totally agree that this would be optional.

Olga
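[Editorial note] For context on the proposal quoted below, here is a rough sketch of how Olga's optional BEGIN ... EXECUTE {ALL} idea might have looked in a script. This is purely hypothetical: BEGIN, EXECUTE, and EXECUTE ALL were proposed syntax under discussion, not implemented Pig Latin, and the relation names and paths are invented.

```
-- Hypothetical sketch only: BEGIN / EXECUTE / EXECUTE ALL are
-- proposed keywords from the thread below, not real Pig syntax.
BEGIN                          -- optional; assumed at start of script
A = LOAD 'input';
B = FILTER A BY $0 > 0;
C = FILTER A BY $0 <= 0;
STORE B INTO 'b_out';          -- deferred rather than run immediately
STORE C INTO 'c_out';
EXECUTE;                       -- best effort: run all, report failures
-- or: EXECUTE ALL;            -- transactional: abort all on failure
```

Under this scheme both STORE statements could share a single scan of A, which is the multi-store behavior discussed further down the thread.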
> -----Original Message-----
> From: pi song [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 20, 2008 3:33 AM
> To: [email protected]
> Subject: Re: How Grouping works for multiple groups
>
> Conceptually, the more we can capture of what users want to do as a
> whole, the cleverer a query optimizer we can have. It is good if
> users can construct the whole processing graph and execute it all at
> once, but changing STORE from "do it right now" to "do it later"
> seems a bit dodgy to me. Introducing transaction-like syntax is OK,
> but please make it optional, meaning that if we don't use it, things
> work the way they do now. Some people might still want to write just
> a few lines and go!
>
> On the backend side:
>
> 1) The new execution engine design allows us to wire the plan as a
> DAG, but I'm not sure whether it executes by looking at the DAG or by
> just extracting a tree from the DAG.
>
> 2) We already have a disjoint union operator, POPackage, used for
> tagging purposes.
>
> I view this suggestion as "another pattern" for the query optimizer.
> We shouldn't enforce it, but we have to make it "possible to do".
> (There is a common issue in optimization: sometimes different
> techniques just cannot work together!)
>
> Pi
>
> On 5/20/08, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> >
> > I think we should introduce BEGIN ... EXECUTE {ALL}, where:
> >
> > BEGIN can be omitted, and is then assumed to be at the beginning of
> > the script/program/session.
> >
> > EXECUTE would mean "best effort execute": we try to execute
> > everything and let the user know what succeeded and what failed.
> > EXECUTE ALL would mean execute as a transaction, aborting
> > everything on failure.
> >
> > Olga
> >
> > > -----Original Message-----
> > > From: Alan Gates [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, May 19, 2008 11:54 AM
> > > To: [email protected]
> > > Subject: Re: How Grouping works for multiple groups
> > >
> > > Paolo had already suggested that we add an EXECUTE command for
> > > exactly this purpose in interactive mode.
> > > Alan.
> > >
> > > Utkarsh Srivastava wrote:
> > > > Yes, I agree, not introducing new syntax is much preferable.
> > > >
> > > > Doing this optimization automatically for batch mode is a good
> > > > idea. For the interactive mode, we would need something like a
> > > > COMMIT statement, which would force execution (with execution
> > > > not automatically starting on a STORE command as it currently
> > > > does).
> > > >
> > > > As regards failure, we could start with our current model: one
> > > > failure fails everything.
> > > >
> > > > Utkarsh
> > > >
> > > >> -----Original Message-----
> > > >> From: Olga Natkovich [mailto:[EMAIL PROTECTED]
> > > >> Sent: Monday, May 19, 2008 11:23 AM
> > > >> To: [email protected]
> > > >> Subject: RE: How Grouping works for multiple groups
> > > >>
> > > >> Utkarsh,
> > > >>
> > > >> I agree that this issue has been brought up a number of times
> > > >> and needs to be addressed. I think it would be nice if we
> > > >> could address it without introducing new syntax for STORE. In
> > > >> batch mode this would be quite easy, since we can build the
> > > >> execution plan for the entire script rather than one STORE at
> > > >> a time. I realize that the interactive and embedded cases are
> > > >> a bit trickier. We also need to clarify what the semantics of
> > > >> this kind of operation are in the presence of failure: if one
> > > >> store fails, what happens to the rest of the computation?
> > > >>
> > > >> Olga
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Utkarsh Srivastava [mailto:[EMAIL PROTECTED]
> > > >>> Sent: Monday, May 19, 2008 11:06 AM
> > > >>> To: [email protected]
> > > >>> Subject: FW: How Grouping works for multiple groups
> > > >>>
> > > >>> Following is an email that showed up on the user list. I am
> > > >>> sure most people must have seen it.
> > > >>> The guy wants to scan the data once and do multiple things
> > > >>> with it. This kind of need arises often, but we don't have a
> > > >>> very good answer to it.
> > > >>>
> > > >>> We have SPLIT, but that is only half the solution (and
> > > >>> probably not a very good one).
> > > >>>
> > > >>> What is needed is more like a multi-store command (I think
> > > >>> someone has proposed it on one of these lists before).
> > > >>>
> > > >>> So you would be able to do things like:
> > > >>>
> > > >>> A = LOAD ...
> > > >>> B = FILTER A BY ...
> > > >>> C = FILTER A BY ...
> > > >>> -- do something with B
> > > >>> -- do something else with C
> > > >>> STORE B, C    <===== the new multi-store command
> > > >>>
> > > >>> Sawzall does better than us in this regard, because it has
> > > >>> collectors to which you can output data, and you can set up
> > > >>> as many collectors as you want.
> > > >>>
> > > >>> Utkarsh
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Goel, Ankur [mailto:[EMAIL PROTECTED]
> > > >>> Sent: Monday, May 19, 2008 1:24 AM
> > > >>> To: [EMAIL PROTECTED]
> > > >>> Cc: Holsman, Ian
> > > >>> Subject: How Grouping works for multiple groups
> > > >>>
> > > >>> Hi folks,
> > > >>> I am new to Pig, with a little bit of Hadoop map-reduce
> > > >>> experience. I recently had a chance to use Pig for a data
> > > >>> analysis task for which I had written a map-reduce program
> > > >>> earlier. A few questions came up that I thought would be
> > > >>> better asked in this forum. Here's a brief description of my
> > > >>> analysis task to give you an idea of what I am doing:
> > > >>>
> > > >>> - For each tuple, I need to classify the data into 3 groups:
> > > >>>   A, B, C.
> > > >>> - For groups A and B, I need to aggregate the number of
> > > >>>   distinct items in each group and have them sorted in
> > > >>>   reverse order in the output.
> > > >>> - For group C, I only need to output the distinct items.
> > > >>> - The output for each of these goes to its respective output
> > > >>>   file, e.g. A_file.txt, B_file.txt.
> > > >>>
> > > >>> Now, it seems that in Pig's execution plan each GROUP
> > > >>> operation is a separate map-reduce job, even though it is
> > > >>> happening on the same set of tuples. Writing a map-reduce
> > > >>> job for the same task, by contrast, allows me to prefix a
> > > >>> "group identifier" of my choice to the key and produce the
> > > >>> relevant value data, which I then use in the combiner and
> > > >>> reducer to perform the other operations and output to
> > > >>> different files.
> > > >>>
> > > >>> If my understanding of Pig is correct, its execution plan
> > > >>> spawns multiple map-reduce jobs that scan the same data set
> > > >>> again for different groups, which is costlier than writing a
> > > >>> custom map-reduce job and packing more work into a single
> > > >>> job the way I described.
> > > >>>
> > > >>> I can always reduce the number of groups in my Pig scripts
> > > >>> to one by having a user-defined function generate those
> > > >>> group prefixes before a GROUP call, and then do multiple
> > > >>> filters on the group key, again using a user-defined
> > > >>> function that does the group identification, but this is
> > > >>> less than intuitive and requires more user-defined functions
> > > >>> than one would like.
> > > >>>
> > > >>> My question is: do the current optimization techniques take
> > > >>> care of such a scenario? My observation is that they don't,
> > > >>> but I could be wrong here.
> > > >>> If they do, then how can I have a peek at the execution
> > > >>> plan to make sure it is not spawning more map-reduce jobs
> > > >>> than necessary?
> > > >>>
> > > >>> If they don't, then is this something planned for the
> > > >>> future?
> > > >>>
> > > >>> Also, I don't see the Pig Pen debugging environment
> > > >>> anywhere. Is it still a part of Pig, and if yes, how can I
> > > >>> use it?
> > > >>>
> > > >>> I know it has been a rather long mail, but any help here is
> > > >>> deeply appreciated, as going forward we plan to use Pig
> > > >>> heavily to avoid writing custom map-reduce jobs for every
> > > >>> different kind of analysis that we intend to do.
> > > >>>
> > > >>> Thanks and Regards,
> > > >>> -Ankur
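[Editorial note] The key-prefix trick Ankur describes, doing all three groupings in a single map-reduce pass by tagging each key with a group identifier, can be sketched framework-free in plain Python. The record layout, field names, and click-count classification rule below are invented for illustration; a real Hadoop job would put the map and reduce logic in Mapper/Reducer classes instead.

```python
from collections import defaultdict

# Map phase: emit ((group_tag, item), 1) so a single shuffle serves
# all three groupings at once. The thresholds are a made-up rule.
def map_record(record):
    clicks = record["clicks"]
    if clicks > 100:
        yield ("A", record["item"]), 1
    elif clicks > 10:
        yield ("B", record["item"]), 1
    else:
        yield ("C", record["item"]), 1

# Simulated shuffle: group emitted values by composite key.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: the group tag routes each key to its own output,
# standing in for the per-group output files (A_file.txt, etc.).
def run_job(records):
    outputs = {"A": {}, "B": {}, "C": set()}
    pairs = [kv for r in records for kv in map_record(r)]
    for (group, item), values in shuffle(pairs).items():
        if group in ("A", "B"):
            outputs[group][item] = sum(values)  # per-item aggregate
        else:
            outputs["C"].add(item)              # distinct items only
    return outputs

records = [
    {"item": "x", "clicks": 200},
    {"item": "x", "clicks": 150},
    {"item": "y", "clicks": 50},
    {"item": "z", "clicks": 5},
]
result = run_job(records)
# result["A"] -> {"x": 2}; result["B"] -> {"y": 1}; result["C"] -> {"z"}
```

Because every record is mapped exactly once, the input is scanned a single time, which is the saving Ankur contrasts with Pig spawning one map-reduce job per GROUP.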
