Paolo had already suggested that we add an EXECUTE command for exactly this purpose in interactive mode.

Alan.

Utkarsh Srivastava wrote:
Yes, I agree; not introducing new syntax is much preferable.
Doing this optimization automatically for batch mode is a good idea.
For interactive mode, we would need something like a COMMIT
statement, which would force execution (rather than execution
starting automatically on a STORE command, as it currently does).
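To make this concrete, a grunt session under that model might look like the following (COMMIT here is purely hypothetical; the name, the input paths, and the exact semantics are just a sketch):

```pig
grunt> A = LOAD 'input' AS (x, y);
grunt> B = FILTER A BY x > 0;
grunt> C = FILTER A BY x <= 0;
grunt> STORE B INTO 'out_b';  -- queued, not executed yet
grunt> STORE C INTO 'out_c';  -- queued, not executed yet
grunt> COMMIT;                -- hypothetical: plan and execute both STOREs in one scan of A
```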

As regards failure, we could start with our current model: one failure
fails everything.

Utkarsh

-----Original Message-----
From: Olga Natkovich [mailto:[EMAIL PROTECTED]
Sent: Monday, May 19, 2008 11:23 AM
To: [email protected]
Subject: RE: How Grouping works for multiple groups

Utkarsh,

I agree that this issue has been brought up a number of times and
needs to be addressed. I think it would be nice if we could address it
without introducing new syntax for STORE. In batch mode, this would be
quite easy, since we can build an execution plan for the entire script
rather than one STORE at a time. I realize that for the interactive and
embedded cases it is a bit trickier. We also need to clarify the
semantics of this kind of operation in the presence of failure: if one
STORE fails, what happens to the rest of the computation?

Olga

-----Original Message-----
From: Utkarsh Srivastava [mailto:[EMAIL PROTECTED]
Sent: Monday, May 19, 2008 11:06 AM
To: [email protected]
Subject: FW: How Grouping works for multiple groups

Following is an email that showed up on the user list. I am
sure most people have seen it.

The guy wants to scan the data once and do multiple things
with it. This kind of need arises often, but we don't have a
very good answer to it.

We have SPLIT, but that is only half the solution (and
probably not a very good one).

What is needed is more like a multi-store command (I think
someone has proposed it on one of these lists before).

So you would be able to do things like

A = LOAD ...
B = FILTER A BY ..
C = FILTER A BY ..
-- do something with B
-- do something else with C
STORE B, C   <===== the new multi-store command


Sawzall does better than we do in this regard because it has
collectors to which you can output data, and you can set up
as many collectors as you want.

Utkarsh

-----Original Message-----
From: Goel, Ankur [mailto:[EMAIL PROTECTED]
Sent: Monday, May 19, 2008 1:24 AM
To: [EMAIL PROTECTED]
Cc: Holsman, Ian
Subject: How Grouping works for multiple groups

Hi folks,
             I am new to Pig, having a little bit of Hadoop
Map-Reduce experience. I recently had a chance to use Pig for
a data-analysis task for which I had earlier written a
Map-Reduce program.
A few questions came up that I thought would be better
asked in this forum. Here's a brief description of my
analysis task, to give you an idea of what I am doing.

- For each tuple, I need to classify the data into 3 groups: A, B, C.

- For groups A and B, I need to aggregate the number of distinct
  items in each group and have them sorted in reverse order in the
  output.

- For group C, I only need to output the distinct items.

- The output for each of these goes to its respective output file,
  e.g. A_file.txt, B_file.txt.
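A rough sketch of how this task reads in current Pig Latin, assuming a simple (item, class) schema (the file names, field names, and classification conditions are made up for illustration). Each GROUP/ORDER/STORE chain below becomes its own Map-Reduce pipeline over the same input, which is exactly the cost being discussed:

```pig
A = LOAD 'data.txt' AS (item, class);  -- schema is an assumption
SPLIT A INTO GA IF class == 'A', GB IF class == 'B', GC IF class == 'C';

-- groups A and B: counts per distinct item, sorted in reverse order
BA = GROUP GA BY item;
CA = FOREACH BA GENERATE group AS item, COUNT(GA) AS cnt;
SA = ORDER CA BY cnt DESC;
STORE SA INTO 'A_file.txt';
-- (the same three statements are repeated for GB into B_file.txt)

-- group C: just the distinct items
PC = FOREACH GC GENERATE item;
DC = DISTINCT PC;
STORE DC INTO 'C_file.txt';
```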


Now, it seems that in Pig's execution plan each GROUP
operation is a separate Map-Reduce job, even though it is
happening on the same set of tuples. Writing a Map-Reduce
job myself, by contrast, allows me to prefix a "group
identifier" of my choice to the key and produce the
relevant value data, which I then use in the combiner and
reducer to perform the other operations and output to
different files.

If my understanding of Pig is correct, its execution plan
spawns multiple Map-Reduce jobs that scan the same data set
again for different groups, which is costlier than writing a
custom Map-Reduce job and packing more work into a single
job the way I mentioned.

I can always reduce the number of groups in my Pig script to
1 by having a user-defined function generate those group
prefixes before a GROUP call, and then doing multiple filters
on the group key using another user-defined function that does
group identification, but this is less than intuitive and
requires more user-defined functions than one would like.
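For illustration, that workaround might look roughly like this (AddGroupPrefix and GroupOf are hypothetical UDFs, and the file and field names are made up):

```pig
REGISTER myudfs.jar;  -- hypothetical jar containing the UDFs
A = LOAD 'data.txt' AS (item);
B = FOREACH A GENERATE myudfs.AddGroupPrefix(item) AS pkey, item;  -- e.g. pkey = 'A#...'
G = GROUP B BY pkey;  -- a single GROUP, hence a single Map-Reduce job
GA = FILTER G BY myudfs.GroupOf(group) == 'A';  -- recover the group id from the key
-- ...similar FILTERs for groups B and C, then per-group processing
```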

My question is: do current optimization techniques take care
of such a scenario? My observation is that they don't, but I
could be wrong here. If they do, how can I peek into the
execution plan to make sure it is not spawning more Map-Reduce
jobs than necessary?

If they don't, then is it something planned for the future?

Also, I don't see the 'Pig Pen' debugging environment anywhere.
Is it still a part of Pig, and if so, how can I use it?

I know this has been a rather long mail, but any help is
deeply appreciated, as going forward we plan to use Pig
heavily to avoid writing custom Map-Reduce jobs for every
different kind of analysis we intend to do.

Thanks and Regards
-Ankur
