Let me ask the question differently.  Let's say I was not using Pig and
wanted to do this using plain Java MapReduce.  The input file is HUGE.  One
obvious way to do this would be to write 3 different MR jobs, but that
means this huge file would be read 3 times, which is what I am trying to
avoid.

Is there a way to write a Mapper that will read this file only once, and
then write to 3 different Reducers with different keys?
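
For concreteness, here is a rough sketch of the kind of thing I have in
mind: a single job whose Mapper tags every record once per grouping, and
whose Reducer strips the tag and demultiplexes each grouping into its own
output with MultipleOutputs.  The log layout and the three groupings
(user, url, status) are made up just for the example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class MultiGroupJob {

  // Reads each line exactly once and emits one tag-prefixed key per
  // grouping.  The tag keeps the three groupings apart in the shuffle
  // even though they share one set of reducers.
  public static class TagMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");     // made-up layout
      if (f.length < 3) return;                     // skip malformed lines
      outKey.set("user:" + f[0]);   ctx.write(outKey, ONE);
      outKey.set("url:" + f[1]);    ctx.write(outKey, ONE);
      outKey.set("status:" + f[2]); ctx.write(outKey, ONE);
    }
  }

  // One reducer class handles all three groupings: strip the tag,
  // sum the counts, and write to the named output matching the tag.
  public static class DemuxReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    private MultipleOutputs<Text, LongWritable> mos;

    @Override
    protected void setup(Context ctx) {
      mos = new MultipleOutputs<Text, LongWritable>(ctx);
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) {
        sum += v.get();
      }
      String s = key.toString();
      int sep = s.indexOf(':');
      String tag = s.substring(0, sep);          // "user", "url" or "status"
      Text groupKey = new Text(s.substring(sep + 1));
      mos.write(tag, groupKey, new LongWritable(sum));  // tag = named output
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      mos.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "one-scan-three-groupings");
    job.setJarByClass(MultiGroupJob.class);
    job.setMapperClass(TagMapper.class);
    job.setCombinerClass(LongSumReducer.class);  // pre-aggregate map output
    job.setReducerClass(DemuxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    // One named output per grouping (names must be alphanumeric), plus
    // LazyOutputFormat so no empty part-r-* files are created for the
    // default output, which this job never writes to.
    for (String name : new String[] {"user", "url", "status"}) {
      MultipleOutputs.addNamedOutput(job, name,
          TextOutputFormat.class, Text.class, LongWritable.class);
    }
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the default HashPartitioner the tagged keys spread across whatever
reducers there are, but each grouping still lands in its own files
(user-r-00000, url-r-00000, and so on) after a single scan of the input.
Is that roughly the right approach?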

Going back to Pig, when I LOAD this file & then later GROUP it by 3
different keys, how does Pig do this?  Does it "LOAD" this input file into
some interim file & then run 3 different MapReduce jobs?

If this makes no sense, please ignore it.  I will try to use EXPLAIN and
DESCRIBE to learn the internals.  Thanks.


On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> If you want to know more about the internals, I'd check out the paper Yahoo
> put out on the topic (or, of course, buy the book Programming Pig).
>
> The answer to this is pretty simple: if you load a file multiple times into
> different relations, then it will be scanned multiple times. So...
>
> a = load 'thing';
> b = load 'thing';
>
> {..stuff using a..}
> {..stuff using b..}
>
> would load 'thing' twice. This is done for joins and whatnot -- there are
> cases when you need to load the same file separately, twice. What happens
> is essentially that you're going to load and scan the data twice.
>
> However, as in your case, if you instead combine the load, then you'd have
>
> a = load 'thing';
> {..stuff using a..}
> {..stuff using a (which previously used b)..}
>
> Now it will scan 'thing' just once, and then feed each of the pipelines
> you defined.
>
> Obviously it's more complex than that, but that's the general gist.
>
> 2011/10/3 Something Something <mailinglist...@gmail.com>
>
> > I have 3 Pig scripts that load data from the same log file, but filter &
> > group this data differently.  If I combine these 3 into one & LOAD only
> > once, performance seems to have improved, but now I am curious: exactly
> > what does LOAD do?
> >
> > How does LOAD work internally?  Does Pig save results of the LOAD into
> > some separate location in HDFS?  Can someone please explain how LOAD
> > relates to MapReduce?  Thanks.
> >
>
