Let me ask the question differently. Let's say I was not using Pig and wanted to do this just using Java MapReduce. The input file is HUGE. One obvious way to do this would be to write 3 different MR jobs, but that means this huge file would be read 3 times, which is what I am trying to avoid. Is there a way to write a Mapper that will read this file only once and then write to 3 different Reducers with different keys?
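To make that concrete, here is a rough sketch of the kind of Mapper I mean: it reads each line once and emits it under 3 tagged keys, one per grouping. (Everything here is hypothetical: the class name, the tab-separated layout, the field positions, and the \u0001 tag separator are just placeholders for my real log format.)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One scan of the log: each line is emitted 3 times, each copy keyed
    // by a different field and prefixed with a tag naming its grouping.
    public class TaggedKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");          // hypothetical layout
            ctx.write(new Text("byUser\u0001" + f[0]), line);  // grouping 1
            ctx.write(new Text("byUrl\u0001" + f[1]), line);   // grouping 2
            ctx.write(new Text("byStatus\u0001" + f[2]), line); // grouping 3
        }
    }

The 3 "logical" reducers would then all run inside a single reduce phase; since the tag is part of the key, records from different groupings never land in the same reduce() call.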
Going back to Pig: when I LOAD this file & then later 'group by' 3 different keys, how does Pig do this? Does it "LOAD" this input file into some interim file & call 3 different MapReduce jobs? If this makes no sense, please ignore it. I will try to use 'Explain' and 'Describe' to learn the internals. Thanks.

On Mon, Oct 3, 2011 at 6:04 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> If you want to know more about the internals, I'd check out the paper Yahoo
> put out on the topic (or, of course, buy the book Programming Pig).
>
> The answer to this is pretty simple: if you load a file multiple times into
> different relations, then it will be scanned multiple times. So...
>
> a = load 'thing';
> b = load 'thing';
>
> {..stuff using a..}
> {..stuff using b..}
>
> would load 'thing' twice. This is done for joins and whatnot -- there are
> cases when you need to load the same file separately, twice. What happens is
> essentially that you're going to load and scan the data twice.
>
> However, as in your case, if you instead combine the loads, then you'd have
>
> a = load 'thing';
> {..stuff using a..}
> {..stuff using a (which previously used b)..}
>
> Now it will just scan a once, and then go into each of the pipelines you
> defined.
>
> Obviously it's more complex than that, but that's the general gist.
>
> 2011/10/3 Something Something <mailinglist...@gmail.com>
>
> > I have 3 Pig scripts that load data from the same log file, but filter &
> > group this data differently. If I combine these 3 into one & LOAD only
> > once, performance seems to have improved, but now I am curious: what
> > exactly does LOAD do?
> >
> > How does LOAD work internally? Does Pig save the results of the LOAD into
> > some separate location in HDFS? Could someone please explain how LOAD
> > relates to MapReduce? Thanks.
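P.S. Based on Jonathan's explanation above, my understanding is that a combined Pig script achieves roughly what the tagged-key sketch in my first paragraph does: one physical scan, with each pipeline picked out by something like a tag. For completeness, here is the matching (hypothetical) reducer that would dispatch on the tag; the per-group count is just a stand-in for whatever each pipeline actually computes:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // A single reduce phase hosts all 3 groupings; the tag prefix on the
    // key tells us which logical pipeline a group belongs to.
    public class TaggedKeyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text taggedKey, Iterable<Text> lines, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = taggedKey.toString().split("\u0001", 2);
            String tag = parts[0];
            String key = parts[1];
            long count = 0;               // stand-in aggregation; a real job
            for (Text ignored : lines) {  // would switch on `tag` here
                count++;
            }
            ctx.write(new Text(tag + "\t" + key), new Text(Long.toString(count)));
        }
    }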