What I understand is that you are looking up a value based on a key, so I guess you should look at a key-value datastore (like Voldemort). But then again, accessing the datastore for each key in the 2nd MR job would be a costly operation, which might require additional tuning of the datastore.
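If the first job writes its output as MapFiles on HDFS, the second job's mappers can also look values up by key directly, without scanning output1 or calling out to an external store. A minimal sketch of that idea follows; it is not code from this thread, and the class name, the "job1.output.dir" property, the tab-separated record layout and the Text key/value types are all assumptions. With more than one reducer in job 1 you would also need MapFileOutputFormat's helpers to pick the right partition.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for the second job: joins each File1 record against job 1's
// MapFile output by doing a keyed lookup instead of a full scan.
public class File1Mapper extends Mapper<Object, Text, Text, Text> {

  private MapFile.Reader reader;            // index over job 1's output
  private final Text lookupValue = new Text();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // "job1.output.dir" is a hypothetical property pointing at one MapFile
    // directory produced by the first job (e.g. .../part-r-00000).
    reader = new MapFile.Reader(fs, conf.get("job1.output.dir"), conf);
  }

  @Override
  protected void map(Object offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumes the join key is the first tab-separated field of File1.
    Text key = new Text(line.toString().split("\t")[0]);

    // Random access by key -- no loop over job 1's whole output.
    if (reader.get(key, lookupValue) != null) {
      context.write(key, new Text(line + "\t" + lookupValue));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (reader != null) {
      reader.close();
    }
  }
}

Each get() is still an HDFS seek per input record, which is the same per-key cost concern as with an external datastore, so this mainly pays off when the lookups are much cheaper than rescanning output1.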
PS - I am not sure if this is a good practice.

On Thu, Apr 5, 2012 at 11:37 AM, Stuti Awasthi <stutiawas...@hcl.com> wrote:

> Thanks everyone,
>
> So with this discussion, there are 2 main opinions I got:
>
> 1. Do not call one MR job from inside another MR job.
> 2. Can use the distributed cache (but not good for a very large file).
>
> I want to design the system so that I can do the processing efficiently.
> So I run an MR job to process File2 first and store its data in
> KeyValueFormat in HDFS.
>
> Once this job is complete, I start another MR job to process File1.
> Each input line of File1 will then require some data from the output of
> the first MR job.
>
> 1. The normal way to do this is: for each input line of the 2nd MR job,
> loop through the contents of the output of MR job1 and get the relevant
> data for processing.
> 2. Since I have stored the output of File2 in key-value format, can I
> directly get the value for a specific key?
>
> So I want to know: if I have output1 in KeyValueFormat in HDFS, and I run
> a separate job with a different input file that needs to access data from
> output1 on the basis of keys, can we attain that without looping over
> output1?
>
> Thanks
>
> From: Praveen Kumar K J V S [mailto:praveenkjvs.develo...@gmail.com]
> Sent: Wednesday, April 04, 2012 6:43 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Calling one MR job within another MR job
>
> Dear Stuti,
>
> From the mail chain I understand you want to do a set join on two sets,
> File1 and File2, with some join function F(F1, F2). On this assumption,
> please find my reply below.
>
> A set join is not simple, especially if the input is very large. It
> essentially takes a Cartesian product of the two sets F1 and F2 and
> filters out the required data based on some function F(F1, F2).
>
> What I mean is: say you have two files, each with 10 lakh lines. To
> perform a set join you essentially do 10 lakh x 10 lakh comparisons, and
> the filter phase works on all of those pairs to pick out the required ones.
>
> Since the cost of such a problem grows quadratically with the input size,
> it is helpful to know how the set-join function works; having such insight
> helps.
>
> Though I have to admit that these kinds of problems are still under
> active research; please refer to the links below for more detail:
>
> 1. http://www.youtube.com/watch?v=kiuUGXWRzPA - Google tech talks
> 2. http://www.slideshare.net/rvernica/efficient-parallel-setsimilarity-joins-using-mapreduce-sigmod-2010-slides
> 3. http://research.microsoft.com/apps/pubs/default.aspx?id=76165
> 4. http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
>
> @Distributed cache: it is not great if you have huge files. By default
> the cache has a size limit of 10GB for distributed files.
>
> @Launching jobs inside a mapper: not a great idea, because for every
> key-value pair you will launch a job, so you will end up launching a very
> large number of jobs. Absolutely no. A bug in production can bring down
> the cluster. It is also difficult to track all these jobs.
>
> Thanks,
> Praveen
>
> On Wed, Apr 4, 2012 at 6:17 PM, <jagatsi...@gmail.com> wrote:
>
> Hello Stuti
>
> The way you have explained it, it seems we can think about caching File2
> on the nodes beforehand.
>
> -- Just out of context: replicated joins are handled the same way in Pig,
> where one of the files to be joined (File2) is held in memory while File1
> streams through.
>
> Regards
>
> Jagat
>
> ----- Original Message -----
> From: Stuti Awasthi
> Sent: 04/04/12 07:55 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Calling one MR job within another MR job
>
> Hi Ravi,
>
> There is no job dependency, so I cannot use chained MR or JobControl as
> you suggested.
>
> I have 2 relatively big files. I start processing with File1 as input to
> the MR1 job; this processing then needs to find data in File2. One way to
> do it is to loop through File2 and get the data. The other way is to pass
> File2 to an MR2 job for parallel processing.
>
> The second option is hinting at calling an MR2 job from inside the MR1
> job. I am sure this is a common problem that people face. What is the best
> way to resolve this kind of issue?
>
> Thanks
>
> From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
> Sent: Wednesday, April 04, 2012 4:35 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Calling one MR job within another MR job
>
> Hi Stuti,
>
> If you are looking for MRjob2 to run after MRjob1, i.e. a job dependency,
> you can use the JobControl API, where you can manage the dependencies.
>
> Calling another job from a Mapper is not a good idea.
>
> Thanks,
> Ravi Teja
>
> ------------------------------
> From: Stuti Awasthi [stutiawas...@hcl.com]
> Sent: 04 April 2012 16:04:19
> To: mapreduce-user@hadoop.apache.org
> Subject: Calling one MR job within another MR job
>
> Hi all,
>
> We have a use case in which I start with a first MR1 job with input file
> File1.txt, and from this job call another MR2 job with input File2.txt.
>
> So:
>
> MRjob1 {
>     Map() {
>         MRJob2(File2.txt)
>     }
> }
>
> MRJob2 {
>     Processing...
> }
>
> My queries are: is this kind of approach possible, and how big are the
> implications from a performance perspective?
>
> Regards,
>
> Stuti Awasthi
> HCL Comnet Systems and Services Ltd
> F-8/9 Basement, Sec-3, Noida.
--
Ashwanth Kumar / ashwanthkumar.in
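For the dependency between the two jobs (the File2 job must finish before the File1 job starts), the JobControl API that Ravi Teja mentions in the quoted thread can express that ordering from a single driver. This is only a rough sketch using the old org.apache.hadoop.mapred API: the class name is made up, and each JobConf still needs its mapper, reducer and input/output paths configured.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    JobConf file2Conf = new JobConf();   // set mapper/reducer/paths for the File2 job
    JobConf file1Conf = new JobConf();   // set mapper/reducer/paths for the File1 job

    Job processFile2 = new Job(file2Conf);
    Job processFile1 = new Job(file1Conf);
    processFile1.addDependingJob(processFile2);   // File1 job waits for the File2 job

    JobControl control = new JobControl("file2-then-file1");
    control.addJob(processFile2);
    control.addJob(processFile1);

    // Run the controller in a background thread and poll until both jobs finish.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();

    System.exit(control.getFailedJobs().isEmpty() ? 0 : 1);
  }
}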