Hi,

Pig (a Hadoop subproject) may be the best option for this kind of problem. I suggest you take a look at it.
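To make the expected result concrete, here is a minimal plain-Java sketch of an inner join on id over the two sample files from the question quoted below. It is only an illustration: it does an in-memory hash join, not Hadoop's map-side join and not Pig, and the names InnerJoinSketch, groupById and innerJoin are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinSketch {

    // Groups "id<tab>value" lines by id, keeping ids in first-seen order.
    static Map<String, List<String>> groupById(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip lines without a tab separator
            }
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return grouped;
    }

    // Inner join on id: one record per (matrix, vector) pair sharing an id.
    // Ids present in only one input produce no output.
    static List<String> innerJoin(List<String> matrices, List<String> vectors) {
        Map<String, List<String>> left = groupById(matrices);
        Map<String, List<String>> right = groupById(vectors);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : left.entrySet()) {
            List<String> matches = right.get(entry.getKey());
            if (matches == null) {
                continue;
            }
            for (String matrix : entry.getValue()) {
                for (String vector : matches) {
                    out.add("key=" + entry.getKey()
                            + " tuple=(" + matrix + "," + vector + ")");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample data taken from the question: one matrix per id, many vectors.
        List<String> matrices = Arrays.asList("id\tmatrix", "id2\tanothermatrix");
        List<String> vectors = Arrays.asList("id\tvector1", "id\tvector2", "id2\tvector3");
        for (String record : innerJoin(matrices, vectors)) {
            System.out.println(record);
        }
    }
}
```

On the sample data this pairs the single matrix for each id with every vector for that id, giving three records, which is the result described in the question.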
--nitesh

On Sun, Apr 5, 2009 at 11:32 PM, jason hadoop <jason.had...@gmail.com> wrote:

> Alpha chapters are available, and 8 should be available in the alphas as
> soon as draft one gets back from technical review.
>
> On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup
> <soett...@nbi.dk> wrote:
>
>> jason hadoop wrote:
>>
>>> This is discussed in chapter 8 of my book.
>>>
>> What book? Is it out?
>>
>>> In short,
>>> If both data sets are:
>>>
>>> - in same key order
>>> - partitioned with the same partitioner,
>>> - the input format of each data set is the same, (necessary for this
>>>   simple example only)
>>>
>>> A map side join will present all the key value pairs of each partition,
>>> to a single map task, in key order,
>>> Path dir1 == the directory containing the part-XXXXX files for data set 1
>>> Path dir2 == The directory containing the part-XXXXX files for data set 2
>>> and use CompositeInputFormat.compose to build the join statement
>>>
>>> set the InputFormat to CompositeInputFormat,
>>> conf.setInputFormat(CompositeInputFormat.class);
>>>
>>> String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
>>> conf.set("mapred.join.expr", joinStatement);
>>>
>>> The value class for your map method will be TupleWritable
>>> In the map method,
>>>
>>> - value.has(x) indicates if the Xth ordinal data set has a value for
>>>   this key
>>> - value.get(x) returns the value from the Xth ordinal data set for this
>>>   key
>>> - value.size() returns the number of data sets in the join
>>>
>>> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>>>
>> The partitioner is normally used for the reduce step but here it will be
>> used already at the mapper stage?
>>
>> Basically my files look like:
>> id<tab>matrix
>> id2<tab>anothermatrix
>> and
>> id<tab>vector1
>> id<tab>vector2
>> id2<tab>vector3
>>
>> id is just an integer and there is only one matrix but many vectors tied to
>> the same id.
>> I just want the values from both files that have the same id.
>> Do I need a partitioner in this case? What happens if the file is split
>> into blocks such that two blocks
>> contain entries with the same key?
>>
>> Am I right if what happens is that using the example above the mapper will
>> be called three times with:
>> key=id tuple=(matrix,vector1)
>> key=id tuple=(matrix,vector2)
>> key=id2 tuple=(anothermatrix,vector3)
>>
>> cheers,
>> Christian
>>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
>

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun