The join package does a streaming merge sort between each part-X file in your input directories: part-0000 will be handled in a single task, part-0001 will be handled in a single task, and so on. These jobs are essentially I/O bound, and hard to beat for performance.
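
For reference, here is a minimal sketch of how such a map-side join is typically wired up with the old org.apache.hadoop.mapred API (circa 0.19/0.20). The class name JoinJob, the input paths "fileA"/"fileB", and the output path "joined" are placeholders; both inputs must already be sorted by key and partitioned identically (same number of part files) for CompositeInputFormat to work:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class JoinJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JoinJob.class);
    conf.setJobName("fileA-fileB-join");

    // The composite format reads both inputs directly in the map tasks;
    // each pair of matching part-NNNN files is merged by one task.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("fileA"), new Path("fileB")));

    // The map input value is a TupleWritable holding one value from each
    // source; a custom mapper could compare the two values here instead
    // of writing the raw cross product out to HDFS.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);
    conf.setNumReduceTasks(0);  // map-only join
    FileOutputFormat.setOutputPath(conf, new Path("joined"));

    JobClient.runJob(conf);
  }
}

To do the per-tuple comparison you describe rather than writing the cross product, you would plug your own Mapper<Text, TupleWritable, ...> into the job above in place of the default identity mapper.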
On Wed, Jun 24, 2009 at 2:09 PM, pmg <parmod.me...@gmail.com> wrote:
>
> I have two files, FileA (with 600K records) and FileB (with 2 million
> records).
>
> FileA has a key which is the same for all the records:
>
> 123 724101722493
> 123 781676672721
>
> FileB has the same key as FileA:
>
> 123 5026328101569
> 123 5026328001562
>
> Using the hadoop join package I can create an output file with tuples,
> the cross product of FileA and FileB:
>
> 123 [724101722493,5026328101569]
> 123 [724101722493,5026328001562]
> 123 [781676672721,5026328101569]
> 123 [781676672721,5026328001562]
>
> How does CompositeInputFormat scale when we want to join 600K with 2
> million records? Does it run on one node with a single map/reduce?
>
> Also, how can I avoid writing the result into a file and instead split the
> result across different nodes, where I can compare the tuples, e.g. comparing
> 724101722493 with 5026328101569 using some heuristics?
>
> thanks
> --
> View this message in context:
> http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals