The join package does a streaming merge across the corresponding part-X
files in your input directories: part-0000 of each input is handled in a
single map task, part-0001 in another single task, and so on. These jobs
are essentially I/O-bound and hard to beat for performance.
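Here is a minimal sketch, using the old org.apache.hadoop.mapred API, of
how such a job is typically wired up. The paths "fileA", "fileB", and
"joined-output" and the job name are placeholders, and it assumes both
inputs are sorted on the key and partitioned identically, which
CompositeInputFormat requires:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class JoinJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JoinJob.class);
    conf.setJobName("fileA-fileB-join"); // placeholder name

    // Compose an inner join over the two input directories. Each map
    // task merges the corresponding part-X file from each directory.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("fileA"), new Path("fileB")));
    conf.setInputFormat(CompositeInputFormat.class);

    // Each map input is a (Text key, TupleWritable value) pair, where
    // the tuple holds one value from each joined source.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);

    // Map-only job: the merge happens in the mappers, no shuffle needed.
    conf.setNumReduceTasks(0);

    FileOutputFormat.setOutputPath(conf, new Path("joined-output"));
    JobClient.runJob(conf);
  }
}

With the default IdentityMapper this writes the joined tuples straight
out; plugging in your own mapper instead would let you compare the tuple
elements in-map rather than materializing the full cross product to a
file.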

On Wed, Jun 24, 2009 at 2:09 PM, pmg <parmod.me...@gmail.com> wrote:

>
> I have two files, FileA (with 600K records) and FileB (with 2 million
> records).
>
> FileA has a key which is the same for all the records:
>
> 123    724101722493
> 123    781676672721
>
> FileB has the same key as FileA:
>
> 123    5026328101569
> 123    5026328001562
>
> Using the Hadoop join package I can create an output file with tuples: the
> cross product of FileA and FileB.
>
> 123    [724101722493,5026328101569]
> 123    [724101722493,5026328001562]
> 123    [781676672721,5026328101569]
> 123    [781676672721,5026328001562]
>
> How does CompositeInputFormat scale when we want to join 600K records with
> 2 million records? Does it run on a single node with a single map/reduce
> task?
>
> Also, how can I avoid writing the result into a file and instead split the
> result across different nodes, where I can compare the tuples, e.g.
> comparing 724101722493 with 5026328101569 using some heuristics?
>
> thanks


-- 
Pro Hadoop, a book to guide you from beginner to Hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
