I forgot to mention that the quote is from Hadoop: The Definitive Guide.
On Thu, Aug 7, 2014 at 6:04 PM, Pedro Magalhaes pedror...@gmail.com wrote:
Thanks for the reply.
Really, what I am doing is trying to implement a map-side join. In my
mind, the files must not be splittable, so each map will process a whole file.
Map-side joins will use the CompositeInputFormat. They will only really be
worth doing if one data set is small and the other is large.
This is a good example:
http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/
the trick is to google for CompositeInputFormat.compose() :)
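To make that trick concrete, here is a rough sketch of the expression string it produces. The helper below only mimics the `op(tbl(class,"path"),...)` format that `CompositeInputFormat.compose()` generates (the exact format and the paths here are assumptions on my part), so it runs without Hadoop on the classpath; in a real driver you would call `compose()` itself and set the result as `mapred.join.expr` on the JobConf.

```java
public class JoinExprSketch {
    // Builds op(tbl(inputFormatClass,"path"),tbl(inputFormatClass,"path"),...)
    // -- a stand-in for what CompositeInputFormat.compose() returns.
    static String compose(String op, String inputFormatClass, String... paths) {
        StringBuilder sb = new StringBuilder(op).append('(');
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) sb.append(',');
            sb.append("tbl(").append(inputFormatClass)
              .append(",\"").append(paths[i]).append("\")");
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        String expr = compose("inner",
                "org.apache.hadoop.mapred.KeyValueTextInputFormat",
                "/data/left", "/data/right");
        // In a real driver: conf.set("mapred.join.expr", expr);
        System.out.println(expr);
    }
}
```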
Sent: Thursday, July 11, 2013 5:10 PM
To: common-u...@hadoop.apache.org
Subject: Re: CompositeInputFormat
Sent: 12 July 2013 03:33
To: user@hadoop.apache.org
Subject: RE: CompositeInputFormat
Sorry, I should have specified that I need an example of CompositeInputFormat that
uses the new API.
The example linked below uses old API objects like JobConf.
Are there any known examples of CompositeInputFormat using the new API?
Mike,
The mapred.* API has been undeprecated and continues to be the stable
API. In 1.0.0, the new API is still unfinished and lacks a lot of ports
from the mapred.lib.* components. This is being addressed by
https://issues.apache.org/jira/browse/MAPREDUCE-3607 if you are
interested in the backporting work.
The join package does a streaming merge sort between each part-X in your
input directories:
part-0000 will be handled in a single task,
part-0001 will be handled in a single task,
and so on.
These jobs are essentially IO-bound, and hard to beat for performance.
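To picture the streaming merge sort being described, here is a minimal, non-Hadoop sketch of an inner merge join over two key-sorted inputs. It is deliberately simplified to unique keys; the real join package handles duplicate keys and emits TupleWritable values, which this does not attempt.

```java
import java.util.*;

public class MergeJoinSketch {
    // Joins two key-sorted lists of (key, value) pairs the way a streaming
    // merge does: advance whichever side has the smaller key, emit on equality.
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                       // left key too small, advance left
            } else if (cmp > 0) {
                j++;                       // right key too small, advance right
            } else {
                out.add(left.get(i)[0] + ":" + left.get(i)[1] + "," + right.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(
                new String[]{"k1", "A"}, new String[]{"k2", "B"});
        List<String[]> b = Arrays.asList(
                new String[]{"k2", "X"}, new String[]{"k3", "Y"});
        System.out.println(mergeJoin(a, b));
    }
}
```

Because each side is read strictly in order, the whole join is a single sequential pass over both inputs, which is why these jobs end up IO-bound.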
On Wed, Jun 24, 2009 at 2:09 PM, pmg wrote:
And what decides the part-0000, part-0001 input split? Block size?
So, for example, for 1 GB of data on HDFS with a 64 MB block size, do we get
16 blocks mapped to different map tasks?
jason hadoop wrote:
The join package does a streaming merge sort between each part-X in your
input directories.
The input split size is Long.MAX_VALUE,
and in actual fact the contents of each directory are sorted separately.
The number of directory entries for each has to be identical,
and all files at index position I, where I varies from 0 to the number of
files in a directory, become the input to one map task.