Re: CompositeInputFormat

2014-08-09 Thread Pedro Magalhaes
I forgot to mention the quote is from Hadoop: The Definitive Guide. On Thu, Aug 7, 2014 at 6:04 PM, Pedro Magalhaes pedror...@gmail.com wrote: Thanks for the reply. What I am really doing is trying to implement a map-side join. In my mind, I am going to need the files to be non-splittable, so each map will

Re: CompositeInputFormat

2013-07-11 Thread Jay Vyas
Map-side joins will use the CompositeInputFormat. They will only really be worth doing if one data set is small and the other is large. This is a good example: http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/ The trick is to google for CompositeInputFormat.compose() :)
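To make the compose() hint concrete, here is a minimal sketch that rebuilds the join-expression string CompositeInputFormat.compose() produces, following the op(tbl(class,"path"),...) grammar of the mapred join package, in plain Java so it runs without Hadoop on the classpath. The paths and the choice of an inner join are illustrative assumptions, not anything from the thread.

```java
// Stand-in for CompositeInputFormat.compose(): builds the
// op(tbl(class,"path"),...) join expression in plain Java.
// Paths and the "inner" op below are illustrative assumptions.
public class JoinExprSketch {

    // Mirrors one tbl(...) leaf of the join expression grammar.
    static String tbl(String inputFormatClass, String path) {
        return "tbl(" + inputFormatClass + ",\"" + path + "\")";
    }

    // Mirrors compose(op, inputFormat, paths...): one tbl(...) per input dir.
    static String compose(String op, String inputFormatClass, String... paths) {
        StringBuilder expr = new StringBuilder(op).append('(');
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) expr.append(',');
            expr.append(tbl(inputFormatClass, paths[i]));
        }
        return expr.append(')').toString();
    }

    public static void main(String[] args) {
        // In a real old-API driver this string would be set as the
        // "mapred.join.expr" property before submitting the job.
        System.out.println(compose("inner",
                "org.apache.hadoop.mapred.KeyValueTextInputFormat",
                "/data/users", "/data/orders"));
    }
}
```

In a real job you would call the Hadoop method itself rather than this stand-in; the point is only to show what the expression string looks like before it goes into the job configuration.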

RE: CompositeInputFormat

2013-07-11 Thread Botelho, Andrew
] Sent: Thursday, July 11, 2013 5:10 PM To: common-u...@hadoop.apache.org Subject: Re: CompositeInputFormat Map Side joins will use the CompositeInputFormat. They will only really be worth doing if one data set is small, and the other is large. This is a good example : http://www.congiu.com/joins

RE: CompositeInputFormat

2013-07-11 Thread Devaraj k
] Sent: 12 July 2013 03:33 To: user@hadoop.apache.org Subject: RE: CompositeInputFormat Sorry, I should've specified that I need an example of CompositeInputFormat that uses the new API. The example linked below uses old-API objects like JobConf. Any known examples of CompositeInputFormat using

Re: CompositeInputFormat - why in mapred but not mapreduce?

2012-01-15 Thread Harsh J
Mike, The mapred.* API has been undeprecated and continues to be the stable API. In 1.0.0, the new API is/was unfinished and lacks a lot of ports from the mapred.lib.* components. This is being addressed by https://issues.apache.org/jira/browse/MAPREDUCE-3607 if you are interested in backporting

Re: CompositeInputFormat scalability

2009-06-24 Thread jason hadoop
The join package does a streaming merge sort between each part-XXXX in your input directories: part-0000 will be handled in a single task, part-0001 will be handled in a single task, and so on. These jobs are essentially I/O bound and hard to beat for performance. On Wed, Jun 24, 2009 at 2:09 PM, pmg

Re: CompositeInputFormat scalability

2009-06-24 Thread pmg
And what decides the part-0000, part-0001 input split: block size? So, for example, for 1 GB of data on HDFS with a 64 MB block size, do we get 16 blocks mapped to different map tasks? jason hadoop wrote: The join package does a streaming merge sort between each part-XXXX in your input directories,

Re: CompositeInputFormat scalability

2009-06-24 Thread jason hadoop
The input split size is Long.MAX_VALUE, and in actual fact the contents of each directory are sorted separately. The number of directory entries for each has to be identical, and all files in index position I, where I varies from 0 to the number of files in a directory, become the input to 1
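The pairing described above, where the file at the same index in each directory feeds one task that merge-joins records by key, can be sketched as a toy in-memory merge join over two pre-sorted inputs. The String[]{key, value} record shape and the sample data are assumptions for illustration; the real join package also handles duplicate keys and more than two sources, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of what one task does with its pair of part files:
// an inner merge join over two inputs pre-sorted by key.
// The String[]{key, value} record shape is an illustrative assumption.
public class MergeJoinSketch {

    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                 // left key has no match on the right
            } else if (cmp > 0) {
                j++;                 // right key has no match on the left
            } else {
                // Matching keys: emit the joined record, advance both sides.
                out.add(left.get(i)[0] + "\t" + left.get(i)[1]
                        + "\t" + right.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[]{"u1", "alice"},
                                       new String[]{"u2", "bob"},
                                       new String[]{"u4", "dave"});
        List<String[]> orders = List.of(new String[]{"u2", "order-9"},
                                        new String[]{"u3", "order-7"},
                                        new String[]{"u4", "order-5"});
        // Only keys present in both sorted inputs survive an inner join.
        innerJoin(users, orders).forEach(System.out::println);
    }
}
```

Because each side is consumed in one forward pass, the join is streaming and I/O bound, which is why both inputs must already be sorted and identically partitioned.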