Is that quote from the product documentation?

Whether the output files are splittable is a practical consideration when setting up the join; the quote identifies one common case that satisfies the constraints. The size of each partition is irrelevant, provided the splits are generated consistently across all InputFormats involved in the expression. That is, given datasets A and B in a join expression and a key K in partition N of A, K must either be in partition N of B or not appear in B at all.

-C
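
A minimal sketch of the consistency condition above, in plain Java with no Hadoop dependency. The class and method names (`PartitionConsistency`, `partitionOf`, `partition`, `consistent`) are hypothetical and exist only for illustration; the hash-modulo assignment is an assumption that stands in for whatever partitioner the upstream jobs actually used. It does not check sort order or anything else a real map-side join may require.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionConsistency {

    // Hypothetical helper: assign a key to a partition by hash modulo.
    // This is an assumed stand-in for the partitioner used by the upstream jobs.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Split a list of keys into numPartitions sets using partitionOf.
    static List<Set<String>> partition(List<String> keys, int numPartitions) {
        List<Set<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new HashSet<>());
        }
        for (String k : keys) {
            parts.get(partitionOf(k, numPartitions)).add(k);
        }
        return parts;
    }

    // Check the condition from the reply: every key K in partition N of A
    // is either in partition N of B or absent from B entirely.
    static boolean consistent(List<Set<String>> a, List<Set<String>> b) {
        if (a.size() != b.size()) {
            return false; // different partition counts cannot line up
        }
        Set<String> allOfB = new HashSet<>();
        for (Set<String> p : b) {
            allOfB.addAll(p);
        }
        for (int n = 0; n < a.size(); n++) {
            for (String k : a.get(n)) {
                if (allOfB.contains(k) && !b.get(n).contains(k)) {
                    return false; // K exists in B but in a different partition
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keysA = Arrays.asList("k1", "k2", "k3", "k4");
        List<String> keysB = Arrays.asList("k2", "k4", "k9");

        // Same partitioner and same partition count on both sides.
        System.out.println(consistent(partition(keysA, 4), partition(keysB, 4)));

        // Different partition counts on the two sides.
        System.out.println(consistent(partition(keysA, 4), partition(keysB, 3)));
    }
}
```

Note that nothing in the check depends on how large each partition is, only on which partition each key lands in on both sides.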
On Mon, Aug 4, 2014 at 1:36 PM, Pedro Magalhaes <pedror...@gmail.com> wrote:
> I saw that one of the requirements to use CompositeInputFormat is:
> "A map-side join can be used to join the outputs of several jobs that had
> the same number of reducers, the same keys, and output files that are not
> splittable (by being smaller than an HDFS block, or by virtue of being gzip
> compressed, for example)"
>
> So must my partition size be equal to or smaller than the HDFS block?
>
> If I have a 1 GB file (1024 MB), will I have 16 partitions of 64 MB?
>
> How can I control the size of the partition?