Re: Multiple input formats and multiple output formats in Hadoop 0.20.2

Dino Kečo Wed, 10 Aug 2011 09:20:55 -0700

Hi John,

I think this is what you are looking for:


http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html

http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Usage examples are included in the API docs.
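
To give a rough idea of how the two classes fit together, here is a minimal,
untested sketch of a single job that reads two input formats and writes two
file outputs. Everything specific in it is an assumption made for illustration:
the class names (RankChangeJob, UpdateMapper, CurrentMapper, RankReducer), the
"old:"/"new:" value tags, the tab-separated layout of the update file, the
threshold of 100, and the named output "overThreshold". It also assumes the
current ranks are available as a SequenceFile of (productId, rank) Text pairs;
reading from and writing to Cassandra directly is not shown.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RankChangeJob {

    // Hypothetical mapper for the daily update file: one "productId<TAB>newRank" per line.
    public static class UpdateMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length == 2) {
                ctx.write(new Text(fields[0]), new Text("new:" + fields[1].trim()));
            }
        }
    }

    // Hypothetical mapper for current ranks stored as SequenceFile<Text, Text>.
    public static class CurrentMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text productId, Text rank, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(productId, new Text("old:" + rank.toString().trim()));
        }
    }

    // Joins old and new rank per product; writes the new rank to the main
    // output and large changes to the named output "overThreshold".
    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final long THRESHOLD = 100L; // assumed value
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context ctx) {
            mos = new MultipleOutputs<Text, Text>(ctx);
        }

        @Override
        protected void reduce(Text productId, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            Long oldRank = null;
            Long newRank = null;
            for (Text value : values) {
                String s = value.toString();
                if (s.startsWith("old:")) {
                    oldRank = Long.valueOf(s.substring(4));
                } else if (s.startsWith("new:")) {
                    newRank = Long.valueOf(s.substring(4));
                }
            }
            if (newRank == null) {
                return; // no update for this product today
            }
            ctx.write(productId, new Text(newRank.toString()));
            if (oldRank != null && Math.abs(newRank - oldRank) > THRESHOLD) {
                mos.write("overThreshold", productId, new Text(oldRank + "\t" + newRank));
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            mos.close(); // flushes the named outputs
        }
    }

    // args: <update file> <current ranks SequenceFile> <output dir>
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "rank-change");
        job.setJarByClass(RankChangeJob.class);

        // One input format and one mapper per input path.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, UpdateMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                SequenceFileInputFormat.class, CurrentMapper.class);

        job.setReducerClass(RankReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        // Second output, written alongside the main one in the output dir.
        MultipleOutputs.addNamedOutput(job, "overThreshold",
                TextOutputFormat.class, Text.class, Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Both mappers emit the same key/value types, so one reducer can join the two
sources on the product id. Please check the linked API docs to confirm these
classes are present in the Hadoop build you are running.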

Regards,
Dino Kečo


On Wed, Aug 10, 2011 at 6:08 PM, Jian Fang <jian.fang.subscr...@gmail.com> wrote:

> Hi,
>
> I am working on a project that requires multiple input formats and
> multiple output formats. Basically, I store some sales rank data in a
> Cassandra cluster, and I get a sales rank update file each day to update the
> ranks in Cassandra. Meanwhile, I need to find all the products
> whose rank change exceeds a threshold and output them to a file. That is to
> say, I need two input formats, one from the file system (sales rank update
> file) and one from Cassandra (current sales rank), and two output
> formats, one to the file system (products whose rank change exceeds a
> threshold) and one to Cassandra (write the new rank to Cassandra).
>
> Right now, I use multiple cascading jobs to do the work and use HDFS to
> share data among jobs. But this is not very efficient, since some
> intermediate files need to be read multiple times in different jobs. I
> wonder if there is a more elegant way to solve this problem. It seems Hadoop
> 0.19 supports multiple input/output formats. It would be great if I could
> merge the multiple jobs into one with multiple input formats and multiple
> output formats. Is this doable in Hadoop 0.20.2? Are there any examples of
> multiple input formats and multiple output formats for Hadoop 0.20.2?
>
> Thanks in advance,
>
> John
>
>
