Yes, you can write some glue in Spark to call these. Some functions to look at:
- SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc.)
- RDD.mapPartitions lets you operate on all the values in one partition (block) at a time, similar to how Mappers in MapReduce work
- PairRDDFunctions.reduceByKey and groupByKey can be used for aggregation
- RDD.pipe() can be used to call out to a script or binary, like Hadoop Streaming

A fair number of people have been running both Java and Hadoop Streaming apps like this. (A rough sketch of how these pieces fit together appears after the quoted message below.)

Matei

On Jun 4, 2014, at 1:08 PM, Wei Tan <w...@us.ibm.com> wrote:

> Hello,
>
> I am trying to use Spark in the following scenario:
>
> I have code written in Hadoop and am now trying to migrate to Spark. The mappers and reducers are fairly complex, so I wonder if I can reuse the map() functions I already wrote in Hadoop (Java), and use Spark to chain them, mixing the Java map() functions with Spark operators?
>
> A related question: can I use binaries as operators, like Hadoop Streaming?
>
> Thanks!
> Wei
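For the record, here is a minimal sketch (not from the original thread) of how the pieces above could be wired together in Scala. The input/output paths, the word-count-style map logic, and the legacy_reducer.sh script are placeholder assumptions standing in for existing Hadoop code; only the Spark calls (hadoopRDD, mapPartitions, reduceByKey, pipe) are the real APIs being discussed.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopGlueSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hadoop-glue"))

        // 1. Build a JobConf the same way the existing Hadoop job does,
        //    then turn it into an RDD with SparkContext.hadoopRDD.
        val jobConf = new JobConf()
        FileInputFormat.addInputPath(jobConf, new Path("/user/wei/input")) // placeholder path
        val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
          classOf[LongWritable], classOf[Text])

        // 2. mapPartitions sees one partition (one input split) at a time,
        //    so existing per-record map() logic can be called inside it.
        //    Here the "mapper" is a stand-in that emits (word, 1) pairs.
        val mapped = records.mapPartitions { iter =>
          iter.flatMap { case (_, line) =>
            line.toString.split("\\s+").map(word => (word, 1))
          }
        }

        // 3. reduceByKey plays the role of the reduce side.
        val counts = mapped.reduceByKey(_ + _)

        // 4. pipe() streams each element through an external script or binary
        //    over stdin/stdout, much like Hadoop Streaming.
        val piped = counts
          .map { case (word, count) => s"$word\t$count" }
          .pipe("/path/to/legacy_reducer.sh") // placeholder script

        piped.saveAsTextFile("/user/wei/output") // placeholder path
        sc.stop()
      }
    }

The mapPartitions step is where existing Java map() code would typically be invoked, since it gives you an iterator over a whole split rather than one record at a time; the pipe() step covers the Hadoop Streaming-style case where the mapper or reducer is an arbitrary binary.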