Thanks Matei. Using your pointers I can import data from HDFS. What I want to do now is something like this in Spark:
-----------------------
import myown.mapper

rdd.map(mapper.map)
-----------------------

The reason why I want this: myown.mapper is a Java class I already developed. I used to run it in Hadoop. It is fairly complex and relies on a lot of utility Java classes I wrote. Can I reuse the map function I wrote in Java and port it into Spark?

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan


From: Matei Zaharia <matei.zaha...@gmail.com>
To: user@spark.apache.org
Date: 06/04/2014 04:28 PM
Subject: Re: reuse hadoop code in Spark

Yes, you can write some glue in Spark to call these. Some functions to look at:

- SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc.)
- RDD.mapPartitions lets you operate on all the values of one partition (block) at a time, similar to how Mappers in MapReduce work
- PairRDDFunctions.reduceByKey and groupByKey can be used for aggregation.
- RDD.pipe() can be used to call out to a script or binary, like Hadoop Streaming.

A fair number of people have been running both Java and Hadoop Streaming apps like this.

Matei

On Jun 4, 2014, at 1:08 PM, Wei Tan <w...@us.ibm.com> wrote:

Hello,

I am trying to use Spark in the following scenario: I have code written for Hadoop and am now trying to migrate it to Spark. The mappers and reducers are fairly complex, so I wonder if I can reuse the map() functions I already wrote in Hadoop (Java) and use Spark to chain them, mixing the Java map() functions with Spark operators?

Another related question: can I use binaries as operators, like Hadoop Streaming?

Thanks!
Wei
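
A minimal sketch of the wrapper Wei is after, in Scala (which can call the existing Java class directly). It assumes the map logic can be exposed as a plain method map(line: String): String, with Hadoop's Context plumbing factored out; the class name comes from the pseudocode above, and sc and the input path are placeholders:

-----------------------
// Sketch only: assumes myown.mapper has a no-arg constructor and a
// plain method map(line: String): String, i.e. the Hadoop Context
// plumbing has been factored out of the original Mapper.
import myown.mapper

val rdd = sc.textFile("hdfs:///path/to/input")  // path is illustrative

// Creating one mapper instance per partition avoids shipping a
// (possibly non-serializable) object inside the closure, and
// amortizes setup cost, much like Mapper.setup() in Hadoop.
val mapped = rdd.mapPartitions { lines =>
  val m = new mapper()
  lines.map(m.map)
}
-----------------------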
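
And a hedged sketch of the glue functions Matei lists, also in Scala. Here sc is assumed to be an existing SparkContext, and the input path, word-count logic, and binary path are illustrative only:

-----------------------
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
// Brings in the PairRDDFunctions implicits (auto-imported in the shell).
import org.apache.spark.SparkContext._

// 1. SparkContext.hadoopRDD: build the input RDD from a JobConf
//    configured the Hadoop way (InputFormat, input paths, etc.).
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, "hdfs:///path/to/input")
val input = sc.hadoopRDD(conf, classOf[TextInputFormat],
                         classOf[LongWritable], classOf[Text])

// 2. RDD.mapPartitions: handle one partition (block) at a time,
//    similar to a Mapper running over an input split.
val pairs = input.mapPartitions { records =>
  records.flatMap { case (_, line) => line.toString.split("\\s+") }
         .map(word => (word, 1))
}

// 3. PairRDDFunctions.reduceByKey: shuffle-and-aggregate,
//    playing the role of the reduce phase.
val counts = pairs.reduceByKey(_ + _)

// 4. RDD.pipe: stream elements through an external script or binary,
//    like Hadoop Streaming (one element per stdin line).
val piped = input.map(_._2.toString).pipe("/path/to/my-binary")
-----------------------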