Thanks Matei. Using your pointers I can import data from HDFS. What I want to do now is something like this in Spark:
-----------------------
import myown.mapper

rdd.map(mapper.map)
-----------------------

The reason why I want this: myown.mapper is a Java class I already developed. I used to run it in Hadoop. It is fairly complex and relies on a lot of utility Java classes I wrote. Can I reuse the map function I wrote in Java and port it into Spark?

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan


From: Matei Zaharia <matei.zaha...@gmail.com>
To: user@spark.apache.org
Date: 06/04/2014 04:28 PM
Subject: Re: reuse hadoop code in Spark

Yes, you can write some glue in Spark to call these. Some functions to look at:

- SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc.)
- RDD.mapPartitions lets you operate on all the values of one partition (block) at a time, similar to how Mappers in MapReduce work
- PairRDDFunctions.reduceByKey and groupByKey can be used for aggregation.
- RDD.pipe() can be used to call out to a script or binary, like Hadoop Streaming.

A fair number of people have been running both Java and Hadoop Streaming apps like this.

Matei

On Jun 4, 2014, at 1:08 PM, Wei Tan <w...@us.ibm.com> wrote:

Hello,

I am trying to use Spark in the following scenario: I have code written for Hadoop and am now trying to migrate it to Spark. The mappers and reducers are fairly complex, so I wonder if I can reuse the map() functions I already wrote in Hadoop (Java) and use Spark to chain them, mixing the Java map() functions with Spark operators?

Another related question: can I use binaries as operators, like Hadoop Streaming?

Thanks!
Wei
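
A minimal sketch of the wrapper Wei is after, in Scala (which can call the existing Java class directly). It assumes the map logic can be exposed as a plain method map(line: String): String, with Hadoop's Context plumbing factored out; the class name comes from the pseudocode above, and sc and the input path are placeholders:

-----------------------
// Sketch only: assumes myown.mapper has a no-arg constructor and a
// plain method map(line: String): String, i.e. the Hadoop Context
// plumbing has been factored out of the original Mapper.
import myown.mapper

val rdd = sc.textFile("hdfs:///path/to/input")  // path is illustrative

// Creating one mapper instance per partition avoids shipping a
// (possibly non-serializable) object inside the closure, and
// amortizes setup cost, much like Mapper.setup() in Hadoop.
val mapped = rdd.mapPartitions { lines =>
  val m = new mapper()
  lines.map(m.map)
}
-----------------------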
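
And a hedged sketch of the glue functions Matei lists, also in Scala. Here sc is assumed to be an existing SparkContext, and the input path, word-count logic, and binary path are illustrative only:

-----------------------
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
// Brings in the PairRDDFunctions implicits (auto-imported in the shell).
import org.apache.spark.SparkContext._

// 1. SparkContext.hadoopRDD: build the input RDD from a JobConf
//    configured the Hadoop way (InputFormat, input paths, etc.).
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, "hdfs:///path/to/input")
val input = sc.hadoopRDD(conf, classOf[TextInputFormat],
                         classOf[LongWritable], classOf[Text])

// 2. RDD.mapPartitions: handle one partition (block) at a time,
//    similar to a Mapper running over an input split.
val pairs = input.mapPartitions { records =>
  records.flatMap { case (_, line) => line.toString.split("\\s+") }
         .map(word => (word, 1))
}

// 3. PairRDDFunctions.reduceByKey: shuffle-and-aggregate,
//    playing the role of the reduce phase.
val counts = pairs.reduceByKey(_ + _)

// 4. RDD.pipe: stream elements through an external script or binary,
//    like Hadoop Streaming (one element per stdin line).
val piped = input.map(_._2.toString).pipe("/path/to/my-binary")
-----------------------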