Hi,

Since getting Spark + MongoDB to work together was not very obvious (at least to me), I wrote a tutorial about it on my blog, with an example application:
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
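In case it helps, the heart of the reading side is roughly the following (a minimal sketch only; the database and collection names are placeholders, and the mongo.input.uri key is the one described in the mongo-hadoop documentation, so check the post for the full setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;

    // Tell the mongo-hadoop connector which collection to read
    // (placeholder database/collection names).
    Configuration mongoConfig = new Configuration();
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection");

    JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

    // Use the new-API (org.apache.hadoop.mapreduce) MongoInputFormat,
    // which is what newAPIHadoopRDD expects.
    JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
            mongoConfig,
            com.mongodb.hadoop.MongoInputFormat.class,
            Object.class,
            BSONObject.class);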
Hope it's of use to someone else as well.

Cheers,

Sampo Niskanen
Lead developer / Wellmo
sampo.niska...@wellmo.com
+358 40 820 5291


On Tue, Feb 4, 2014 at 10:46 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Can you try using sc.newAPIHadoop** ? There are two kinds of classes
> because the Hadoop API for input and output formats underwent a
> significant change a few years ago.
>
> TD
>
>
> On Tue, Feb 4, 2014 at 5:58 AM, Sampo Niskanen <sampo.niska...@wellmo.com> wrote:
>
>> Hi,
>>
>> Thanks for the pointer. However, I'm still unable to generate the RDD
>> using MongoInputFormat. I'm trying to add the mongo-hadoop connector to
>> the Java SimpleApp in the quickstart at
>> http://spark.incubator.apache.org/docs/latest/quick-start.html
>>
>> The mongo-hadoop connector contains two versions of MongoInputFormat,
>> one extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>,
>> the other extending org.apache.hadoop.mapred.InputFormat<Object, BSONObject>.
>> Neither of them is accepted by the compiler, and I'm unsure why:
>>
>>     JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
>>     sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>>             Object.class, BSONObject.class);
>>     sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
>>             Object.class, BSONObject.class);
>>
>> Eclipse gives the following error for both of the latter two lines:
>>
>>     Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>,
>>     Class<K>, Class<V>) of type JavaSparkContext is not applicable for
>>     the arguments (JobConf, Class<MongoInputFormat>, Class<Object>,
>>     Class<BSONObject>). The inferred type MongoInputFormat is not a
>>     valid substitute for the bounded parameter <F extends InputFormat<K,V>>
>>
>> I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop
>> versions? I downloaded the mongo-hadoop connector for Hadoop 2.2. I
>> haven't figured out how to select which Hadoop version Spark uses when
>> it is pulled in from an sbt file. (The SBT file is the one described in
>> the quickstart.)
>>
>> Thanks for any help.
>>
>> Best regards,
>> Sampo N.
>>
>>
>> On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> I walked through the example in the second link you gave. The Treasury
>>> Yield example referred to there is here:
>>> https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
>>> Note the InputFormat and OutputFormat used in the job configuration.
>>> These classes specify how data is moved in and out of MongoDB. You
>>> should be able to use the same InputFormat and OutputFormat classes in
>>> Spark as well. For saving data to MongoDB, use
>>> yourRDD.saveAsHadoopFile(.... specify the output format class ...),
>>> and to read from MongoDB, use sparkContext.hadoopFile(..... specify
>>> input format class ....).
>>>
>>> TD
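For the saving direction, this idea with the new-API classes ends up looking roughly like the following (a sketch only: the mongo.output.uri key and MongoOutputFormat come from the mongo-hadoop documentation, the URI is a placeholder, 'results' stands for a pair RDD computed earlier, and as far as I can tell the path argument is ignored by the Mongo output format):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;

    // Tell the connector which collection to write to (placeholder URI).
    Configuration outputConfig = new Configuration();
    outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/mydb.results");

    // 'results' is a JavaPairRDD<Object, BSONObject> computed earlier;
    // the path argument appears to be unused by the Mongo output format.
    results.saveAsNewAPIHadoopFile(
            "file:///unused",
            Object.class,
            BSONObject.class,
            com.mongodb.hadoop.MongoOutputFormat.class,
            outputConfig);

The reading side is the mirror image with newAPIHadoopRDD and MongoInputFormat, as in the snippet at the top of this mail.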
>>> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen <sampo.niska...@wellmo.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're starting to build an analytics framework for our wellness
>>>> service. While our data is not yet Big, we'd like to use a framework
>>>> that will scale as needed, and Spark seems to be the best around.
>>>>
>>>> I'm new to Hadoop and Spark, and I'm having difficulty figuring out
>>>> how to use Spark in connection with MongoDB. Apparently, I should be
>>>> able to use the mongo-hadoop connector
>>>> (https://github.com/mongodb/mongo-hadoop) also with Spark, but I
>>>> haven't figured out how.
>>>>
>>>> I've run through the Spark tutorials and been able to set up a
>>>> single-machine Hadoop system with the MongoDB connector as instructed at
>>>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>>>> and
>>>> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>>>>
>>>> Could someone give some instructions or pointers on how to configure
>>>> and use the mongo-hadoop connector with Spark? I haven't been able to
>>>> find any documentation about this.
>>>>
>>>> Thanks.
>>>>
>>>> Best regards,
>>>> Sampo N.