Hey Mike, I had a quick look through the example and found a major performance issue: you are collecting the RDDs to the driver and then sending them to Mongo in a foreach. Why not do a distributed push to Mongo instead?
WHAT YOU HAVE:

    val mongoConnection = ...

WHAT YOU SHOULD DO:

    rdd.foreachPartition { iterator =>
      val connection = createConnection()
      iterator.foreach { ... push partition using connection ... }
    }

On Thu, Feb 26, 2015 at 1:25 PM, Mike Trienis <mike.trie...@orcsol.com> wrote:
> Hi All,
>
> I have Spark Streaming set up to write data to a replicated MongoDB
> database and would like to understand if there would be any issues using
> the Reactive Mongo library to write directly to MongoDB? My stack is
> Apache Spark sitting on top of Cassandra for the datastore, so my thinking
> is that the MongoDB connector for Hadoop will not be particularly useful
> for me since I'm not using HDFS? Is there anything that I'm missing?
>
> Here is an example of code that I'm planning on using as a starting point
> for my implementation.
>
> LogAggregator
> <https://github.com/chimpler/blog-spark-streaming-log-aggregation/blob/master/src/main/scala/com/chimpler/sparkstreaminglogaggregation/LogAggregator.scala>
>
> Thanks, Mike.
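
For reference, here is a fuller sketch of the foreachPartition pattern. It is only a sketch: it uses the plain mongo-java-driver rather than Reactive Mongo, and the host, port, database/collection names, and the (key, count) record shape are all placeholder assumptions you would adapt to your job.

    import com.mongodb.{BasicDBObject, MongoClient}
    import org.apache.spark.rdd.RDD

    // Assumes an RDD of (key, count) pairs; adapt the record shape to yours.
    def saveToMongo(rdd: RDD[(String, Long)]): Unit = {
      rdd.foreachPartition { iterator =>
        // The client is created inside the closure, i.e. on the executor,
        // so no non-serializable connection object is shipped from the driver.
        val client = new MongoClient("mongo-host", 27017) // placeholder host/port
        val collection = client.getDB("logs").getCollection("counts") // placeholder names
        try {
          iterator.foreach { case (key, count) =>
            collection.insert(new BasicDBObject("key", key).append("count", count))
          }
        } finally {
          client.close() // release one connection per partition
        }
      }
    }

In a streaming job you would call this from dstream.foreachRDD. Creating one client per partition (rather than per record) keeps connection churn low, and nothing has to be serialized from the driver to the workers.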