Hey Mike,

I took a quick look through the example and found a major performance issue:
you are collecting the RDDs to the driver and then sending them to Mongo in
a foreach. That funnels every record through a single machine and throws
away the cluster's parallelism. Why not do a distributed push to Mongo
instead?

WHAT YOU HAVE
val mongoConnection = ...
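i.e., roughly this (a hypothetical sketch of the pattern, not your actual
code; I'm assuming a single driver-side Casbah connection and a simple
record stream):

stream.foreachRDD { rdd =>
  // collect() pulls the entire RDD back to the driver...
  rdd.collect().foreach { record =>
    // ...and every write then goes serially through the one
    // driver-side connection
    mongoConnection.insert(MongoDBObject("value" -> record))
  }
}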

WHAT YOU SHOULD DO

rdd.foreachPartition { iterator =>
  // one connection per partition, created on the executor
  val connection = createConnection()
  iterator.foreach { record => /* ... push record using connection ... */ }
  connection.close()
}
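
For concreteness, here is a minimal runnable sketch of that pattern inside a
streaming job, using Casbah purely as an illustration (the driver choice,
host, database/collection names, and (String, Long) record type are my
assumptions, not from your code):

import com.mongodb.casbah.Imports._

stream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    // The connection is created on the executor, once per partition,
    // so nothing Mongo-related has to be serialized from the driver.
    val client = MongoClient("localhost", 27017)
    val collection = client("mydb")("counts")
    iterator.foreach { case (key, count) =>
      collection.insert(MongoDBObject("key" -> key, "count" -> count))
    }
    client.close()
  }
}

If a connection per partition turns out to be too expensive, the usual
refinement is a lazily initialized per-JVM singleton (or a small connection
pool) that all partitions on an executor share.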


On Thu, Feb 26, 2015 at 1:25 PM, Mike Trienis <mike.trie...@orcsol.com>
wrote:

> Hi All,
>
> I have Spark Streaming set up to write data to a replicated MongoDB
> database and would like to understand whether there would be any issues
> using the Reactive Mongo library to write directly to MongoDB. My stack is
> Apache Spark sitting on top of Cassandra for the datastore, so my thinking
> is that the MongoDB connector for Hadoop will not be particularly useful
> for me since I'm not using HDFS. Is there anything I'm missing?
>
> Here is an example of code that I'm planning on using as a starting point
> for my implementation.
>
> LogAggregator
> <https://github.com/chimpler/blog-spark-streaming-log-aggregation/blob/master/src/main/scala/com/chimpler/sparkstreaminglogaggregation/LogAggregator.scala>
>
> Thanks, Mike.
>
