bulkLoad has the connection to MongoDB?

On Fri, Nov 21, 2014 at 4:34 PM, Benny Thompson <ben.d.tho...@gmail.com> wrote:
> I tried using RDD#mapPartitions, but my job completes prematurely and
> without error, as if nothing gets done. What I have is fairly simple:
>
>     sc
>       .textFile(inputFile)
>       .map(parser.parse)
>       .mapPartitions(bulkLoad)
>
> But the Iterator[T] of mapPartitions is always empty, even though I know
> map is generating records.
>
>
> On Thu, Nov 20, 2014 at 9:25:54 PM, Soumya Simanta <soumya.sima...@gmail.com>
> wrote:
>
>> On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson <ben.d.tho...@gmail.com>
>> wrote:
>>
>>> I'm trying to use MongoDB as a destination for an ETL I'm writing in
>>> Spark. It appears I'm gaining a lot of overhead in my system databases
>>> (and possibly in the primary documents themselves); I can only assume
>>> it's because I'm left to using PairRDD.saveAsNewAPIHadoopFile.
>>>
>>> - Is there a way to batch some of the data together and use Casbah
>>>   natively so I can use bulk inserts?
>>
>> Why can't you write to Mongo in RDD#mapPartitions?
>>
>>> - Is there maybe a less "hacky" way to load into MongoDB (instead of
>>>   using saveAsNewAPIHadoopFile)?
>>
>> If the latency (the time by which all data should be in Mongo) is not a
>> concern, you can try a separate process that uses Akka/Casbah to write
>> from HDFS into Mongo.
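
One likely culprit: mapPartitions is a lazy transformation, so the chain above
never actually runs unless an action is called on the result, and if bulkLoad
drains the iterator and returns an empty one, the resulting RDD is empty in any
case. For a side-effecting write like this, foreachPartition (an action) is a
better fit, and opening the Casbah client inside the partition sidesteps the
serialization question about who holds the connection. Roughly something like
the following (untested sketch; Record, the placeholder parser, the localhost
connection, the "etl"/"records" database and collection, the input path, and
the 1000-record batch size are all made up to keep it self-contained):

    import com.mongodb.casbah.Imports._
    import org.apache.spark.{SparkConf, SparkContext}

    // Stand-in for whatever parser.parse produces in the original job.
    case class Record(id: String, value: String)

    object parser {
      def parse(line: String): Record = {
        val Array(id, value) = line.split(",", 2)   // placeholder parsing logic
        Record(id, value)
      }
    }

    def bulkLoad(records: Iterator[Record]): Unit = {
      // Open the client on the executor, inside the partition, so the
      // connection never has to be serialized from the driver.
      val client = MongoClient("localhost", 27017)      // placeholder host/port
      val coll   = client("etl")("records")             // placeholder db/collection
      try {
        records.grouped(1000).foreach { batch =>        // batch size is a guess
          val docs = batch.map(r => MongoDBObject("_id" -> r.id, "value" -> r.value))
          coll.insert(docs: _*)                         // one insert call per batch
        }
      } finally {
        client.close()
      }
    }

    val sc = new SparkContext(new SparkConf().setAppName("mongo-etl"))
    val inputFile = "hdfs:///path/to/input"             // placeholder path

    sc.textFile(inputFile)
      .map(parser.parse)
      .foreachPartition(bulkLoad)    // an action, so the writes actually run

If you'd rather stay with mapPartitions (say, to return per-partition insert
counts), make sure bulkLoad returns a non-empty iterator and that something
downstream triggers an action, e.g. a .count() on the result.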