Does bulkLoad hold the connection to MongoDB?
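
If it does, that is worth checking first: anything the closure captures has to
be serialized out to the executors. A common pattern is to open the Casbah
connection inside the partition function itself, and to remember that
mapPartitions is lazy, so nothing runs until an action is called. A rough
sketch of what bulkLoad might look like under those assumptions (the host,
database, and collection names are placeholders, and parser.parse is assumed
to produce Casbah DBObjects):

import com.mongodb.casbah.Imports._

// Hypothetical bulkLoad: the MongoClient is created inside the function,
// so nothing unserializable is captured in the task closure.
def bulkLoad(records: Iterator[DBObject]): Iterator[Int] = {
  val client = MongoClient("mongo-host", 27017)   // placeholder host/port
  val coll   = client("etl")("documents")         // placeholder db/collection
  val counts = records
    .grouped(1000)                                // batch up the partition
    .map { batch => coll.insert(batch: _*); batch.size }
    .toList                                       // force the writes before closing
  client.close()
  counts.iterator
}

// mapPartitions is lazy; without an action nothing runs on the executors.
sc.textFile(inputFile)
  .map(parser.parse)        // assumed to yield DBObject
  .mapPartitions(bulkLoad)
  .count()                  // any action works; this just forces evaluation

If nothing needs to come back from the partitions, foreachPartition avoids the
extra action entirely (a sketch of that variant is inline below).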

On Fri, Nov 21, 2014 at 4:34 PM, Benny Thompson <ben.d.tho...@gmail.com>
wrote:

> I tried using RDD#mapPartitions, but my job completes prematurely and
> without error, as if nothing gets done. What I have is fairly simple:
>
> sc
>   .textFile(inputFile)
>   .map(parser.parse)
>   .mapPartitions(bulkLoad)
>
> But the Iterator[T] of mapPartitions is always empty, even though I know
> map is generating records.
>
>
> On Thu Nov 20 2014 at 9:25:54 PM Soumya Simanta <soumya.sima...@gmail.com>
> wrote:
>
>> On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson <ben.d.tho...@gmail.com>
>> wrote:
>>
>>> I'm trying to use MongoDB as a destination for an ETL I'm writing in
>>> Spark. It appears I'm incurring a lot of overhead in my system databases
>>> (and possibly in the primary documents themselves); I can only assume it's
>>> because I'm left using PairRDD.saveAsNewAPIHadoopFile.
>>>
>>> - Is there a way to batch some of the data together and use Casbah
>>> natively so I can use bulk inserts?
>>>
>>
>> Why can't you write to Mongo in an RDD#mapPartitions?
>>
>>
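
A minimal sketch of that suggestion, using foreachPartition (itself an action,
so no separate trigger is needed) with one Casbah connection and batched
inserts per partition; the host, database, and collection names are
placeholders:

import com.mongodb.casbah.Imports._

sc.textFile(inputFile)
  .map(parser.parse)                               // assumed to yield DBObject
  .foreachPartition { records =>
    val client = MongoClient("mongo-host", 27017)  // placeholder host/port
    val coll   = client("etl")("documents")        // placeholder db/collection
    try {
      // Insert in batches rather than one document at a time.
      records.grouped(1000).foreach(batch => coll.insert(batch: _*))
    } finally {
      client.close()
    }
  }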
>>>
>>> - Is there maybe a less "hacky" way to load into MongoDB (instead of
>>> using saveAsNewAPIHadoopFile)?
>>>
>>>
>> If the latency (the time by which all data must be in Mongo) is not a
>> concern, you can try a separate process that uses Akka/Casbah to write from
>> HDFS into Mongo.
>>
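
A rough sketch of that decoupled approach, assuming the Spark job has already
written newline-delimited JSON to HDFS, with a single Akka actor owning the
Mongo connection; every path, host, and class name here (MongoWriter,
HdfsToMongo, InsertBatch) is a placeholder, and the Akka calls are the
2.3-era classic API:

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.gracefulStop
import com.mongodb.casbah.Imports._
import com.mongodb.util.JSON
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.io.Source

case class InsertBatch(docs: Seq[DBObject])

// The actor owns the Mongo connection and performs the batched inserts.
class MongoWriter extends Actor {
  private val client = MongoClient("mongo-host", 27017)   // placeholder
  private val coll   = client("etl")("documents")         // placeholder
  def receive = {
    case InsertBatch(docs) => coll.insert(docs: _*)
  }
  override def postStop(): Unit = client.close()
}

object HdfsToMongo extends App {
  val system = ActorSystem("hdfs-to-mongo")
  val writer = system.actorOf(Props[MongoWriter], "writer")

  // Stream one HDFS file of newline-delimited JSON (placeholder path).
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path("/etl/output/part-00000"))
  Source.fromInputStream(in).getLines()
    .map(line => JSON.parse(line).asInstanceOf[DBObject])
    .grouped(1000)
    .foreach(batch => writer ! InsertBatch(batch))
  in.close()

  // Let the actor drain its mailbox before shutting down.
  Await.result(gracefulStop(writer, 10.minutes), 11.minutes)
  system.shutdown()
}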
>>
>>
>>
