Step 1: Use rdd.mapPartitions(). That's the equivalent of the mapper in MapReduce. You can open a connection, fetch all the data and buffer it, close the connection, and return an iterator over the buffer.

Step 2: Improve on step 1 by reusing connections. You can use singletons / static vars to lazily initialize and reuse a pool of connections. You will have to take care of concurrency, as multiple tasks may be using the database in parallel in the same worker JVM.
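A minimal sketch of both steps in plain Scala, assuming a hypothetical `Connection` class standing in for your real database client (the names `Connection`, `ConnectionPool`, and `lookupPartition` are illustrative, not a real API). `lookupPartition` is the shape of the function you would pass to `rdd.mapPartitions()`:

```scala
// Hypothetical stand-in for a real database connection (assumption, not a real API).
class Connection {
  def query(key: String): String = s"row-for-$key"
  def close(): Unit = ()
}

// Step 2: a lazily initialized, JVM-wide pool of reusable connections.
// In Spark, an object like this exists once per worker JVM, so access is
// synchronized because multiple tasks can run concurrently in that JVM.
object ConnectionPool {
  private val pool = scala.collection.mutable.Queue.empty[Connection]

  def borrow(): Connection = synchronized {
    if (pool.isEmpty) new Connection() else pool.dequeue()
  }

  def giveBack(c: Connection): Unit = synchronized { pool.enqueue(c) }
}

// Step 1: one connection serves the whole partition; results are buffered
// (materialized) before the connection is released, then returned as an iterator.
def lookupPartition(keys: Iterator[String]): Iterator[String] = {
  val conn = ConnectionPool.borrow()
  try {
    val buffer = keys.map(conn.query).toVector // buffer before releasing conn
    buffer.iterator
  } finally {
    ConnectionPool.giveBack(conn)
  }
}
```

In a real job this would run as `rdd.mapPartitions(lookupPartition)`; since the function only depends on the Iterator interface, it can also be exercised directly on a plain iterator.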
TD

On Thu, Jul 17, 2014 at 4:41 PM, Guangle Fan <fanguan...@gmail.com> wrote:
> Hi, All
>
> When I run spark streaming, in one of the flatMap stages, I want to access
> a database.
>
> Code looks like:
>
> stream.flatMap(
>   new FlatMapFunction {
>     call() {
>       // access database cluster
>     }
>   }
> )
>
> Since I don't want to create a database connection every time call() is
> called, where is the best place to create the connection and reuse it on a
> per-host basis (like one database connection per Mapper/Reducer)?
>
> Regards,
>
> Guangle