Step 1: use rdd.mapPartitions(). That's the equivalent of a mapper in
MapReduce. Inside the partition function you can open a connection, fetch
all the data into a buffer, close the connection, and return an iterator
over the buffer.
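A minimal sketch of that per-partition pattern, without the actual Spark API: the method below is the body you would pass to rdd.mapPartitions(). DbConnection is a made-up placeholder for a real database client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PartitionLookup {
    // Placeholder for a real database client; name and API are illustrative.
    static class DbConnection {
        String lookup(int key) { return "row-" + key; }
        void close() { /* a real client would release its socket here */ }
    }

    // One connection per partition: open, buffer all results, close,
    // return an iterator over the buffer.
    static Iterator<String> perPartition(Iterator<Integer> keys) {
        DbConnection conn = new DbConnection();
        List<String> buffer = new ArrayList<>();
        while (keys.hasNext()) {
            buffer.add(conn.lookup(keys.next()));
        }
        conn.close();  // closed before the task returns the iterator
        return buffer.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> out = perPartition(Arrays.asList(1, 2, 3).iterator());
        while (out.hasNext()) {
            System.out.println(out.next());
        }
    }
}
```

This costs one connection setup/teardown per partition rather than per record, which is already far cheaper than opening a connection inside call().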
Step 2: Improve step 1 by reusing connections. You can use singletons /
static vars to lazily initialize and reuse a pool of connections. You will
have to take care of concurrency, as multiple tasks may be using the
database in parallel in the same worker JVM.
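One way to sketch that pooling idea (again with a placeholder DbConnection, not a real driver): a lazily initialized static BlockingQueue serves as the pool, double-checked locking guards the one-time initialization, and take()/put() make borrowing safe across concurrent tasks in the same JVM.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ConnectionPool {
    // Placeholder for a real database client.
    static class DbConnection {
        String lookup(int key) { return "row-" + key; }
    }

    private static final int POOL_SIZE = 4;
    // Lazily initialized; shared by every task running in this worker JVM.
    private static volatile BlockingQueue<DbConnection> pool;

    private static BlockingQueue<DbConnection> getPool() {
        if (pool == null) {                       // double-checked locking
            synchronized (ConnectionPool.class) {
                if (pool == null) {
                    BlockingQueue<DbConnection> q =
                        new ArrayBlockingQueue<>(POOL_SIZE);
                    for (int i = 0; i < POOL_SIZE; i++) {
                        q.add(new DbConnection());
                    }
                    pool = q;
                }
            }
        }
        return pool;
    }

    // Borrow a connection, use it, and return it to the pool instead of
    // closing it, so later tasks on this worker can reuse it.
    static String lookupOne(int key) throws InterruptedException {
        DbConnection conn = getPool().take();
        try {
            return conn.lookup(key);
        } finally {
            getPool().put(conn);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(lookupOne(42));
    }
}
```

In real code you would call getPool() from inside the mapPartitions body, so the pool is created once per executor JVM on first use rather than once per task.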

TD


On Thu, Jul 17, 2014 at 4:41 PM, Guangle Fan <fanguan...@gmail.com> wrote:

> Hi, All
>
> When I run spark streaming, in one of the flatMap stage, I want to access
> database.
>
> Code looks like :
>
> stream.flatMap(
>   new FlatMapFunction() {
>     call() {
>       // access database cluster
>     }
>   }
> )
>
> Since I don't want to create database connection every time call() was
> called, where is the best place do I create the connection and reuse it on
> per-host basis (Like one database connection per Mapper/Reducer ) ?
>
> Regards,
>
> Guangle
>
>
