How long does each executor keep the connection open? How many connections does each executor open?
Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection?

If you need a solution that limits the number of open connections (to avoid resource starvation on the DB server), I think you'd have to fake it with a centralized counter of active connections, plus logic within each executor that blocks when the counter is at a given threshold. If the counter is not at the threshold, a new connection can be created (after incrementing the shared counter). You could use something like ZooKeeper to store the counter value. This would have the overall effect of decreasing performance if your required number of connections outstrips the database's resources.

On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri <sateesh.kav...@gmail.com> wrote:

> But this basically means that the pool is confined to the job (of a single
> app) in question, but is not sharable across multiple apps?
> The setup we have is a job server (the spark-jobserver) that creates jobs.
> Currently, we have each job opening and closing a connection to the
> database. What we would like to achieve is for each of the jobs to obtain a
> connection from a db pool.
>
> Any directions on how this can be achieved?
>
> --
> Sateesh
>
> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Connection pools aren't serializable, so you generally need to set them
>> up inside of a closure. Doing that for every item is wasteful, so you
>> typically want to use mapPartitions or foreachPartition:
>>
>> rdd.mapPartitions { part =>
>>   setupPool
>>   part.map { ...
>>
>> See "Design Patterns for using foreachRDD" in
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>
>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri <sateesh.kav...@gmail.com> wrote:
>>
>>> Right, I am aware of how to use connection pooling with Oracle, but the
>>> specific question is how to use it in the context of Spark job execution.
>>>
>>> On 2 Apr 2015 17:41, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>
>>>> The question doesn't seem to be Spark-specific, btw.
>>>>
>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri <sateesh.kav...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > We have a case where we will have to run concurrent jobs (for the same
>>>> > algorithm) on different data sets. These jobs can run in parallel, and
>>>> > each one of them would be fetching data from the database.
>>>> > We would like to optimize the database connections by making use of
>>>> > connection pooling. Any suggestions / best-known ways on how to achieve
>>>> > this? The database in question is Oracle.
>>>> >
>>>> > Thanks,
>>>> > Sateesh
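
[Editor's note: the per-partition pool pattern quoted above can be sketched outside Spark. This is a plain-Python analogue, not real Spark or JDBC code; the `Pool` class and its `lookup` method are hypothetical stand-ins for an actual connection pool and query.]

```python
# Sketch of the mapPartitions pattern: build one (non-serializable) pool
# per partition, reuse it for every record in that partition, then close it.
# Pool is a pretend connection pool, not a real JDBC/Oracle API.

class Pool:
    """Counts constructions so we can see pools are per-partition."""
    created = 0

    def __init__(self):
        Pool.created += 1

    def lookup(self, record):
        return record * 2  # stand-in for a database query

    def close(self):
        pass

def map_partition(partition):
    pool = Pool()              # one pool per partition, not per record
    try:
        for record in partition:
            yield pool.lookup(record)
    finally:
        pool.close()           # release connections when the partition ends

# Simulate an RDD with two partitions:
partitions = [[1, 2, 3], [4, 5]]
results = [list(map_partition(p)) for p in partitions]
print(results)       # [[2, 4, 6], [8, 10]]
print(Pool.created)  # 2 -- one pool per partition, not one per record
```

The point of the pattern is the construction count: with per-record setup you would pay for five pools here; with mapPartitions-style setup you pay for two.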
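
[Editor's note: the "centralized counter with blocking at a threshold" idea from the top of the thread can be approximated within a single process by a semaphore. This is a local analogue only; a real cross-executor version would need the counter in ZooKeeper or similar, as the reply suggests. `run_job` and `MAX_CONNECTIONS` are illustrative names, not part of any Spark API.]

```python
# Cap concurrent "DB connections" at a threshold; jobs block until a
# slot frees up. threading.Semaphore only coordinates within one
# process -- across executors you would need a shared store (ZooKeeper).
import threading
import time

MAX_CONNECTIONS = 3
slots = threading.Semaphore(MAX_CONNECTIONS)
lock = threading.Lock()
active = 0
peak = 0

def run_job(job_id):
    global active, peak
    with slots:                  # blocks while MAX_CONNECTIONS are in use
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # pretend to open a connection and query
        with lock:
            active -= 1

threads = [threading.Thread(target=run_job, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAX_CONNECTIONS
```

As the reply notes, this trades throughput for safety: when demand outstrips the cap, jobs queue up rather than overwhelming the database.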