Each executor runs for about 5 seconds, during which time the db connection can potentially be open. Each executor will have 1 connection open. Connection pooling certainly has its advantages: performance, and not hitting the db server for every open/close. The database in question is not used just by the Spark jobs; it is shared by other systems, so the Spark jobs have to be better at managing the resources.
I am not really looking for a db connections counter (I will let the db handle that part), but rather to have a pool of connections on the Spark end so that the connections can be reused across jobs.

On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke <charles.fed...@gmail.com> wrote:

> How long does each executor keep the connection open for? How many
> connections does each executor open?
>
> Are you certain that connection pooling is a performant and suitable
> solution? Are you running out of resources on the database server and
> cannot tolerate each executor having a single connection?
>
> If you need a solution that limits the number of open connections
> [resource starvation on the DB server] I think you'd have to fake it with
> a centralized counter of active connections, and logic within each
> executor that blocks when the counter is at a given threshold. If the
> counter is not at threshold, then an active connection can be created
> (after incrementing the shared counter). You could use something like
> ZooKeeper to store the counter value. This would have the overall effect
> of decreasing performance if your required number of connections
> outstrips the database's resources.
>
> On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri <sateesh.kav...@gmail.com>
> wrote:
>
>> But this basically means that the pool is confined to the job (of a
>> single app) in question, and is not sharable across multiple apps?
>> The setup we have is a job server (the spark-jobserver) that creates
>> jobs. Currently, we have each job opening and closing a connection to
>> the database. What we would like to achieve is for each of the jobs to
>> obtain a connection from a db pool.
>>
>> Any directions on how this can be achieved?
>>
>> --
>> Sateesh
>>
>> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Connection pools aren't serializable, so you generally need to set
>>> them up inside of a closure. Doing that for every item is wasteful, so
>>> you typically want to use mapPartitions or foreachPartition:
>>>
>>> rdd.mapPartitions { part =>
>>>   setupPool
>>>   part.map { ...
>>>
>>> See "Design Patterns for using foreachRDD" in
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>>
>>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri
>>> <sateesh.kav...@gmail.com> wrote:
>>>
>>>> Right, I am aware of how to use connection pooling with Oracle, but
>>>> the specific question is how to use it in the context of Spark job
>>>> execution.
>>>> On 2 Apr 2015 17:41, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>>
>>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>>
>>>>> The question doesn't seem to be Spark specific, btw
>>>>>
>>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri
>>>>> <sateesh.kav...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > We have a case where we will have to run concurrent jobs (for the
>>>>> same algorithm) on different data sets. These jobs can run in
>>>>> parallel, and each one of them would be fetching the data from the
>>>>> database.
>>>>> > We would like to optimize the database connections by making use
>>>>> of connection pooling. Any suggestions / best known ways on how to
>>>>> achieve this? The database in question is Oracle.
>>>>> >
>>>>> > Thanks,
>>>>> > Sateesh
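
To make the pattern from Cody's reply concrete, here is a minimal Scala sketch of a per-executor pool, assuming HikariCP and the Oracle JDBC driver are on the executor classpath. The object name ConnectionPool, the JDBC URL, the credentials, and the "scores" table are placeholders, not anything from this thread. A Scala `object` is not serialized with the task closure; it is initialized lazily and independently in each executor JVM, so every task (from any job running on that executor) reuses the same pool:

    import java.sql.Connection
    import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
    import org.apache.spark.rdd.RDD

    // Built once per executor JVM, the first time a task touches it.
    object ConnectionPool {
      lazy val dataSource: HikariDataSource = {
        val config = new HikariConfig()
        config.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE") // placeholder
        config.setUsername("app_user")                               // placeholder
        config.setPassword("app_password")                           // placeholder
        config.setMaximumPoolSize(4) // caps connections per executor, not cluster-wide
        new HikariDataSource(config)
      }

      def withConnection[T](f: Connection => T): T = {
        val conn = dataSource.getConnection() // borrow from the pool
        try f(conn) finally conn.close()      // close() returns it to the pool
      }
    }

    def saveScores(rdd: RDD[(String, Double)]): Unit = {
      // One borrow/return per partition, not per record
      rdd.foreachPartition { partition =>
        ConnectionPool.withConnection { conn =>
          val stmt = conn.prepareStatement("UPDATE scores SET score = ? WHERE id = ?")
          partition.foreach { case (id, score) =>
            stmt.setDouble(1, score)
            stmt.setString(2, id)
            stmt.executeUpdate()
          }
          stmt.close()
        }
      }
    }

Note the caveat from earlier in the thread still applies: pools live per executor JVM and cannot be shared across applications, so the cluster-wide connection count is roughly maximumPoolSize times the number of executors. Size the per-executor pool with that product in mind, or fall back to an external coordinator (e.g. ZooKeeper) as Charles suggests if you need a hard global cap.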