Each executor runs for about 5 seconds, during which time the db connection can potentially be open. Each executor will have 1 connection open. Connection pooling certainly has its advantages: performance, and not hitting the db server for every open/close. The database in question is not used just by the Spark jobs; it is shared by other systems, so the Spark jobs have to be better at managing the resources.
I am not really looking for a db connections counter (I will let the db handle that part), but rather to have a pool of connections on the Spark end so that the connections can be reused across jobs.

On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke <charles.fed...@gmail.com> wrote:

> How long does each executor keep the connection open for? How many
> connections does each executor open?
>
> Are you certain that connection pooling is a performant and suitable
> solution? Are you running out of resources on the database server and
> cannot tolerate each executor having a single connection?
>
> If you need a solution that limits the number of open connections
> [resource starvation on the DB server] I think you'd have to fake it with
> a centralized counter of active connections, and logic within each
> executor that blocks when the counter is at a given threshold. If the
> counter is not at threshold, then an active connection can be created
> (after incrementing the shared counter). You could use something like
> ZooKeeper to store the counter value. This would have the overall effect
> of decreasing performance if your required number of connections
> outstrips the database's resources.
>
> On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri <sateesh.kav...@gmail.com>
> wrote:
>
>> But this basically means that the pool is confined to the job (of a
>> single app) in question, and is not sharable across multiple apps?
>> The setup we have is a job server (the spark-jobserver) that creates
>> jobs. Currently, we have each job opening and closing a connection to
>> the database. What we would like to achieve is for each of the jobs to
>> obtain a connection from a db pool.
>>
>> Any directions on how this can be achieved?
>>
>> --
>> Sateesh
>>
>> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Connection pools aren't serializable, so you generally need to set
>>> them up inside of a closure. Doing that for every item is wasteful, so
>>> you typically want to use mapPartitions or foreachPartition:
>>>
>>> rdd.mapPartitions { part =>
>>>   setupPool
>>>   part.map { ...
>>>
>>> See "Design Patterns for using foreachRDD" in
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>>
>>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri
>>> <sateesh.kav...@gmail.com> wrote:
>>>
>>>> Right, I am aware of how to use connection pooling with Oracle, but
>>>> the specific question is how to use it in the context of Spark job
>>>> execution.
>>>> On 2 Apr 2015 17:41, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>>
>>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>>
>>>>> The question doesn't seem to be Spark specific, btw
>>>>>
>>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri
>>>>> <sateesh.kav...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > We have a case where we will have to run concurrent jobs (for the
>>>>> same algorithm) on different data sets. These jobs can run in
>>>>> parallel, and each one of them would be fetching the data from the
>>>>> database.
>>>>> > We would like to optimize the database connections by making use
>>>>> of connection pooling. Any suggestions / best known ways on how to
>>>>> achieve this? The database in question is Oracle.
>>>>> >
>>>>> > Thanks,
>>>>> > Sateesh
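
To make the pattern from Cody's reply concrete, here is a minimal Scala sketch of a per-executor pool, assuming HikariCP and the Oracle JDBC driver are on the executor classpath. The object name ConnectionPool, the JDBC URL, the credentials, and the "scores" table are placeholders, not anything from this thread. A Scala `object` is not serialized with the task closure; it is initialized lazily and independently in each executor JVM, so every task (from any job running on that executor) reuses the same pool:

    import java.sql.Connection
    import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
    import org.apache.spark.rdd.RDD

    // Built once per executor JVM, the first time a task touches it.
    object ConnectionPool {
      lazy val dataSource: HikariDataSource = {
        val config = new HikariConfig()
        config.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE") // placeholder
        config.setUsername("app_user")                               // placeholder
        config.setPassword("app_password")                           // placeholder
        config.setMaximumPoolSize(4) // caps connections per executor, not cluster-wide
        new HikariDataSource(config)
      }

      def withConnection[T](f: Connection => T): T = {
        val conn = dataSource.getConnection() // borrow from the pool
        try f(conn) finally conn.close()      // close() returns it to the pool
      }
    }

    def saveScores(rdd: RDD[(String, Double)]): Unit = {
      // One borrow/return per partition, not per record
      rdd.foreachPartition { partition =>
        ConnectionPool.withConnection { conn =>
          val stmt = conn.prepareStatement("UPDATE scores SET score = ? WHERE id = ?")
          partition.foreach { case (id, score) =>
            stmt.setDouble(1, score)
            stmt.setString(2, id)
            stmt.executeUpdate()
          }
          stmt.close()
        }
      }
    }

Note the caveat from earlier in the thread still applies: pools live per executor JVM and cannot be shared across applications, so the cluster-wide connection count is roughly maximumPoolSize times the number of executors. Size the per-executor pool with that product in mind, or fall back to an external coordinator (e.g. ZooKeeper) as Charles suggests if you need a hard global cap.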