How long does each executor keep the connection open? How many connections does each executor open?
Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection?

If you need a solution that limits the number of open connections (to avoid resource starvation on the DB server), I think you'd have to fake it with a centralized counter of active connections, plus logic within each executor that blocks when the counter is at a given threshold. If the counter is not at the threshold, a new connection can be created (after incrementing the shared counter). You could use something like ZooKeeper to store the counter value. This would have the overall effect of decreasing performance if your required number of connections outstrips the database's resources.

On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri <sateesh.kav...@gmail.com> wrote:

> But this basically means that the pool is confined to the job (of a single
> app) in question, but is not sharable across multiple apps?
> The setup we have is a job server (the spark-jobserver) that creates jobs.
> Currently, we have each job opening and closing a connection to the
> database. What we would like to achieve is for each of the jobs to obtain a
> connection from a db pool.
>
> Any directions on how this can be achieved?
>
> --
> Sateesh
>
> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Connection pools aren't serializable, so you generally need to set them
>> up inside of a closure. Doing that for every item is wasteful, so you
>> typically want to use mapPartitions or foreachPartition:
>>
>> rdd.mapPartitions { part =>
>>   setupPool
>>   part.map { ...
>>
>> See "Design Patterns for using foreachRDD" in
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>
>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri <sateesh.kav...@gmail.com> wrote:
>>
>>> Right, I am aware of how to use connection pooling with Oracle, but the
>>> specific question is how to use it in the context of Spark job execution.
>>>
>>> On 2 Apr 2015 17:41, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>
>>>> The question doesn't seem to be Spark-specific, btw.
>>>>
>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri <sateesh.kav...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > We have a case where we will have to run concurrent jobs (for the same
>>>> > algorithm) on different data sets. These jobs can run in parallel, and
>>>> > each one of them would be fetching data from the database.
>>>> > We would like to optimize the database connections by making use of
>>>> > connection pooling. Any suggestions / best-known ways on how to achieve
>>>> > this? The database in question is Oracle.
>>>> >
>>>> > Thanks,
>>>> > Sateesh
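
[Editor's note: the per-partition pool pattern quoted above can be sketched outside Spark. This is a plain-Python analogue, not real Spark or JDBC code; the `Pool` class and its `lookup` method are hypothetical stand-ins for an actual connection pool and query.]

```python
# Sketch of the mapPartitions pattern: build one (non-serializable) pool
# per partition, reuse it for every record in that partition, then close it.
# Pool is a pretend connection pool, not a real JDBC/Oracle API.

class Pool:
    """Counts constructions so we can see pools are per-partition."""
    created = 0

    def __init__(self):
        Pool.created += 1

    def lookup(self, record):
        return record * 2  # stand-in for a database query

    def close(self):
        pass

def map_partition(partition):
    pool = Pool()              # one pool per partition, not per record
    try:
        for record in partition:
            yield pool.lookup(record)
    finally:
        pool.close()           # release connections when the partition ends

# Simulate an RDD with two partitions:
partitions = [[1, 2, 3], [4, 5]]
results = [list(map_partition(p)) for p in partitions]
print(results)       # [[2, 4, 6], [8, 10]]
print(Pool.created)  # 2 -- one pool per partition, not one per record
```

The point of the pattern is the construction count: with per-record setup you would pay for five pools here; with mapPartitions-style setup you pay for two.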
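
[Editor's note: the "centralized counter with blocking at a threshold" idea from the top of the thread can be approximated within a single process by a semaphore. This is a local analogue only; a real cross-executor version would need the counter in ZooKeeper or similar, as the reply suggests. `run_job` and `MAX_CONNECTIONS` are illustrative names, not part of any Spark API.]

```python
# Cap concurrent "DB connections" at a threshold; jobs block until a
# slot frees up. threading.Semaphore only coordinates within one
# process -- across executors you would need a shared store (ZooKeeper).
import threading
import time

MAX_CONNECTIONS = 3
slots = threading.Semaphore(MAX_CONNECTIONS)
lock = threading.Lock()
active = 0
peak = 0

def run_job(job_id):
    global active, peak
    with slots:                  # blocks while MAX_CONNECTIONS are in use
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # pretend to open a connection and query
        with lock:
            active -= 1

threads = [threading.Thread(target=run_job, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAX_CONNECTIONS
```

As the reply notes, this trades throughput for safety: when demand outstrips the cap, jobs queue up rather than overwhelming the database.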