Hi Saif,

Are you using JdbcRDD directly from Spark? If so, the poor distribution could be due to the bound key you used.
See the JdbcRDD Scala doc at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD :

sql - the text of the query. The query must contain two ? placeholders for parameters used to partition the results. E.g. "select title, author from books where ? <= id and id <= ?"
lowerBound - the minimum value of the first placeholder
upperBound - the maximum value of the second placeholder. The lower and upper bounds are inclusive.
numPartitions - the number of partitions. Given a lowerBound of 1, an upperBound of 20, and a numPartitions of 2, the query would be executed twice, once with (1, 10) and once with (11, 20).

Shenyan

On Tue, Jul 28, 2015 at 2:41 PM, <saif.a.ell...@wellsfargo.com> wrote:
> Hi all,
>
> I am experimenting with and learning about performance on big tasks locally, on a 32-core node with more than 64 GB of RAM. Data is loaded from a database through a JDBC driver, and heavy computations are launched against it. I have two questions:
>
> 1. My RDD is poorly distributed. I am partitioning into 32 pieces, but the first 31 pieces are extremely lightweight compared to piece 32.
>
> 15/07/28 13:37:48 INFO Executor: Finished task 30.0 in stage 0.0 (TID 30). 1419 bytes result sent to driver
> 15/07/28 13:37:48 INFO TaskSetManager: Starting task 31.0 in stage 0.0 (TID 31, localhost, PROCESS_LOCAL, 1539 bytes)
> 15/07/28 13:37:48 INFO Executor: Running task 31.0 in stage 0.0 (TID 31)
> 15/07/28 13:37:48 INFO TaskSetManager: Finished task 30.0 in stage 0.0 (TID 30) in 2798 ms on localhost (31/32)
> 15/07/28 13:37:48 INFO CacheManager: Partition rdd_2_31 not found, computing it
>
> *...All pieces take 3 seconds while the last one takes around 15 minutes to compute...*
>
> Is there anything I can do about this? Preferably without reshuffling, i.e. via the DataFrameReader JDBC options (lowerBound, upperBound, partitionColumn).
>
> 2. After a long time of processing, I sometimes get OOMs, and I cannot find a how-to for falling back and retrying from already-persisted data, to avoid losing that time.
>
> Thanks,
> Saif
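The even-range split described in the doc quote above can be sketched in plain Scala. This is a minimal illustration of the arithmetic, not Spark's source; the helper name computeBounds is ours, not a Spark API:

```scala
// A minimal sketch of how JdbcRDD splits the inclusive range
// [lowerBound, upperBound] into numPartitions sub-ranges
// (computeBounds is an illustrative helper, not a Spark API).
def computeBounds(lowerBound: Long, upperBound: Long,
                  numPartitions: Int): Seq[(Long, Long)] = {
  val length = BigInt(1) + upperBound - lowerBound  // total number of keys
  (0 until numPartitions).map { i =>
    val start = lowerBound + ((BigInt(i) * length) / numPartitions).toLong
    val end   = lowerBound + ((BigInt(i + 1) * length) / numPartitions).toLong - 1
    (start, end)
  }
}

// The doc example: lowerBound = 1, upperBound = 20, numPartitions = 2
println(computeBounds(1, 20, 2))  // Vector((1,10), (11,20))
```

Note that the split produces equal-width key ranges, not equal row counts: if most rows have key values near the upper bound, the last partition ends up with almost all the data, which matches the skew described in the quoted message. Choosing a partition column whose values are roughly uniformly distributed over [lowerBound, upperBound] avoids this without reshuffling.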