Hi Saif,

Are you using JdbcRDD directly from Spark?
If so, the poor distribution could be due to the partition key and bounds you used.

See the JdbcRDD Scala doc at
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD:
sql
    the text of the query. The query must contain two ? placeholders for
    parameters used to partition the results. E.g. "select title, author
    from books where ? <= id and id <= ?"

lowerBound
    the minimum value of the first placeholder

upperBound
    the maximum value of the second placeholder. The lower and upper bounds
    are inclusive.

numPartitions
    the number of partitions. Given a lowerBound of 1, an upperBound of 20,
    and a numPartitions of 2, the query would be executed twice, once with
    (1, 10) and once with (11, 20).
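
For reference, a minimal sketch of building such a JdbcRDD (the JDBC URL,
credentials, and "books" table are made up, and I'm assuming a spark-shell
session where sc already exists):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    // Sketch only: connection details and table are placeholders.
    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass"),
      "select title, author from books where ? <= id and id <= ?",
      lowerBound = 1,      // fills the first ? placeholder
      upperBound = 20,     // fills the second ? placeholder
      numPartitions = 2,   // splits the range: task 0 runs (1, 10), task 1 runs (11, 20)
      mapRow = rs => (rs.getString(1), rs.getString(2)))

Note the range is split evenly by value, not by row count, so if most rows
fall near the top of the key range, the last partition ends up doing most of
the work.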

Shenyan


On Tue, Jul 28, 2015 at 2:41 PM, <saif.a.ell...@wellsfargo.com> wrote:

>  Hi all,
>
> I am experimenting with and learning about performance on big tasks locally,
> on a 32-core node with more than 64GB of RAM. Data is loaded from a database
> through a JDBC driver, and I launch heavy computations against it. I have
> two questions:
>
>
>    1. My RDD is poorly distributed. I am partitioning into 32 pieces, but
>    the first 31 pieces are extremely lightweight compared to piece 32
>
>
> 15/07/28 13:37:48 INFO Executor: Finished task 30.0 in stage 0.0 (TID 30).
> 1419 bytes result sent to driver
> 15/07/28 13:37:48 INFO TaskSetManager: Starting task 31.0 in stage 0.0
> (TID 31, localhost, PROCESS_LOCAL, 1539 bytes)
> 15/07/28 13:37:48 INFO Executor: Running task 31.0 in stage 0.0 (TID 31)
> 15/07/28 13:37:48 INFO TaskSetManager: Finished task 30.0 in stage 0.0
> (TID 30) in 2798 ms on localhost (31/32)
> 15/07/28 13:37:48 INFO CacheManager: Partition rdd_2_31 not found,
> computing it
> *...All pieces take 3 seconds while the last one takes around 15 minutes to
> compute...*
>
> Is there anything I can do about this? Preferably without reshuffling,
> i.e. through the DataFrameReader JDBC options (lowerBound, upperBound,
> partition column).
>
>
>    2. After a long time of processing, I sometimes get OOMs. I cannot find
>    a how-to for falling back and retrying from already persisted data, to
>    avoid losing that time.
>
>
> Thanks,
> Saif
>
>
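
For the DataFrameReader JDBC path mentioned in the quoted message, the same
bounds-based splitting applies: the [lowerBound, upperBound] range of the
partition column is divided into numPartitions equal ranges, so a skewed key
still yields skewed partitions. A rough sketch (URL, table, and column names
are made up; assuming a spark-shell session where sqlContext exists):

    import java.util.Properties

    // Sketch only: connection details, table, and partition column are placeholders.
    val props = new Properties()
    props.setProperty("user", "user")
    props.setProperty("password", "pass")

    val df = sqlContext.read.jdbc(
      "jdbc:postgresql://dbhost/mydb",   // url
      "books",                           // table
      "id",                              // numeric partition column
      1L,                                // lowerBound
      20L,                               // upperBound
      2,                                 // numPartitions
      props)

If the values of the partition column are heavily skewed toward one end of
the range, picking tighter bounds or a more evenly distributed column may
help balance the partitions.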
