Hi Saif,

Are you using JdbcRDD directly from Spark? If so, the poor distribution could be due to the bound key you used.
See the JdbcRDD Scala doc at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD :

sql - the text of the query. The query must contain two ? placeholders for parameters used to partition the results. E.g. "select title, author from books where ? <= id and id <= ?"
lowerBound - the minimum value of the first placeholder
upperBound - the maximum value of the second placeholder. The lower and upper bounds are inclusive.
numPartitions - the number of partitions. Given a lowerBound of 1, an upperBound of 20, and a numPartitions of 2, the query would be executed twice, once with (1, 10) and once with (11, 20).

Shenyan

On Tue, Jul 28, 2015 at 2:41 PM, <saif.a.ell...@wellsfargo.com> wrote:
> Hi all,
>
> I am experimenting with and learning about performance on big tasks locally, on a 32-core node with more than 64 GB of RAM. Data is loaded from a database through a JDBC driver, and heavy computations are launched against it. I have two questions:
>
> 1. My RDD is poorly distributed. I am partitioning into 32 pieces, but the first 31 pieces are extremely lightweight compared to piece 32.
>
> 15/07/28 13:37:48 INFO Executor: Finished task 30.0 in stage 0.0 (TID 30). 1419 bytes result sent to driver
> 15/07/28 13:37:48 INFO TaskSetManager: Starting task 31.0 in stage 0.0 (TID 31, localhost, PROCESS_LOCAL, 1539 bytes)
> 15/07/28 13:37:48 INFO Executor: Running task 31.0 in stage 0.0 (TID 31)
> 15/07/28 13:37:48 INFO TaskSetManager: Finished task 30.0 in stage 0.0 (TID 30) in 2798 ms on localhost (31/32)
> 15/07/28 13:37:48 INFO CacheManager: Partition rdd_2_31 not found, computing it
>
> *...All pieces take 3 seconds while the last one takes around 15 minutes to compute...*
>
> Is there anything I can do about this? Preferably without reshuffling, i.e. via the DataFrameReader JDBC options (lowerBound, upperBound, partitionColumn).
>
> 2. After a long time of processing, I sometimes get OOMs, and I cannot find a how-to for falling back and retrying from already-persisted data, to avoid losing that time.
>
> Thanks,
> Saif
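The even-range split described in the doc quote above can be sketched in plain Scala. This is a minimal illustration of the arithmetic, not Spark's source; the helper name computeBounds is ours, not a Spark API:

```scala
// A minimal sketch of how JdbcRDD splits the inclusive range
// [lowerBound, upperBound] into numPartitions sub-ranges
// (computeBounds is an illustrative helper, not a Spark API).
def computeBounds(lowerBound: Long, upperBound: Long,
                  numPartitions: Int): Seq[(Long, Long)] = {
  val length = BigInt(1) + upperBound - lowerBound  // total number of keys
  (0 until numPartitions).map { i =>
    val start = lowerBound + ((BigInt(i) * length) / numPartitions).toLong
    val end   = lowerBound + ((BigInt(i + 1) * length) / numPartitions).toLong - 1
    (start, end)
  }
}

// The doc example: lowerBound = 1, upperBound = 20, numPartitions = 2
println(computeBounds(1, 20, 2))  // Vector((1,10), (11,20))
```

Note that the split produces equal-width key ranges, not equal row counts: if most rows have key values near the upper bound, the last partition ends up with almost all the data, which matches the skew described in the quoted message. Choosing a partition column whose values are roughly uniformly distributed over [lowerBound, upperBound] avoids this without reshuffling.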