Hi all,

I am experimenting with and learning about performance on big tasks locally, on a 
node with 32 cores and more than 64 GB of RAM. Data is loaded from a database 
through the JDBC driver, and I launch heavy computations against it. I have 
two questions:

1.      My RDD is poorly distributed. I am partitioning into 32 pieces, but the 
first 31 pieces are extremely lightweight compared to piece 32:

      15/07/28 13:37:48 INFO Executor: Finished task 30.0 in stage 0.0 (TID 
30). 1419 bytes result sent to driver
      15/07/28 13:37:48 INFO TaskSetManager: Starting task 31.0 in stage 0.0 
(TID 31, localhost, PROCESS_LOCAL, 1539 bytes)
      15/07/28 13:37:48 INFO Executor: Running task 31.0 in stage 0.0 (TID 31)
      15/07/28 13:37:48 INFO TaskSetManager: Finished task 30.0 in stage 0.0 
(TID 30) in 2798 ms on localhost (31/32)
      15/07/28 13:37:48 INFO CacheManager: Partition rdd_2_31 not found, 
computing it
      ...All pieces take 3 seconds, while the last one takes around 15 minutes 
to compute...

      Is there anything I can do about this? Preferably without reshuffling, 
i.e., via the DataFrameReader JDBC options (lowerBound, upperBound, partition 
column).
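For anyone following along, this skew usually means the partition column's values are not spread evenly across [lowerBound, upperBound]. A minimal Python sketch of the stride logic Spark's JDBC reader applies (approximately; exact clauses may differ by version, and the function name here is illustrative, not a Spark API) shows why: the bounds only slice the range, they do not filter rows, so any values clustered near one end all land in the same partition.

```python
# Sketch of stride-based range partitioning, approximately mirroring how
# Spark's JDBC reader turns (partitionColumn, lowerBound, upperBound,
# numPartitions) into one WHERE clause per partition. Illustrative only.
def jdbc_where_clauses(column, lower_bound, upper_bound, num_partitions):
    stride = (upper_bound - lower_bound) // num_partitions
    clauses = []
    for i in range(num_partitions):
        lo = lower_bound + i * stride
        hi = lo + stride
        if i == 0:
            # first partition also picks up NULLs and anything below the range
            clauses.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended: everything >= lo falls here,
            # which is exactly how one partition ends up with most of the data
            clauses.append(f"{column} >= {lo}")
        else:
            clauses.append(f"{column} >= {lo} AND {column} < {hi}")
    return clauses
```

So if most row ids sit above lowerBound + 31 * stride, task 32 gets nearly everything. Picking bounds that match the actual min/max of the column, or partitioning on a more uniformly distributed column, evens this out without a shuffle.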

2.      After a long time of processing, I sometimes get OOMs, and I cannot 
find a how-to for falling back and retrying from already persisted data, to 
avoid losing computation time.
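In the meantime, the retry-from-persisted-data pattern can be sketched generically. This is an illustrative helper, not an existing Spark API: attempt the expensive computation a few times, and if every attempt fails, return a previously persisted result instead of recomputing from scratch.

```python
# Generic retry-with-fallback helper (illustrative sketch, not a Spark API).
# compute: the expensive computation to attempt.
# fallback: a zero-argument callable returning the persisted/checkpointed
#           result to use when all attempts fail.
def with_retries(compute, fallback, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return compute()
        except Exception:  # in practice, catch the specific failure type
            continue
    # all attempts exhausted: fall back to already persisted data
    return fallback()
```

In a Spark job the same idea would mean checkpointing or persisting intermediate results, then reading those back in the fallback path rather than redoing the whole lineage.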

Thanks,
Saif
