Hi all, I am experimenting with and learning about performance on big tasks locally, on a node with 32 cores and more than 64 GB of RAM. Data is loaded from a database through the JDBC driver, and heavy computations are launched against it. I have two questions:
1. My RDD is poorly distributed. I am partitioning into 32 pieces, but the first 31 pieces are extremely lightweight compared to piece 32:

15/07/28 13:37:48 INFO Executor: Finished task 30.0 in stage 0.0 (TID 30). 1419 bytes result sent to driver
15/07/28 13:37:48 INFO TaskSetManager: Starting task 31.0 in stage 0.0 (TID 31, localhost, PROCESS_LOCAL, 1539 bytes)
15/07/28 13:37:48 INFO Executor: Running task 31.0 in stage 0.0 (TID 31)
15/07/28 13:37:48 INFO TaskSetManager: Finished task 30.0 in stage 0.0 (TID 30) in 2798 ms on localhost (31/32)
15/07/28 13:37:48 INFO CacheManager: Partition rdd_2_31 not found, computing it

All of the other pieces take about 3 seconds, while the last one takes around 15 minutes to compute. Is there anything I can do about this, preferably without reshuffling, i.e. through the DataFrameReader JDBC options (lowerBound, upperBound, partitionColumn)?

2. After a long time of processing, I sometimes get OOMs, and I cannot find a how-to for falling back and retrying against already-persisted data to avoid losing that time.

Thanks,
Saif
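In case it helps to see why the skew happens: as far as I understand, Spark's JDBC reader splits the [lowerBound, upperBound] range into equal-width strides on the partition column, so if the column's values are not uniformly distributed (or many rows fall outside the bounds, which get swept into the first/last partitions), one partition ends up with nearly all the data. A rough sketch of that splitting logic in Python (an illustration of the idea, not Spark's actual code):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Illustrative sketch: build equal-width WHERE predicates the way
    Spark's JDBC source conceptually splits [lower, upper).
    The first partition is unbounded below (and catches NULLs),
    the last is unbounded above -- so out-of-range rows pile up there."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower_clause and upper_clause:
            predicates.append(f"{lower_clause} AND {upper_clause}")
        elif lower_clause:
            predicates.append(lower_clause)  # last partition: open-ended above
        else:
            predicates.append(f"{upper_clause} OR {column} IS NULL")
    return predicates
```

So if most of your rows have partition-column values near (or above) upperBound, the last predicate matches almost everything. Possible fixes without reshuffling: pick a more uniformly distributed column (e.g. a dense surrogate key or a hash/modulo expression computed on the database side), make sure lowerBound/upperBound actually match the column's min/max, or use the DataFrameReader.jdbc overload that accepts an explicit array of predicate strings so you can hand-tune the ranges.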