Hi,

I recently upgraded from Spark 1.2.1 to 1.3.1 (through HDP).
I have a job that does a cartesian product of two datasets (at least 2K and 500K lines respectively) to do string matching. I updated it to use DataFrames because the old code would no longer run (it relied on deprecated RDD functions); a simplified sketch is in the P.S. below. It used to run very well and use all of the allocated memory/cores, but it doesn't anymore and I can't figure out why.

My cluster has 4 workers, each with 14 available vCores and 40 GB of RAM. I run the job with the following properties:

    --master yarn-client --num-executors 12 --executor-cores 4 --executor-memory 12G

That should give me 3 executors per worker. However, I always end up with ~3 executors in total, and 2 tasks that fail with "OutOfMemoryError: GC overhead limit exceeded", restart, and then fail again because the output directory already exists.

Any ideas?

Thanks!
César.
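
P.S. For reference, here is a simplified sketch of what the job does. Paths and column names are placeholders, and the real matching logic is fuzzier than the plain equality shown here:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("string-matching"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Small side: ~2K lines; large side: ~500K lines.
    val small = sc.textFile("hdfs:///path/to/small").map(Tuple1(_)).toDF("needle")
    val large = sc.textFile("hdfs:///path/to/large").map(Tuple1(_)).toDF("haystack")

    // join() with no join expression is a cartesian product in 1.3.x,
    // i.e. ~1 billion candidate pairs before any filtering.
    val matched = small.join(large).filter($"needle" === $"haystack")

    matched.rdd.map(_.toSeq.mkString("\t")).saveAsTextFile("hdfs:///path/to/output")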
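
P.P.S. To keep the automatic retries from tripping over the output directory left behind by a failed attempt, one workaround I'm considering is deleting it up front with the Hadoop FileSystem API (sketch, same placeholder path as above):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Remove any leftover output from a previous (failed) run
    // before writing, so a rerun doesn't die on "directory exists".
    val out = new Path("hdfs:///path/to/output")
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(out)) fs.delete(out, true) // recursive delete

Is that reasonable, or is there a cleaner way?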