Yeah, this is a good suggestion; also check the 25th percentile, median, and
75th percentile to see how skewed the input data is.
If you find that the RDD's partitions are skewed, you can solve it either by
changing the partitioner when you read the files, as already suggested, or by
calling repartition().
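To make the percentile check above concrete, here is a minimal sketch in plain Python. The list of per-partition record counts is made up for illustration; in PySpark you would typically obtain it with something like rdd.glom().map(len).collect() (an assumption on my part, not code from this thread):

```python
import statistics

def partition_skew_summary(partition_counts):
    """Summarize per-partition record counts to expose skew.

    The quartiles come from statistics.quantiles (n=4 gives the
    25th/50th/75th percentiles suggested above).
    """
    q1, median, q3 = statistics.quantiles(partition_counts, n=4)
    return {
        "min": min(partition_counts),
        "p25": q1,
        "median": median,
        "p75": q3,
        "max": max(partition_counts),
    }

# A heavily skewed example: one partition holds most of the records.
counts = [100, 110, 95, 105, 9000]
summary = partition_skew_summary(counts)
print(summary)
```

If the max sits far above the 75th percentile, a handful of partitions dominate the stage, which matches the symptom of a few long-running tasks holding everything up.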
Hi,
Can you check whether the RDD is partitioned correctly, with the correct
number of partitions (if you are setting the partition count manually)? Try
using a HashPartitioner while reading the files.
One way you can debug this is by checking the number of records one executor
has compared to the others in the Spark UI.
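As a rough illustration of the HashPartitioner suggestion above, here is a plain-Python sketch (Python's hash() stands in for the key's hashCode, and the key distributions are made up). It also shows the limitation: hash partitioning balances distinct keys but cannot split up a single hot key.

```python
from collections import Counter

def hash_partition(keys, num_partitions):
    """Assign each key to a partition the way a hash partitioner does:
    partition = hash(key) mod num_partitions."""
    return Counter(hash(k) % num_partitions for k in keys)

# Distinct keys spread evenly across the 8 partitions...
even = hash_partition(range(10_000), 8)

# ...but 10,000 copies of one hot key all land in a single partition,
# so hash partitioning alone cannot fix skew caused by a dominant key.
hot = hash_partition([42] * 10_000, 8)

print(sorted(even.values()))
print(dict(hot))
```

In the skewed case you would see exactly the symptom described below: one task processing the hot partition long after the others finish.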
Hi,
I have a cluster with 15 nodes, of which 5 are HDFS nodes. I kick off a job
that creates some 120 stages. Eventually, the active and pending stages
reduce down to a small bottleneck, and it never fails: the 10 (or so)
remaining running tasks are always allocated to the same