Hello,

In one of my use cases I need to process a list of folders in parallel. I used sc.parallelize(list, list.size).map(<logic to process the folder>). I have a six-node cluster and there are six folders to process, so ideally I expect each node to process one folder. Instead, I see that one node processes multiple folders while one or two nodes get no tasks at all. In the end, spark-submit crashes with the exception "Remote RPC client disassociated". Can someone give me a hint on what's going wrong here?

Please note that this issue does not arise if I comment out my folder-processing logic and simply print the folder name; in that case every node gets one folder to process. I also tried inserting a sleep of 40 seconds inside the map: still no issue. But when I uncomment my logic, the problem returns. Also, before crashing it does process some of the folders successfully. "Successfully" means the business logic generates its output file in a shared file system.
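For context, my understanding is that with numSlices equal to list.size, parallelize should place exactly one folder in each partition. Here is a quick local sketch of that slicing behaviour (the folder names and the sliceSeq helper are my own, just mirroring the index arithmetic in Spark's ParallelCollectionRDD.slice; this is not my actual job code):

```scala
// Placeholder folder names standing in for my real input list.
val folders = Seq("f1", "f2", "f3", "f4", "f5", "f6")

// Mirrors ParallelCollectionRDD.slice: partition i covers the index range
// [i * length / numSlices, (i + 1) * length / numSlices).
def sliceSeq[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = ((i.toLong * seq.length) / numSlices).toInt
    val end   = (((i + 1).toLong * seq.length) / numSlices).toInt
    seq.slice(start, end)
  }

val parts = sliceSeq(folders, folders.size)
println(parts) // with numSlices == folders.size, each partition holds one folder
assert(parts.forall(_.size == 1))
```

So the partitioning itself should be one folder per partition; what I don't understand is why the tasks are not spread one per node, and why the executors drop off only when the real processing logic runs.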
Regards,
Bala