Hi, I'm running Spark on YARN from an edge node, and the tasks run on the Data Nodes. My job fails with a "Too many open files" error once it gets to groupByKey(). Alternatively, I can make it fail immediately by repartitioning the data when I create the RDD.
Where do I need to make sure that ulimit -n is high enough? On the edge node it is small (1024), but on the data nodes the "yarn" user has a high limit (32k). Is the yarn user the relevant user here? And is my own 1024 limit on the edge node a problem, or is that limit not relevant?

Arun
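In case it helps narrow this down: the limit is per process, and it can be read with Python's standard `resource` module. The PySpark part in the comment below is only a hypothetical sketch (assuming a `SparkContext` named `sc`), but run inside a task it would report the limit the executor containers on the data nodes actually inherited:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors --
# the same number "ulimit -n" reports for the shell that launched us.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Hypothetical PySpark sketch: the same call, executed inside tasks,
# shows the limit inherited by the YARN containers on each data node:
#   sc.parallelize(range(100), 100) \
#     .map(lambda _: resource.getrlimit(resource.RLIMIT_NOFILE)) \
#     .distinct().collect()
```

Comparing the driver-side number with the task-side numbers would show which of the two ulimits (edge node vs. yarn user on the data nodes) the shuffle is actually running under.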