Here is the code where NewHadoopRDD registers a close handler that is invoked when the task completes ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L136 ).
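The pattern behind that line can be sketched without Spark itself. This is a hedged, self-contained simplification, not the actual Spark source: `TaskContextSketch` stands in for Spark's `TaskContext`, and the boolean flag stands in for closing a Hadoop `RecordReader` and releasing its file handle.

```scala
import scala.collection.mutable.ArrayBuffer

// Minimal stand-in for Spark's TaskContext (hypothetical simplification):
// callbacks registered here run when the task completes.
class TaskContextSketch {
  private val listeners = ArrayBuffer.empty[() => Unit]
  def addTaskCompletionListener(f: () => Unit): Unit = listeners += f
  def markTaskCompleted(): Unit = listeners.foreach(f => f())
}

object CloseHandlerDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new TaskContextSketch
    var readerClosed = false
    // NewHadoopRDD does the equivalent of this when it opens its RecordReader:
    // register a callback so the reader's file handle is released when the
    // task finishes -- not when the whole job ends.
    ctx.addTaskCompletionListener(() => readerClosed = true)
    // ... the task iterates over the partition's records here ...
    ctx.markTaskCompleted() // Spark fires the listeners at task completion
    println(s"readerClosed = $readerClosed")
  }
}
```

The point is that the handle's lifetime is tied to the task, so handles should be freed as each task completes rather than accumulating until the job finishes.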
From my understanding, the reason may be that the `foreach` in your implementation does not execute the Spark jobs one by one in a loop as expected; instead, all the jobs are submitted to the DAGScheduler simultaneously. Since each job has no dependency on the others, Spark's scheduler will unroll the loop and submit the jobs in parallel, so several map stages may be running and pending at once, and this exhausts your node's file handles. You could check the Spark web UI to see whether several map stages are running simultaneously, or whether some are running while others are pending.

Thanks
Jerry

On Wed, Sep 2, 2015 at 9:09 PM, Sigurd Knippenberg <sig...@knippenberg.com> wrote:
> Yep. I know. It was set to 32K when I ran this test. If I bump it to 64K
> the issue goes away. It still doesn't make sense to me that the Spark job
> doesn't release its file handles until the end of the job instead of doing
> that while my loop iterates.
>
> Sigurd
>
> On Wed, Sep 2, 2015 at 4:33 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> On 31 Aug 2015, at 19:49, Sigurd Knippenberg <sig...@knippenberg.com>
>> wrote:
>>
>> I know I can adjust the max open files allowed by the OS but I'd rather
>> fix the underlying issue.
>>
>> Bumping up the OS handle limits is step #1 of installing a hadoop cluster.
>>
>> https://wiki.apache.org/hadoop/TooManyOpenFiles
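The sequential-vs-concurrent submission distinction Jerry describes can be illustrated with plain Scala. This is a hedged sketch: `runJob` is a hypothetical stand-in for a blocking Spark action such as `rdd.count()`, not anything from the original thread.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object JobSubmissionSketch {
  // Stand-in for a blocking Spark action such as rdd.count();
  // in a real job, each of these holds open file handles while it runs.
  def runJob(i: Int): Unit = println(s"job $i done")

  def main(args: Array[String]): Unit = {
    // Sequential: each job completes (and its tasks release their
    // file handles) before the next one is submitted.
    (1 to 3).foreach(runJob)

    // Concurrent: all jobs are submitted at once, so several map
    // stages -- and their open file handles -- can pile up together.
    val all = Future.traverse(1 to 3)(i => Future(runJob(i)))
    Await.result(all, Duration.Inf)
  }
}
```

If the loop body in the original code path ends up in the concurrent shape (e.g. via futures or a parallel collection), the open-handle count grows with the number of in-flight jobs, which matches the symptom described above.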