Thanks for the quick follow-up, Reynold and Patrick. I tried a run with a
significantly higher ulimit; it doesn't seem to help. The executors have
35GB each. Btw, with a recent version of the branch, the error message is
"fetch failures" as opposed to "too many open files". Not sure if the two
are related. Please note that the workload runs fine with head set to
066765d. In case you want to reproduce the problem: I'm running a slightly
modified ScalaPageRank (with KryoSerializer and persistence level
MEMORY_AND_DISK_SER) on a 30GB input dataset on a 6-node cluster.
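
For reference, here is a minimal sketch of roughly what the modified example
looks like (app name, paths and iteration count are placeholders, and the
actual modification differs slightly). The two relevant changes relative to
the stock SparkPageRank example are the Kryo setting and the persist() call:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PageRankRepro {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("PageRankRepro")
      // serializer used in the failing run
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // ~30GB edge list, one "srcId dstId" pair per line (placeholder path)
    val links = sc.textFile(args(0)).map { line =>
      val parts = line.split("\\s+")
      (parts(0), parts(1))
    }.distinct().groupByKey()
      // persistence level used in the failing run
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    var ranks = links.mapValues(_ => 1.0)
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (urls, rank) => urls.map(url => (url, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(args(1))
    sc.stop()
  }
}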

Thanks,
Nishkam

On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell <pwend...@gmail.com>
wrote:

> Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
> that you are just having more spilling as a result of the patch and so
> the filesystem is opening more files. I would try increasing the
> ulimit.
>
> How much memory do your executors have?
>
> - Patrick
>
> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
> > Hey, the numbers you mentioned don't quite line up - did you mean PR 2711?
> >
> > On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >> It seems like you just need to raise the ulimit?
> >>
> >>
> >> On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi <nr...@cloudera.com>
> wrote:
> >>
> >>> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of
> the
> >>> workloads. Tried tracing the problem through change set analysis. Looks
> >>> like the offending commit is 4fde28c from Aug 4th for PR1707. Please
> see
> >>> SPARK-3633 for more details.
> >>>
> >>> Thanks,
> >>> Nishkam
> >>>
>
