Re: BlockManager issues

Christoph Sawade Mon, 22 Sep 2014 03:10:27 -0700

Hey all. We had the same problem that Nishkam described, in almost the
same big-data setting. We fixed the fetch failures by increasing the ack
timeout in the driver:


// 10-minute timeout for acks between nodes
set("spark.core.connection.ack.wait.timeout", "600")
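
For anyone who wants to try this, here is a minimal sketch of where the
setting might go when the driver builds its SparkConf. The object and app
names are hypothetical placeholders, and the property is the Spark 1.x one
named above, so please check the configuration docs for your version. The
master URL is assumed to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AckTimeoutExample { // hypothetical name, for illustration only
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("AckTimeoutExample") // hypothetical app name
      // 600 = 10 minutes to wait for acks between nodes
      .set("spark.core.connection.ack.wait.timeout", "600")

    val sc = new SparkContext(conf)
    try {
      // ... run the shuffle-heavy job here ...
    } finally {
      sc.stop() // always release cluster resources
    }
  }
}
```

The setting has to be in place before the SparkContext is created; changing
the conf afterwards has no effect on a running context.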

Cheers, Christoph

2014-09-22 9:24 GMT+02:00 Hortonworks <[email protected]>:

> Actually, I hit a similar issue when doing groupByKey followed by count
> when the shuffle size is big, e.g. 1 TB.
>
> Thanks.
>
> Zhan Zhang
>
> Sent from my iPhone
>
> > On Sep 21, 2014, at 10:56 PM, Nishkam Ravi <[email protected]> wrote:
> >
> > Thanks for the quick follow-up, Reynold and Patrick. Tried a run with a
> > significantly higher ulimit; it doesn't seem to help. The executors have
> > 35GB each. Btw, with a recent version of the branch, the error message is
> > "fetch failures" as opposed to "too many open files". Not sure if they are
> > related. Please note that the workload runs fine with head set to 066765d.
> > In case you want to reproduce the problem: I'm running a slightly modified
> > ScalaPageRank (with KryoSerializer and persistence level
> > memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
> >
> > Thanks,
> > Nishkam
> >
> > On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell <[email protected]>
> > wrote:
> >
> >> Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
> >> that you are just having more spilling as a result of the patch and so
> >> the filesystem is opening more files. I would try increasing the
> >> ulimit.
> >>
> >> How much memory do your executors have?
> >>
> >> - Patrick
> >>
> >> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell <[email protected]>
> >> wrote:
> >>> Hey, the numbers you mentioned don't quite line up - did you mean PR 2711?
> >>>
> >>> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin <[email protected]>
> >> wrote:
> >>>> It seems like you just need to raise the ulimit?
> >>>>
> >>>>
> >>>> On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi <[email protected]>
> >> wrote:
> >>>>
> >>>>> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the
> >>>>> workloads. Tried tracing the problem through change set analysis. Looks
> >>>>> like the offending commit is 4fde28c from Aug 4th for PR1707. Please see
> >>>>> SPARK-3633 for more details.
> >>>>>
> >>>>> Thanks,
> >>>>> Nishkam
> >>
>
