Sorry, to clarify: Spark *does* effectively turn Akka's failure detector off.
On Tue, May 27, 2014 at 10:47 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Spark should effectively turn Akka's failure detector off, because we
> historically had problems with GCs and other issues causing
> disassociations. The only thing that should cause these messages nowadays
> is if the TCP connection (which Akka sustains between Actor Systems on
> different machines) actually drops. TCP connections are pretty resilient,
> so one common cause of this is actual Executor failure -- recently, I
> experienced a similar-sounding problem due to my machine's OOM killer
> terminating my Executors, such that they didn't produce any error output.
>
>
> On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
>
>> Hi all,
>>
>> On an ARM cluster, I have been testing a wordcount program with JRE 7
>> and everything is OK. But when changing to the embedded version of
>> Java SE (Oracle's eJRE), the same program cannot complete all of its
>> computing stages.
>>
>> It fails with many Akka disassociations.
>>
>> - I've been trying to increase Akka's timeout but am still stuck. I am
>> not sure what the right way to do so is. (I suspect stop-the-world GC
>> pauses are causing this.)
>>
>> - Another question: how can I properly turn on Akka's logging to find
>> the root cause of this disassociation problem (in case my guess about
>> GC is wrong)?
>>
>> Best regards,
>>
>> -chanwit
>>
>> --
>> Chanwit Kaewkasi
>> linkedin.com/in/chanwit
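
For readers landing on this thread later: the settings being discussed are Spark's spark.akka.* configuration properties from the Spark 1.0-era docs (they were removed in later releases once Spark dropped Akka for RPC, so check the docs for your version). A rough sketch of what raising the timeouts and enabling remoting lifecycle logs looked like in conf/spark-defaults.conf; the values are illustrative, not recommendations:

    # conf/spark-defaults.conf (Spark 1.0-era; property names are
    # version-specific assumptions -- verify against your release)

    # Communication timeout between Spark nodes, in seconds
    spark.akka.timeout              300

    # Failure-detector tuning; a very large "pauses" value is what
    # effectively disables the detector, per Aaron's note above
    spark.akka.heartbeat.pauses     6000
    spark.akka.heartbeat.interval   1000

    # Log Akka remoting lifecycle events (associations/disassociations)
    spark.akka.logLifecycleEvents   true

The disassociation messages themselves come through the normal log4j output, so raising the log level for the remoting-related loggers in conf/log4j.properties is the other half of making the root cause visible.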