Did a bit more experimentation: whether it works or not seems to vary depending on the network my laptop is connected to. It works at my home, but it doesn't work at my girlfriend's apartment! Also whether or not I'm connected to the company's VPN seems to make a difference.
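A quick way to compare networks, on the assumption that name resolution is what differs between them, is to print how the laptop's own hostname resolves on each one. The sketch below uses only Python's standard `socket` module. The function name `describe_local_resolution` is made up for illustration, and Python's view of the resolver is only an approximation of what the JVM running YARN sees.

```python
import socket


def describe_local_resolution(hostname=None):
    """Return how the given (or the local) hostname resolves right now."""
    name = hostname or socket.gethostname()
    info = {"hostname": name, "fqdn": socket.getfqdn(name)}
    try:
        # Forward lookup: the step that can fail when no usable DNS is reachable.
        info["address"] = socket.gethostbyname(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        info["address"] = None
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    for key, value in describe_local_resolution().items():
        print(f"{key}: {value}")
```

Running it once on each network (and with the VPN on and off) and comparing the output should show whether the hostname, FQDN, or address changes between them.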
It might be due to DNS: looks like YARN does some lookups to determine the current machine's FQDN. That's probably very useful in a datacenter, but the results are somewhat undefined when using a laptop on a wifi connection of dubious quality.

So far I'm having success with the following config:

1. A change to yarn-site.xml, telling it to always look for the RM on localhost:
   https://github.com/linkedin/hello-samza/pull/20

2. echo "127.0.0.1 `hostname`" >> /etc/hosts
   (otherwise the RM refuses to start up if it can't reach a DNS server to resolve the hostname)

Cheers,
Martin

On 20 Feb 2014, at 00:34, Martin Kleppmann <[email protected]> wrote:

> Yeah, I checked -- no old YARN processes running. ZK and Kafka are the only
> other two Java processes running on my machine.
>
> Martin
>
> On 20 Feb 2014, at 00:20, Chris Riccomini <[email protected]> wrote:
>> Hey Martin,
>>
>> Have you checked if you've leaked a NM process?
>>
>> I've seen cases in the past where an NM wasn't properly shutdown, and the
>> pid was over-written. Could be that.
>>
>> Cheers,
>> Chris
>>
>> On 2/19/14 4:18 PM, "Martin Kleppmann" <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm suddenly having problems with YARN as set up by hello-samza. It was
>>> working fine earlier today and I don't recall changing anything in my
>>> setup -- so I just wanted to check if anyone has seen this before.
>>>
>>> The YARN resourcemanager seems to start up fine (at least the web UI
>>> works, and nothing strange-looking in the log). But when the nodemanager
>>> starts, I see a lot of this in its logs:
>>>
>>> 14/02/20 00:00:04 INFO ipc.Client: Retrying connect to server:
>>> 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); maxRetries=45
>>> 14/02/20 00:00:08 INFO ipc.Client: Retrying connect to server:
>>> 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>> 14/02/20 00:00:09 INFO ipc.Client: Retrying connect to server:
>>> 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>> 14/02/20 00:00:11 INFO ipc.Client: Retrying connect to server:
>>> 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>> 14/02/20 00:00:12 INFO ipc.Client: Retrying connect to server:
>>> 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>
>>> ...etc repeating every few seconds, and never connecting. But the RM is
>>> listening on localhost:8031 (verified with netcat).
>>>
>>> run-job.sh similarly sits there, writing a similar message to
>>> hello-samza/deploy/samza/undefined-samza-container-name.log every few
>>> seconds (but with port 8032 instead of 8031).
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>> Martin
>>>
>>
>
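P.S. One caveat with the /etc/hosts step in my message at the top: a plain `echo ... >>` adds a duplicate line every time it is run. A minimal sketch of an idempotent variant follows. The helper name `ensure_hosts_entry` and its parameters are hypothetical, and it assumes the usual hosts-file layout of an address followed by one or more names, with `#` starting a comment.

```python
import socket


def ensure_hosts_entry(hosts_path, hostname=None, address="127.0.0.1"):
    """Append 'address hostname' to hosts_path unless hostname is already mapped.

    Returns True if a line was appended, False if the file was left unchanged.
    """
    name = hostname or socket.gethostname()
    with open(hosts_path, "r", encoding="utf-8") as fh:
        content = fh.read()
    for line in content.splitlines():
        # Drop trailing comments, then split into address + names.
        fields = line.split("#", 1)[0].split()
        if name in fields[1:]:
            return False
    with open(hosts_path, "a", encoding="utf-8") as fh:
        # Make sure the new entry starts on its own line.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(f"{address} {name}\n")
    return True
```

Writing to the real /etc/hosts needs root, so it is worth trying this against a copy of the file first, e.g. `ensure_hosts_entry("/tmp/hosts.copy")`.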
