Hey Martin,

Mac OS and Linux. I only tested the changes on my Mac box, which is the
one that didn't work with the yarn-site.xml change. I didn't test on the
Linux (RHEL) box.

Cheers,
Chris

On 2/21/14 2:02 PM, "Martin Kleppmann" <[email protected]> wrote:

>Thanks Chris. I'll keep digging into it. It's a really strange issue...
>on my machine, run-job.sh still works fine (with the 127.0.0.1 setting
>and without the YARN_HOME env var). I'll check out the YARN source and
>see if that gives any clues.
>
>I'm running on Mac OS -- you too?
>
>Martin
>
>On 21 Feb 2014, at 19:26, Chris Riccomini <[email protected]> wrote:
>> Hey Martin,
>> 
>> Yeah, that's somewhat alarming. I merged your commit, but since that
>> commit I'm unable to get run-job.sh working. I opened a JIRA for this:
>> 
>>  https://issues.apache.org/jira/browse/SAMZA-154
>> 
>> 
>> I'm going to revert the commit for now, until we understand this better.
>> 
>> Cheers,
>> Chris
>> 
>> On 2/21/14 7:11 AM, "Martin Kleppmann" <[email protected]> wrote:
>> 
>>> Did a bit more experimentation: whether it works or not seems to vary
>>> depending on the network my laptop is connected to. It works at my home,
>>> but it doesn't work at my girlfriend's apartment! Also whether or not I'm
>>> connected to the company's VPN seems to make a difference.
>>> 
>>> It might be due to DNS: looks like YARN does some lookups to determine
>>> the current machine's FQDN. That's probably very useful in a datacenter,
>>> but the results are somewhat undefined when using a laptop on a wifi
>>> connection of dubious quality.
>>> 
>>> So far I'm having success with the following config:
>>> 
>>> 1. A change to yarn-site.xml, telling it to always look for the RM on
>>> localhost: https://github.com/linkedin/hello-samza/pull/20
>>> 
>>> 2. echo "127.0.0.1 `hostname`" >> /etc/hosts  (otherwise the RM refuses
>>> to start up if it can't reach a DNS server to resolve the hostname)
>>> 
>>> Cheers,
>>> Martin
>>> 
>>> On 20 Feb 2014, at 00:34, Martin Kleppmann <[email protected]>
>>> wrote:
>>>> Yeah, I checked -- no old YARN processes running. ZK and Kafka are the
>>>> only other two Java processes running on my machine.
>>>> 
>>>> Martin
>>>> 
>>>> On 20 Feb 2014, at 00:20, Chris Riccomini <[email protected]>
>>>> wrote:
>>>>> Hey Martin,
>>>>> 
>>>>> Have you checked if you've leaked a NM process?
>>>>> 
>>>>> I've seen cases in the past where an NM wasn't properly shut down, and
>>>>> the pid was overwritten. Could be that.
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> On 2/19/14 4:18 PM, "Martin Kleppmann" <[email protected]> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm suddenly having problems with YARN as set up by hello-samza. It was
>>>>>> working fine earlier today and I don't recall changing anything in my
>>>>>> setup -- so I just wanted to check if anyone has seen this before.
>>>>>> 
>>>>>> The YARN resourcemanager seems to start up fine (at least the web UI
>>>>>> works, and nothing strange-looking in the log). But when the nodemanager
>>>>>> starts, I see a lot of this in its logs:
>>>>>> 
>>>>>> 14/02/20 00:00:04 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); maxRetries=45
>>>>>> 14/02/20 00:00:08 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>>>> 14/02/20 00:00:09 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>>>> 14/02/20 00:00:11 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>>>> 14/02/20 00:00:12 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>>>> 
>>>>>> ...etc repeating every few seconds, and never connecting. But the RM is
>>>>>> listening on localhost:8031 (verified with netcat).
>>>>>> 
>>>>>> run-job.sh similarly sits there, writing a similar message to
>>>>>> hello-samza/deploy/samza/undefined-samza-container-name.log every few
>>>>>> seconds (but with port 8032 instead of 8031).
>>>>>> 
>>>>>> Any ideas?
>>>>>> 
>>>>>> Thanks,
>>>>>> Martin
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>
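
A minimal sketch, not from the thread itself, for checking the two things discussed above: what the local hostname resolves to (the suspected DNS/FQDN issue), and whether anything accepts connections on the RM ports 8031 and 8032 on 127.0.0.1 (what the netcat check did). The function names are hypothetical, and it only uses Python's standard socket module; it does not talk to YARN.

```python
import socket


def describe_local_hostname():
    """Return the hostname, its FQDN, and the address the hostname resolves to.

    "address" is None when the hostname cannot be resolved, which is the
    situation the /etc/hosts entry in the thread is meant to avoid.
    """
    name = socket.gethostname()
    try:
        address = socket.gethostbyname(name)
    except OSError:  # includes socket.gaierror (resolution failure)
        address = None
    return {"hostname": name, "fqdn": socket.getfqdn(), "address": address}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


if __name__ == "__main__":
    print(describe_local_hostname())
    # 8031 and 8032 are the ports that appear in the logs quoted above.
    for port in (8031, 8032):
        print("127.0.0.1:%d open: %s" % (port, port_is_open("127.0.0.1", port)))
```

Running it on different networks (home, VPN, and so on) and comparing the output might show whether the hostname resolution changes while the ports stay reachable.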
