Re: Master (1.1-SNAPSHOT) Can't run on YARN

2016-04-21 Thread Maximilian Michels
Hi Stefano,

Thanks for reporting. I wasn't able to reproduce the problem. I ran
./bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048 on a YARN cluster
and it created a Flink cluster with a JobManager and a TaskManager
with two task slots. By the way, if you omit the "-s 2" flag, then the
default is read from the config, which is one task slot.

Could it be that an old TaskManager instance is trying to register
with a new JobManager? The log messages suggest so, because
the ResourceManager (which creates TaskManagers) is not aware of it.
It is still unclear why that instance would be lingering around. Could you
kill the instance and then bring up a cluster several times
to see if that solves the problem? If not, could you send the full
logs to my email address?

Thanks,
Max
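
One way to check the "old TaskManager" hypothesis is to compare the application that owns the refused container with the application of the session that is currently running. The sketch below is only an illustration: it assumes the usual YARN container ID layout (container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<container>), and the helper names application_id_of and is_stale are hypothetical, not part of Flink or Hadoop.

```python
import re

# Assumed layout of a YARN container ID (an assumption, not taken from the thread):
#   container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<container>
CONTAINER_ID = re.compile(
    r"container_(?:e\d+_)?(?P<cluster_ts>\d+)_(?P<app_seq>\d+)_\d+_\d+"
)


def application_id_of(container_id: str) -> str:
    """Return the YARN application ID that a container ID belongs to."""
    match = CONTAINER_ID.fullmatch(container_id)
    if match is None:
        raise ValueError(f"unrecognized container ID: {container_id!r}")
    return f"application_{match['cluster_ts']}_{match['app_seq']}"


def is_stale(container_id: str, current_application_id: str) -> bool:
    """True if the container was started by a different application."""
    return application_id_of(container_id) != current_application_id


# Container ID copied from the refusal message in the original report.
refused = "container_e02_1461077293721_0016_01_02"
print(application_id_of(refused))
```

If the derived application ID differs from the ID of the session that was just started, the registering TaskManager most likely belongs to an earlier session.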


Re: Master (1.1-SNAPSHOT) Can't run on YARN

2016-04-20 Thread Ufuk Celebi
The user list is OK since you are reporting a bug here ;-) I'm
confident that this will be fixed soon! :-)


Re: Master (1.1-SNAPSHOT) Can't run on YARN

2016-04-20 Thread Stefano Baghino
Not exactly, I just wanted to let you know about it and find out whether anyone
else has experienced this issue; perhaps it's more of a dev mailing list
discussion, sorry for posting this here. Feel free to continue the
discussion on the other list if you feel it's more appropriate.

-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit


Re: Master (1.1-SNAPSHOT) Can't run on YARN

2016-04-19 Thread Ufuk Celebi
Hey Stefano,

Flink's resource management was recently refactored for 1.1. Your issue
could be a regression introduced by that refactoring. Max can probably help you
with more details. Is this currently a blocker for you?

– Ufuk

On Tue, Apr 19, 2016 at 6:31 PM, Stefano Baghino wrote:
> Hi everyone,
>
> I'm currently experiencing a weird situation and hope you can help me out
> with this.
>
> I've cloned and built from master, then edited the default config
> file by adding my Hadoop config path, exported the HADOOP_CONF_DIR env var
> and ran bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048
>
> The first thing I noticed is that I had to pass "-s 2"; otherwise the task
> managers get created with -1 slots (!) by default.
>
> After adding "-s 2", the YARN session startup hangs when trying to register
> the task managers. I stopped the session, aggregated the logs, and found
> a lot (several thousand) of the messages attached at the bottom. Any idea
> what this may be?
>
> Thank you a lot in advance!
>
> 2016-04-19 12:15:59,507 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1, timeout:
> 500 milliseconds)
>
> 2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_02'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:00,025 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 2, timeout:
> 1000 milliseconds)
>
> 2016-04-19 12:16:00,033 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_02'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:01,045 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 3, timeout:
> 2000 milliseconds)
>
> 2016-04-19 12:16:01,053 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_02'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:03,064 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 4, timeout:
> 4000 milliseconds)
>
> 2016-04-19 12:16:03,072 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_02'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:07,085 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 5, timeout:
> 8000 milliseconds)
>
> 2016-04-19 12:16:07,092 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_02'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:09,664 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1, timeout:
> 500 milliseconds)
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
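
With several thousand of these messages in the aggregated logs, a short script can condense them into the retry schedule and the set of refused resource IDs. The following is a minimal sketch, assuming each log entry sits on a single line in the aggregated log file (the entries above are wrapped by the mail client); the function names summarize and timeouts_double are hypothetical.

```python
import re
from collections import Counter

# Patterns for the two message shapes quoted in this thread.
ATTEMPT = re.compile(r"\(attempt (\d+), timeout: (\d+) milliseconds\)")
REFUSED = re.compile(r"resourceId='([^']+)'\} not registered with resource manager")


def summarize(lines):
    """Collect (attempt, timeout_ms) pairs and count refusals per resource ID."""
    attempts = []
    refused = Counter()
    for line in lines:
        attempt = ATTEMPT.search(line)
        if attempt:
            attempts.append((int(attempt.group(1)), int(attempt.group(2))))
            continue
        refusal = REFUSED.search(line)
        if refusal:
            refused[refusal.group(1)] += 1
    return attempts, refused


def timeouts_double(attempts):
    """True if the timeout doubles between consecutive attempts of one retry cycle."""
    return all(
        later_timeout == 2 * earlier_timeout
        for (earlier, earlier_timeout), (later, later_timeout)
        in zip(attempts, attempts[1:])
        if later == earlier + 1  # skip the point where the counter resets to 1
    )


# Sample entries rebuilt from the wrapped log lines above (single-line form assumed).
sample = [
    "2016-04-19 12:15:59,507 INFO  org.apache.flink.yarn.YarnTaskManager - "
    "Trying to register at JobManager "
    "akka.tcp://flink@172.31.20.101:57379/user/jobmanager "
    "(attempt 1, timeout: 500 milliseconds)",
    "2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager - "
    "The registration at JobManager "
    "Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, "
    "because: java.lang.IllegalStateException: Resource "
    "ResourceID{resourceId='container_e02_1461077293721_0016_01_02'} not "
    "registered with resource manager.. Retrying later...",
    "2016-04-19 12:16:00,025 INFO  org.apache.flink.yarn.YarnTaskManager - "
    "Trying to register at JobManager "
    "akka.tcp://flink@172.31.20.101:57379/user/jobmanager "
    "(attempt 2, timeout: 1000 milliseconds)",
]

attempts, refused = summarize(sample)
print(attempts)
print(dict(refused))
print(timeouts_double(attempts))
```

A single resource ID accounting for all refusals would point to one TaskManager retrying in a loop rather than to many containers failing independently.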