Hi,

It seems that some of the ports are not open. As I pointed out in my first 
mail, please go through all of the config options regarding hostnames/ports 
(not only those that appear in the log files; something may not be logged):
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#jobmanager-amp-taskmanager

jobmanager.rpc.port
taskmanager.rpc.port
taskmanager.data.port
blob.server.port 

And double check that they are accessible from the appropriate machines, 
ideally using an external tool like telnet or ncat. Your network can be 
configured to accept some connections only from specific hosts (like 
localhost). For example, in the case for which you attached the log files, did 
you check that the job manager host can open a connection to 
`stage_dbq_1:33633` (the task manager host and its RPC port; the RPC port is 
random by default)?
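
If telnet or ncat is not installed on the machine, the same check can be 
scripted. Below is a minimal sketch in Python; the function name `can_connect` 
and the 3 second timeout are my own assumptions, not anything Flink provides. 
It only tells you whether a TCP connection can be opened from the machine 
where you run it.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; run it on the machine that
    needs to reach the given host and port.
    """
    try:
        # create_connection resolves the host name and connects over TCP.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For example, running `can_connect("stage_dbq_1", 33633)` on the job manager 
host should return True if the task manager's RPC port is reachable from 
there. A False result does not say why the connection failed (firewall, wrong 
interface, unresolvable hostname), so you still need to look into that 
separately.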

Also make sure that the configurations on the task manager and job manager are 
consistent.
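
One way to compare the two configurations is sketched below in Python. It 
assumes that flink-conf.yaml contains only flat `key: value` lines with `#` 
comments, and the list of keys is just the ones relevant to this thread; the 
function names are hypothetical. Treat it as a rough aid, not as a 
reimplementation of how Flink reads its configuration.

```python
# Keys discussed in this thread, plus the job manager address (assumed list).
KEYS = (
    "jobmanager.rpc.address",
    "jobmanager.rpc.port",
    "taskmanager.rpc.port",
    "taskmanager.data.port",
    "blob.server.port",
)


def parse_flink_conf(text):
    """Parse flat 'key: value' lines, ignoring blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def diff_conf(jm_text, tm_text, keys=KEYS):
    """Return {key: (job manager value, task manager value)} where they differ."""
    jm = parse_flink_conf(jm_text)
    tm = parse_flink_conf(tm_text)
    return {k: (jm.get(k), tm.get(k)) for k in keys if jm.get(k) != tm.get(k)}
```

Pass in the contents of the flink-conf.yaml from each host. An empty result 
means the listed keys agree; a value of None means the key is not set in that 
file, so the default applies.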

Piotrek

> On 18 Jan 2018, at 08:41, Reza Samee <reza.sa...@gmail.com> wrote:
> 
> Hi, 
> 
> I attached the log files.
> 
> Thanks
> 
> On Mon, Jan 15, 2018 at 3:36 PM, Piotr Nowojski <pi...@data-artisans.com> wrote:
> Hi,
> 
> Could you post full job manager and task manager logs from startup until the 
> first signs of the problem?
> 
> Thanks, Piotrek
> 
> 
>> On 15 Jan 2018, at 11:21, Reza Samee <reza.sa...@gmail.com> wrote:
>> 
>> Thanks for the response, and sorry for the delay.
>> 
>> The ports logged by the JobManager & TaskManager are open!
>> 
>> 
>> Is this log OK?
>> 2018-01-15 13:40:03,455 INFO  
>> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader 
>> reachable under akka.tcp://flink@172.16.20.18:6123/user/jobmanager:null.
>> 
>> When I kill the task manager, the job manager logs:
>> 2018-01-15 13:32:41,419 WARN  akka.remote.ReliableDeliverySupervisor         
>>                - Association with remote system 
>> [akka.tcp://flink@stage_dbq_1:45532] has failed, address is now gated for 
>> [5000] ms. Reason: [Disassociated] 
>> 
>> But it does not decrement the number of available task managers!
>> And when I start my single task manager again, it logs:
>> 
>> 2018-01-15 13:32:52,753 INFO  
>> org.apache.flink.runtime.instance.InstanceManager             - Registered 
>> TaskManager at ??? (akka://flink/deadLetters) as 
>> 626846ae27a833cb094eeeb047a6a72c. Current number of registered hosts is 2. 
>> Current number of alive task slots is 40.
>> 
>> 
>> On Wed, Jan 10, 2018 at 11:36 AM, Piotr Nowojski <pi...@data-artisans.com> wrote:
>> Hi,
>> 
>> Search both the job manager and task manager logs for the IP address(es) and 
>> port(s) that have timed out. First of all, make sure that the nodes are 
>> visible to each other using a simple ping. Afterwards, please check that the 
>> ports that timed out are open and not blocked by a firewall (telnet).
>> 
>> You can search the documentation for the configuration parameters with 
>> “port” in the name:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html
>> But note that many of them are random by default.
>> 
>> Piotrek
>> 
>>> On 9 Jan 2018, at 17:56, Reza Samee <reza.sa...@gmail.com> wrote:
>>> 
>>> 
>>> I'm running a Flink cluster (a mini one with just one node), but the 
>>> problem is that my TaskManager can't reach my JobManager!
>>> 
>>> Here are logs from TaskManager
>>> ...
>>> Trying to register at JobManager 
>>> akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 20, timeout: 30 
>>> seconds)
>>> Trying to register at JobManager 
>>> akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 21, timeout: 30 
>>> seconds)
>>> Trying to register at JobManager 
>>> akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 22, timeout: 30 
>>> seconds)
>>> Trying to register at JobManager 
>>> akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 23, timeout: 30 
>>> seconds)
>>> Trying to register at JobManager 
>>> akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 24, timeout: 30 
>>> seconds)
>>> ...
>>> 
>>> My "JobManager UI" shows my TaskManager with this Path & ID: 
>>> "akka://flink/deadLetters" (in the TaskManagers tab).
>>> And I found these lines in my JobManager stdout:
>>> 
>>> Resource Manager associating with leading JobManager 
>>> Actor[akka://flink/user/jobmanager#-275619168] - leader session null
>>> TaskManager ResourceID{resourceId='1132cbdaf2d8204e5e42e321e8592754'} has 
>>> started.
>>> Registered TaskManager at MY_PRIV_IP (akka://flink/deadLetters) as 
>>> 7d9568445b4557a74d05a0771a08ad9c. Current number of registered hosts is 1. 
>>> Current number of alive task slots is 20.
>>> 
>>> 
>>> What's the meaning of these lines? Where should I look for the solution?
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> رضا سامعی / http://samee.blog.ir
>> 
>> 
>> 
>> -- 
>> رضا سامعی / http://samee.blog.ir
> 
> 
> 
> -- 
> رضا سامعی / http://samee.blog.ir
> <flink-jobmanager.out><flink-taskmanager.out>
