On Mon, Jan 07, 2019 at 08:00:27PM +0000, Ximeng (Simon) Guan wrote:
> We do have NetInfo properly set up to include the only one IP that is used. 

Good to know, thanks.

I can't rule out MTU issues offhand, but I don't have time to dig in
further right now.
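
For reference, the relationship between a 1400-byte link MTU and the 1344
figure can be sketched as below. This is only back-of-the-envelope
arithmetic, not a diagnosis. It assumes IPv4 without options (20 bytes),
an 8-byte UDP header, and a 28-byte Rx header, and it assumes the Rx MTU
values count payload only; none of that is stated in this thread, and the
function names are made up for illustration. Any extra tunnel overhead on
the path is not modeled.

```python
# Assumed header sizes (not taken from this thread).
IPV4_HEADER = 20  # IPv4 header without options
UDP_HEADER = 8    # UDP header
RX_HEADER = 28    # Rx packet header (assumption)


def rx_payload_for_link_mtu(link_mtu: int) -> int:
    """Largest Rx payload that fits in one unfragmented IPv4/UDP packet."""
    return link_mtu - IPV4_HEADER - UDP_HEADER - RX_HEADER


def link_mtu_for_rx_payload(rx_payload: int) -> int:
    """Smallest link MTU that carries this Rx payload without fragmentation."""
    return rx_payload + IPV4_HEADER + UDP_HEADER + RX_HEADER


if __name__ == "__main__":
    print(rx_payload_for_link_mtu(1400))  # 1344
    print(link_mtu_for_rx_payload(1344))  # 1400
```

Under those assumptions, 1344 is what a 1400-byte link leaves for Rx
payload, so a path whose effective MTU is smaller than 1400 would still
fragment or drop full-sized packets.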

Do the problematic bos invocations hang for a minute or two before
reporting the "communications failure"?
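
One rough way to answer that is to time the command. The sketch below is
a generic wrapper, not anything OpenAFS-specific: `time_command` is a
hypothetical helper name, and the 300-second timeout is an arbitrary
choice.

```python
import subprocess
import time


def time_command(argv, timeout=300.0):
    """Run argv and return (elapsed_seconds, returncode, combined_output).

    returncode is None if the command did not finish within the timeout.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold error messages into the output
            text=True,
            timeout=timeout,
        )
        returncode, output = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        returncode, output = None, ""
    elapsed = time.monotonic() - start
    return elapsed, returncode, output


# Example, using the command from this thread:
#   elapsed, rc, out = time_command(["bos", "status", "afssrv1"])
#   print(f"{elapsed:.1f}s rc={rc}\n{out}")
```

A failure after a second or less would look different from one that
comes only after a long wait, so the elapsed time is worth noting
alongside the error text.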

The bosserver listens on port 7007, if you hadn't found that already -- a
packet capture would help show what's going on, if you have the ability to
get one of those.

-Ben

> Can the connection failure somehow come from the non-default MTU settings we 
> are using? That has bitten us repeatedly in the past in different places. We 
> use "-rxmaxmtu 1344" across the board for all ptserver, vlserver, 
> davolserver and dafileserver instances. I was told by the network folks that 
> they could not manage the default MTU of 1500 but have to use 1400 because of 
> the IPSec requirement...
> 
> Thank you!
> Simon
> 
> -----Original Message-----
> From: openafs-info-ad...@openafs.org <openafs-info-ad...@openafs.org> On 
> Behalf Of Benjamin Kaduk
> Sent: Monday, January 7, 2019 11:44 AM
> To: Ximeng (Simon) Guan <x...@royole.com>
> Cc: OpenAFS-info@openafs.org
> Subject: Re: [OpenAFS] Client connection failure: bos failed to contact 
> host's bosserver (communication failure (-1))
> 
> On Mon, Jan 07, 2019 at 07:40:36PM +0000, Ximeng (Simon) Guan wrote:
> > Hello,
> > 
> > After a power outage on Christmas Eve which forced two database servers and 
> > all the network switches in one of our offices to re-boot, our laptop 
> > clients in that office can no longer connect to one of the AFS servers 
> > hosted in the same office.
> > 
> > I am leaning towards the possibility that it is a network problem instead 
> > of an OpenAFS service problem because:
> > 
> >   1.  Remote offices can access the full AFS space, including those volumes 
> > hosted on the re-booted servers.
> >   2.  Between the servers there is no access problem. Nothing is wrong with 
> > the result of "bos status", "rxdebug" or "udebug". "fs checkservers" shows 
> > that all servers are running.
> >   3.  On the problematic laptops "fs checkservers" shows that "All servers 
> > are running".
> >   4.  On the problematic laptops "bos status afssrv1" returns a message:
> > 
> > "bos: failed to contact host's bosserver (communications failure (-1))."
> > 
> > But on the servers both in that office and in the remote offices, the same 
> > command shows that all services are up:
> > 
> > "Instance ptserver, currently running normally.
> > Instance vlserver, currently running normally.
> > Instance buserver, currently running normally.
> > Instance upserver, currently running normally.
> > Instance backupusers, currently running normally.
> >     Auxiliary status is: run next at Tue Jan  8 04:00:00 2019.
> > Instance dafs, currently running normally.
> >     Auxiliary status is: file server running."
> > 
> >   5.  On the problematic laptops "rxdebug afssrv1 -port 7000" returns 
> > *normal* output, for example:
> > 
> > "Trying 10.12.8.33 (port 7000):
> > Free packets: 2073/6357, packet reclaims: 3, calls: 81, used FDs: 36
> > not waiting for packets.
> > 0 calls waiting for a thread
> > 125 threads are idle
> > 1 calls have waited for a thread
> > Connection from host 10.9.119.50, port 7001, Cuid ae06e5b3/70fe0104
> >   serial 12,  natMTU 1344, security index 0, client conn
> >     call 0: # 4, state dally, mode: receiving, flags: receive_done
> >     call 1: # 0, state not initialized
> >     call 2: # 0, state not initialized
> >     call 3: # 0, state not initialized
> > Connection from host 10.12.4.74, port 7001, Cuid ae06e5b3/70fe0114
> >   serial 21,  natMTU 1344, security index 0, client conn
> >     call 0: # 7, state dally, mode: receiving, flags: receive_done
> >     call 1: # 0, state not initialized
> >     call 2: # 0, state not initialized
> >     call 3: # 0, state not initialized
> > Done."
> > 
> > I do not administer the network. Can I have some advice on how to further 
> > debug the connection problem? Which UDP port does the command "bos status" 
> > use?
> 
> My instinct would be that there is some multihoming going on and that 
> http://docs.openafs.org/Reference/5/NetRestrict.html and/or 
> http://docs.openafs.org/Reference/5/NetInfo.html are not properly configured.
> 
> -Ben
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
