I'm very grateful for your helpful attention to my bizarre problem, Simon.

Two days ago, my efforts to instrument the nodes had the weird side-effect of making the problem go away.  So now I have no problem; only the mystery remains, and I have little hope of resolving it.

For the record, below are some details that seem relevant, at least to me.



On 2/19/20 4:50 AM, Simon Wunderlich wrote:

-----------------------------------------------------------------------

When nodes become unreachable, they do so only partially.  Consider this
weirdness I encountered two days ago:  Given nodes a, b, c, d, from the
perspective of a, d has disappeared; in other words, "a# batctl ping d"
doesn't work.  But I ssh'd from a to b, then from b to c, then from c to
d, all successfully.  And "a# batctl ping d" still wasn't working, even
though I was talking to d through that chain of ssh pipes.  Any ideas on
what that might mean?  (When I reboot a -- the gateway -- everything
always works again, usually for many hours, but never as long as a whole
day.)
Hmm, that's strange indeed. Did you have good connection between all those
devices? There is a certain "horizon", e.g. if you have many weak links in a
daisy chain, the OGMs are dropped before they reach the end of the
path.

Did you see node D in the originator table of node A?
As discussed below, when I added instrumentation, the problem disappeared.  (*insert muffled scream here*)

----------------------------------------------------------------------

Do I have a problem because the two meshes, and everything connected to
them, all share the same LAN?  I note "received a claim frame from
another group" in the above log excerpt.  (I don't know what that means,
but I'm guessing that the two meshes are getting each other's
maintenance traffic.)  Should the two meshes be separate subnets?
It's possible and perfectly fine if you have two meshes connected to the same
LAN like this:

https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-Testcases#Two-meshes-connected-by-one-LAN

Just make sure that the meshes are properly disconnected and do not rejoin from
time to time (e.g. by having different SSIDs).
I think I had missed this page.  Thanks for pointing it out.

In a similar vein: Should each node be running its own subnet?

----------------------------------------------------------------------

Should I try changing all nodes over to BATMAN_V, rebooting them all,
and hoping they re-establish contact?  (It would be massively
inconvenient to have to reset them all physically.)
No, BATMAN V will not magically fix this.
Then I won't switch to BATMAN_V.  "If it ain't broke, don't fix it."

----------------------------------------------------------------------

Should I try turning off bridge loop avoidance?
Bridge loop avoidance should be on as soon as you have any two nodes connected
to the same LAN and mesh at one time.
Then I guess I don't need BLA.  I'm tempted to turn it off just to avoid the overhead, because only the gateways have wired access to the LAN, and all other nodes have only their respective meshes.

I think we should work on your a - b - c - d chain and find out why a can't
talk to d. That seems like the most obvious symptom.
I would do that if it were still broken!

Here's what I did, in some detail: rosepark dot us hash Feb182020


