[ 
https://issues.apache.org/jira/browse/GEODE-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278313#comment-17278313
 ] 

Donal Evans edited comment on GEODE-8890 at 2/3/21, 7:29 PM:
-------------------------------------------------------------

In the logs associated with the original failure, we see that an earlier call 
to {{TcpClient.requestToServer()}} in 
{{LocatorDiscovery.exchangeRemoteLocators()}} with a 
{{RemoteLocatorJoinRequest}} succeeds, as the following output is logged: 
{code:java}
[vm2] [info 2020/12/25 05:33:18.209 GMT locator-2 <WAN Locator Discovery 
Thread1> tid=0x52] Locator discovery task exchanged locator information 
7a2096ae9648[44531] with localhost[45273]: {1=[7a2096ae9648[45273]], 
2=[7a2096ae9648[44531]]}.{code}
 

Next, the code sleeps for {{WAN_LOCATOR_PING_INTERVAL}} ms (default value is 10 
seconds), then attempts to ping the remote locator with a call to 
{{TcpClient.requestToServer()}} with a {{RemoteLocatorPingRequest}}. It is this 
second call that results in the {{ClassCastException}}, as can be seen by the 
time stamp on the fatal log ({{2020/12/25 05:33:28.218 GMT}}), which is almost 
exactly 10 seconds after the first log.

What is strange is that the test which logged the exception started after the 
first logged message from the WAN Locator Discovery Thread: 
{code:java}
[vm3] [info 2020/12/25 05:33:26.063 GMT <Geode Failure Detection Server thread 
1> tid=0x33] GMSHealthMonitor server thread exiting [setup] START TEST 
org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest.enableHostNameValidationAcrossAllComponents
 [vm0] [info 2020/12/25 05:33:26.065 GMT locator-0 <RMI TCP 
Connection(5)-172.17.0.10> tid=0x5a] Received method: 
org.apache.geode.test.dunit.rules.DistributedRestoreSystemProperties$$Lambda$19/39844442.run
 with 0 args on object:{code}
which implies that the thread doing the logging was started by a previous test.

When searching the logs for the ports used by the locators in the first logged 
message (44531 and 45273), we see that both locators that are mentioned stop 
before the test that logged the exception started: 
{code:java}
[vm2] [info 2020/12/25 05:33:25.705 GMT locator-2 <RMI TCP 
Connection(1)-172.17.0.10> tid=0x20] Distribution Locator on 
7a2096ae9648[44531] is stopped 
{code}
 
{code:java}
[vm0] [info 2020/12/25 05:33:25.989 GMT locator-0 <RMI TCP 
Connection(5)-172.17.0.10> tid=0x5a] Distribution Locator on 
7a2096ae9648[45273] is stopped {code}
but also see that a locator with the same port (45273) is started shortly 
before the test logged the exception: 
{code:java}
[vm0] [info 2020/12/25 05:33:27.188 GMT locator-0 <RMI TCP 
Connection(5)-172.17.0.10> tid=0x5a] DistributionManager 
172.17.0.10(locator-0:245:locator)<ec><v0>:41001 started on 
7a2096ae9648[45273]. There were 0 other DMs. others: [] (took 637 ms) (locator) 
{code}
which may be relevant.


was (Author: donalevans):
In the logs associated with the original failure, we see that an earlier call 
to {{TcpClient.requestToServer()}} in 
{{LocatorDiscovery.exchangeRemoteLocators()}} with a 
{{RemoteLocatorJoinRequest}} succeeds, as the following output is logged: 

{{}}
{code:java}

{code}
{{ [vm2] [info 2020/12/25 05:33:18.209 GMT locator-2 <WAN Locator Discovery 
Thread1> tid=0x52] Locator discovery task exchanged locator information 
7a2096ae9648[44531] with localhost[45273]: \{1=[7a2096ae9648[45273]], 
2=[7a2096ae9648[44531]]}.}}

 

Next, the code sleeps for {{WAN_LOCATOR_PING_INTERVAL}} ms (default value is 10 
seconds), then attempts to ping the remote locator with a call to 
{{TcpClient.requestToServer()}} with a {{RemoteLocatorPingRequest}}. It is this 
second call that results in the {{ClassCastException}}, as can be seen by the 
time stamp on the fatal log ({{2020/12/25 05:33:28.218 GMT}}), which is almost 
exactly 10 seconds after the first log.

What is strange is that the test which logged the exception started after the 
first logged message from the WAN Locator Discovery Thread: 

{{}}
{code:java}
[vm3] [info 2020/12/25 05:33:26.063 GMT <Geode Failure Detection Server thread 
1> tid=0x33] GMSHealthMonitor server thread exiting [setup] START TEST 
org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest.enableHostNameValidationAcrossAllComponents
 [vm0] [info 2020/12/25 05:33:26.065 GMT locator-0 <RMI TCP 
Connection(5)-172.17.0.10> tid=0x5a] Received method: 
org.apache.geode.test.dunit.rules.DistributedRestoreSystemProperties$$Lambda$19/39844442.run
 with 0 args on object:{code}
{{}}

which implies that the thread doing the logging was started by a previous test.

When searching the logs for the ports used by the locators in the first logged 
message (44531 and 45273), we see that both locators that are mentioned stop 
before the test that logged the exception started: 

{{}}
{code:java}
[vm2] [info 2020/12/25 05:33:25.705 GMT locator-2 <RMI TCP 
Connection(1)-172.17.0.10> tid=0x20] Distribution Locator on 
7a2096ae9648[44531] is stopped 
{code}
{{}}

{{}}

{{}}
{code:java}
[vm0] [info 2020/12/25 05:33:25.989 GMT locator-0 <RMI TCP 
Connection(5)-172.17.0.10> tid=0x5a] Distribution Locator on 
7a2096ae9648[45273] is stopped {code}
{{}}

but also see that a locator with the same port (45273) is started shortly 
before the test logged the exception: 

{{}}
{code:java}
[vm0] [info 2020/12/25 05:33:27.188 GMT locator-0 <RMI TCP 
Connection(5)-172.17.0.10> tid=0x5a] DistributionManager 
172.17.0.10(locator-0:245:locator)<ec><v0>:41001 started on 
7a2096ae9648[45273]. There were 0 other DMs. others: [] (took 637 ms) (locator) 
{code}
{{}}

which may be relevant.

> ClassCastException in LocatorDiscovery.exchangeRemoteLocators
> -------------------------------------------------------------
>
>                 Key: GEODE-8890
>                 URL: https://issues.apache.org/jira/browse/GEODE-8890
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.1, 1.14.0
>            Reporter: Donal Evans
>            Assignee: Donal Evans
>            Priority: Major
>              Labels: blocks-1.14.0​, pull-request-available
>
> {noformat}
> org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest
>  > enableHostNameValidationAcrossAllComponents FAILED
>     java.lang.AssertionError: Suspicious strings were written to the log 
> during this run.
>     Fix the strings or use IgnoredException.addIgnoredException to ignore.
>     -----------------------------------------------------------------------
>     Found suspect string in log4j at line 682
>     [fatal 2020/12/25 05:33:28.218 GMT <WAN Locator Discovery Thread1> 
> tid=82] Uncaught exception in thread Thread[WAN Locator Discovery 
> Thread1,5,RMI Runtime]
>     java.lang.ClassCastException: java.lang.Class cannot be cast to 
> org.apache.geode.cache.client.internal.locator.wan.RemoteLocatorPingResponse
>       at 
> org.apache.geode.cache.client.internal.locator.wan.LocatorDiscovery.exchangeRemoteLocators(LocatorDiscovery.java:191)
>       at 
> org.apache.geode.cache.client.internal.locator.wan.LocatorDiscovery$RemoteLocatorDiscovery.run(LocatorDiscovery.java:121)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to