Aled Sage created BROOKLYN-560:
----------------------------------
Summary: MachineEntity rebind sometimes fails to add machine metrics feed
Key: BROOKLYN-560
URL: https://issues.apache.org/jira/browse/BROOKLYN-560
Project: Brooklyn
Issue Type: Bug
Reporter: Aled Sage
{{MachineEntityJcloudsRebindTest.testRebind}} fails non-deterministically in
1.0.0-SNAPSHOT:
{noformat}
2017-11-10 21:22:12,481 INFO TESTNG FAILED: "Surefire test" - org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.testRebind() finished in 30882 ms
java.lang.AssertionError: failed succeeds-eventually, 75 attempts, 30001ms elapsed: AssertionError: Commands (/etc/os-release) not contain in [ExecCmd{...},...]
	at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.assertRecordedSshCmdContainsEventually(MachineEntityJcloudsRebindTest.java:117)
	at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.testRebind(MachineEntityJcloudsRebindTest.java:103)
Caused by: java.lang.AssertionError:
	at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.assertRecordedSshCmdContainsEventually(MachineEntityJcloudsRebindTest.java:117)
	at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.testRebind(MachineEntityJcloudsRebindTest.java:103)
{noformat}
Expressed as a bug in production code: when you rebind to a {{MachineEntity}},
it sometimes fails to re-add the {{machineMetricsFeed}} feed to the entity, so
sensors such as {{machine.loadAverage}} and {{machine.cpu}} are no longer
updated.
The problem is that {{SoftwareProcessImpl.callRebindHooks()}} schedules a task
to call {{connectSensors}} at a random time within the next 10 seconds.
However, if that task executes very soon (before the current thread has got
far enough with the rest of rebind), then when {{connectSensors}} calls
{{JcloudsSshMachineLocation.inferMachineDetails}}, it fails:
{{JcloudsSshMachineLocation.isManaged()}} still returns false. It therefore
creates an empty {{BasicOsDetails}} rather than executing the {{os-details.sh}}
script.
You can make this test fail consistently if you change the
{{MachineEntity.MAXIMUM_REBIND_SENSOR_CONNECT_DELAY}} to {{0}} (rather than the
{{100ms}} that is set in the test).
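To illustrate the race, here is a minimal, self-contained Java sketch. It is not Brooklyn code: {{FakeMachineLocation}} and {{simulateRebind}} are hypothetical stand-ins that model only the {{isManaged()}} guard in {{inferMachineDetails}} and the delayed {{connectSensors}} task, and the 200ms of "rest of rebind" work is an arbitrary assumption. With a delay of 0 the scheduled task runs before the location is marked as managed and sees empty details; with a delay longer than the remaining rebind work it sees the script's details.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RebindRaceSketch {

    static final String EMPTY_DETAILS = "<empty BasicOsDetails>";
    static final String SCRIPT_DETAILS = "<details from os-details.sh>";

    /** Hypothetical stand-in for JcloudsSshMachineLocation; only the isManaged() guard is modelled. */
    static final class FakeMachineLocation {
        private final AtomicBoolean managed = new AtomicBoolean(false);

        void markManaged() { managed.set(true); }

        boolean isManaged() { return managed.get(); }

        /** Unmanaged: fall back to empty details instead of running the (stubbed) script. */
        String inferOsDetails() {
            return isManaged() ? SCRIPT_DETAILS : EMPTY_DETAILS;
        }
    }

    /**
     * One simulated rebind: schedule a connectSensors-like task after delayMs, spend
     * rebindWorkMs on the "rest of rebind", then mark the location as managed.
     * Returns the OS details that the scheduled task observed.
     */
    static String simulateRebind(long delayMs, long rebindWorkMs) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            FakeMachineLocation location = new FakeMachineLocation();
            Callable<String> connectSensors = location::inferOsDetails;
            ScheduledFuture<String> observed =
                    executor.schedule(connectSensors, delayMs, TimeUnit.MILLISECONDS);
            Thread.sleep(rebindWorkMs); // rest of rebind on the current thread
            location.markManaged();     // only now does isManaged() return true
            return observed.get();      // wait for the scheduled task's result
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Delay forced to 0: the task runs before the location is managed.
        System.out.println("delay 0ms    -> " + simulateRebind(0, 200));
        // Delay longer than the remaining rebind work: the location is managed in time.
        System.out.println("delay 1000ms -> " + simulateRebind(1000, 200));
        // Random delay in [0, 400]ms against 200ms of work: the outcome varies run to run.
        long randomDelay = ThreadLocalRandom.current().nextLong(401);
        System.out.println("delay " + randomDelay + "ms -> " + simulateRebind(randomDelay, 200));
    }
}
```

In production the delay is drawn from a window of up to 10 seconds, so whether the task lands before or after the location becomes managed depends on the draw and on how long the rest of rebind takes, which is why the failure is intermittent.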
The stacktrace of the job scheduled by {{SoftwareProcessImpl.callRebindHooks}}
is shown below:
{noformat}
Daemon Thread [brooklyn-execmanager-Vv9IUPme-1] (Suspended)
owns: Object (id=571)
JcloudsSshMachineLocation.inferMachineDetails() line: 533
JcloudsSshMachineLocation(SshMachineLocation).getMachineDetails() line: 1037
JcloudsSshMachineLocation(SshMachineLocation).getOsDetails() line: 1018
MachineEntityImpl.connectSensors() line: 58
SoftwareProcessImpl$2.call() line: 402
SoftwareProcessImpl$2.call() line: 1
BasicExecutionManager$ScheduledTaskCallable$1.call() line: 476
BasicExecutionManager$SubmissionCallable<T>.call() line: 565
FutureTask<V>.run() line: 266
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1142
ThreadPoolExecutor$Worker.run() line: 617
Thread.run() line: 745
{noformat}
Note this relates to https://issues.apache.org/jira/browse/BROOKLYN-425, which
reported similar symptoms of the feed not being registered, but where it
happened every time.