[jira] [Updated] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
[ https://issues.apache.org/jira/browse/AURORA-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zameer Manji updated AURORA-1809:
---------------------------------
    Fix Version/s: 0.17.0

> Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
> --------------------------------------------------------------------
>
>                 Key: AURORA-1809
>                 URL: https://issues.apache.org/jira/browse/AURORA-1809
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>             Fix For: 0.17.0
>
>
> If you run it as part of the full test suite it fails like this:
> {noformat}
> FAILURES
> __ TestRunnerKillProcessGroup.test_pg_is_killed __
>
> self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>
>
>     def test_pg_is_killed(self):
>       runner = self.start_runner()
>       tm = TaskMonitor(runner.tempdir, runner.task_id)
>       self.wait_until_running(tm)
>       process_state, run_number = tm.get_active_processes()[0]
>       assert process_state.process == 'process'
>       assert run_number == 0
>
>       child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt')
>       while not os.path.exists(child_pidfile):
>         time.sleep(0.1)
>       parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt')
>       while not os.path.exists(parent_pidfile):
>         time.sleep(0.1)
>       with open(child_pidfile) as fp:
>         child_pid = int(fp.read().rstrip())
>       with open(parent_pidfile) as fp:
>         parent_pid = int(fp.read().rstrip())
>
>       ps = ProcessProviderFactory.get()
>       ps.collect_all()
>       assert parent_pid in ps.pids()
>       assert child_pid in ps.pids()
>       assert child_pid in ps.children_of(parent_pid)
>
>       with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp:
>         fp.write('go away!')
>
>       while tm.task_state() is not TaskState.SUCCESS:
>         time.sleep(0.1)
>
>       state = tm.get_state()
>       assert state.processes['process'][0].state == ProcessState.SUCCESS
>
>       ps.collect_all()
>       assert parent_pid not in ps.pids()
> >     assert child_pid not in ps.pids()
> E     assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
> E      +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>>()
> E      +    where <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs object at 0x7f0c798b1990>.pids
>
> src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
> -- Captured stderr call --
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> generated xml file: /home/je
[jira] [Created] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
Zameer Manji created AURORA-1809:
------------------------------------

             Summary: Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
                 Key: AURORA-1809
                 URL: https://issues.apache.org/jira/browse/AURORA-1809
             Project: Aurora
          Issue Type: Bug
            Reporter: Zameer Manji


If you run it as part of the full test suite it fails like this:

{noformat}
FAILURES
__ TestRunnerKillProcessGroup.test_pg_is_killed __

self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>

    def test_pg_is_killed(self):
      runner = self.start_runner()
      tm = TaskMonitor(runner.tempdir, runner.task_id)
      self.wait_until_running(tm)
      process_state, run_number = tm.get_active_processes()[0]
      assert process_state.process == 'process'
      assert run_number == 0

      child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt')
      while not os.path.exists(child_pidfile):
        time.sleep(0.1)
      parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt')
      while not os.path.exists(parent_pidfile):
        time.sleep(0.1)
      with open(child_pidfile) as fp:
        child_pid = int(fp.read().rstrip())
      with open(parent_pidfile) as fp:
        parent_pid = int(fp.read().rstrip())

      ps = ProcessProviderFactory.get()
      ps.collect_all()
      assert parent_pid in ps.pids()
      assert child_pid in ps.pids()
      assert child_pid in ps.children_of(parent_pid)

      with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp:
        fp.write('go away!')

      while tm.task_state() is not TaskState.SUCCESS:
        time.sleep(0.1)

      state = tm.get_state()
      assert state.processes['process'][0].state == ProcessState.SUCCESS

      ps.collect_all()
      assert parent_pid not in ps.pids()
>     assert child_pid not in ps.pids()
E     assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
E      +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>>()
E      +    where <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs object at 0x7f0c798b1990>.pids

src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
-- Captured stderr call --
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
generated xml file: /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml

1 failed, 719 passed, 6 skipped, 1 warnings in 206.00 seconds
FAILURE
{noformat}

If you run the test as a one-off you see this:

{noformat}
00:45:32 00:00 [main]
               (To run a reporting server: ./pants server)
00:45:32 00:00   [setup]
00:45:32 00:00     [parse]fatal: Not a git repository (or any of the parent directories): .git
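The failing assertion races with process-group teardown: the task can reach TaskState.SUCCESS while the killed child is still visible in /proc for a short window, so `child_pid in ps.pids()` can momentarily hold. A minimal sketch of the kind of bounded wait that would deflake the check, built only from the calls the test already uses (the helper name and the timeout value are illustrative, not from the Thermos code):

{code}
import time


def wait_until_gone(ps, pid, timeout=10.0, interval=0.1):
  """Poll the procfs snapshot until `pid` disappears, or fail after `timeout` seconds.

  Hypothetical test helper: `ps` is the object returned by
  ProcessProviderFactory.get() in the test above.
  """
  deadline = time.time() + timeout
  while time.time() < deadline:
    ps.collect_all()              # refresh the view of /proc
    if pid not in ps.pids():
      return
    time.sleep(interval)
  raise AssertionError('pid %d still visible after %.1fs' % (pid, timeout))
{code}

In the test, the two bare `assert ... not in ps.pids()` checks would become `wait_until_gone(ps, parent_pid)` and `wait_until_gone(ps, child_pid)`, trading an instant assertion for a bounded retry.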
[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes
[ https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637620#comment-15637620 ]

Zameer Manji commented on AURORA-1808:
---------------------------------------

https://github.com/apache/aurora/commit/5410c229f30d6d8e331cdddc5c84b9b2b5313c01

> Thermos executor should send SIGTERM to daemonized processes
> -------------------------------------------------------------
>
>                 Key: AURORA-1808
>                 URL: https://issues.apache.org/jira/browse/AURORA-1808
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Zameer Manji
>
> Thermos loses track of double-forking processes, which means that on task teardown
> the daemonized process never receives a signal to shut down cleanly.
> This can be a serious issue if one is running two processes:
> 1. nginx, which daemonizes and accepts HTTP requests.
> 2. A backend process that receives traffic from nginx over a local socket.
> On task shutdown Thermos will send SIGTERM to 2 and not 1, so nginx keeps
> accepting traffic even though the backend is dead. If Thermos could also send
> SIGTERM to 1, the task would tear down cleanly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
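The linked commit is not reproduced here, but the general Python-level mechanism for reaching a daemonized (double-forked) child is to signal its process group rather than the single pid the runner tracked. A hedged sketch of that idea only, not the actual Thermos change:

{code}
import os
import signal


def terminate_process_group(pid):
  """Send SIGTERM to the process group containing `pid`.

  Illustrative only: this assumes the task's processes still share a process
  group, so a double-forked daemon is reached even after its parent/child
  link to the runner was lost. A daemon that calls setsid() would need to be
  found by other means (e.g. the pidfiles the test above writes).
  """
  try:
    pgid = os.getpgid(pid)
  except OSError:
    return  # the process already exited
  os.killpg(pgid, signal.SIGTERM)
{code}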
[jira] [Commented] (AURORA-1780) Offers with unknown resource types to Aurora crash the scheduler
[ https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636746#comment-15636746 ]

Joshua Cohen commented on AURORA-1780:
---------------------------------------

+1, sounds like the most reasonable course of action.

> Offers with unknown resource types to Aurora crash the scheduler
> -----------------------------------------------------------------
>
>                 Key: AURORA-1780
>                 URL: https://issues.apache.org/jira/browse/AURORA-1780
>             Project: Aurora
>          Issue Type: Bug
>         Environment: vagrant
>            Reporter: Renan DelValle
>            Assignee: Renan DelValle
>
> Taking offers from agents that advertise resources unknown to Aurora causes the scheduler to crash.
> Steps to reproduce:
> {code}
> vagrant up
> sudo service mesos-slave stop
> echo "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:2;gpus(*):4;test:200" | sudo tee /etc/mesos-slave/resources
> sudo rm -f /var/lib/mesos/meta/slaves/latest
> sudo service mesos-slave start
> {code}
> Wait a few moments for the offer to be made to Aurora:
> {code}
> I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0"
> I0922 02:42:30.585597  2999 log.cpp:577] Attempting to append 109 bytes to the log
> I0922 02:42:30.585654  2999 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 4
> I0922 02:42:30.585747  2999 replica.cpp:537] Replica received write request for position 4 from (10)@192.168.33.7:8083
> I0922 02:42:30.586858  2999 leveldb.cpp:341] Persisting action (125 bytes) to leveldb took 1.086601ms
> I0922 02:42:30.586897  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587020  2999 replica.cpp:691] Replica received learned notice for position 4 from @0.0.0.0:0
> I0922 02:42:30.587785  2999 leveldb.cpp:341] Persisting action (127 bytes) to leveldb took 746999ns
> I0922 02:42:30.587805  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587811  2999 replica.cpp:697] Replica learned APPEND action at position 4
> I0922 02:42:30.601 [SchedulerImpl-0, OfferManager$OfferManagerImpl:185] Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction.
> Sep 22, 2016 2:42:38 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING state.
> java.lang.NullPointerException: Unknown Mesos resource: name: "test"
> type: SCALAR
> scalar {
>   value: 200.0
> }
> role: "*"
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355)
>     at org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52)
>     at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>     at java.util.Iterator.forEachRemaining(Iterator.java:115)
>     at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>     at org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274)
>     at org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239)
>     at org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153)
>     at org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168)
>     at org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130)
>     at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>     at com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.ru
[jira] [Resolved] (AURORA-1785) Populate curator latches with scheduler information
[ https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb resolved AURORA-1785.
---------------------------------
    Resolution: Fixed

> Populate curator latches with scheduler information
> ----------------------------------------------------
>
>                 Key: AURORA-1785
>                 URL: https://issues.apache.org/jira/browse/AURORA-1785
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Zameer Manji
>            Assignee: Jing Chen
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.17.0
>
>
> If you look at the Mesos ZK node for leader election you see something like this:
> {noformat}
> u'json.info_000104',
> u'json.info_000102',
> u'json.info_000101',
> u'json.info_98',
> u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for leadership.
> It is a JSON-serialized {{MasterInfo}} protobuf. This means an operator can
> inspect who is contending for leadership by checking the content of the nodes.
> When you check the Aurora ZK node you see something like this:
> {noformat}
> u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
> u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
> u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
> u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
> u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
> u'member_000781'
> {noformat}
> Only the leader node contains information; the curator latches contain none.
> It is not possible to figure out which machines are contending for leadership
> purely from ZK.
> I think we should attach data to the latches like Mesos does.
> Being able to do this is invaluable for debugging issues if an extra master is
> added to the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information
[ https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636194#comment-15636194 ]

Stephan Erb commented on AURORA-1785:
--------------------------------------

RB https://reviews.apache.org/r/52665/ has landed. Thanks!

> Populate curator latches with scheduler information
> ----------------------------------------------------
>
>                 Key: AURORA-1785
>                 URL: https://issues.apache.org/jira/browse/AURORA-1785
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Zameer Manji
>            Assignee: Jing Chen
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.17.0
>
>
> If you look at the Mesos ZK node for leader election you see something like this:
> {noformat}
> u'json.info_000104',
> u'json.info_000102',
> u'json.info_000101',
> u'json.info_98',
> u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for leadership.
> It is a JSON-serialized {{MasterInfo}} protobuf. This means an operator can
> inspect who is contending for leadership by checking the content of the nodes.
> When you check the Aurora ZK node you see something like this:
> {noformat}
> u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
> u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
> u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
> u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
> u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
> u'member_000781'
> {noformat}
> Only the leader node contains information; the curator latches contain none.
> It is not possible to figure out which machines are contending for leadership
> purely from ZK.
> I think we should attach data to the latches like Mesos does.
> Being able to do this is invaluable for debugging issues if an extra master is
> added to the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
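The gap described in the ticket is easy to see from an operator's shell: the Mesos json.info_* znodes carry a MasterInfo payload, while the Aurora latch znodes come back empty. A small inspection sketch using kazoo; the ZK connection string and the latch path are placeholders, not values taken from the ticket:

{code}
from kazoo.client import KazooClient

# Placeholder ensemble and znode path; substitute your cluster's ZooKeeper
# hosts and the scheduler's leader-election path.
ZK_HOSTS = 'localhost:2181'
LATCH_PATH = '/aurora/scheduler'

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
  for child in sorted(zk.get_children(LATCH_PATH)):
    data, stat = zk.get('%s/%s' % (LATCH_PATH, child))
    # Before AURORA-1785, only the member_... leader node has a payload;
    # the _c_...-latch-... contenders carry no data.
    print('%s -> %r' % (child, data))
finally:
  zk.stop()
{code}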