[jira] [Updated] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed

2016-11-04 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1809:
-
Fix Version/s: 0.17.0

> Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
> ---
>
> Key: AURORA-1809
> URL: https://issues.apache.org/jira/browse/AURORA-1809
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
> Fix For: 0.17.0
>
>
> If you run it as part of the full test suite it fails like this:
> {noformat}
>   FAILURES 
>  __ TestRunnerKillProcessGroup.test_pg_is_killed __
>  
>  self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>
>  
>  def test_pg_is_killed(self):
>    runner = self.start_runner()
>    tm = TaskMonitor(runner.tempdir, runner.task_id)
>    self.wait_until_running(tm)
>    process_state, run_number = tm.get_active_processes()[0]
>    assert process_state.process == 'process'
>    assert run_number == 0
>
>    child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt')
>    while not os.path.exists(child_pidfile):
>      time.sleep(0.1)
>    parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt')
>    while not os.path.exists(parent_pidfile):
>      time.sleep(0.1)
>    with open(child_pidfile) as fp:
>      child_pid = int(fp.read().rstrip())
>    with open(parent_pidfile) as fp:
>      parent_pid = int(fp.read().rstrip())
>
>    ps = ProcessProviderFactory.get()
>    ps.collect_all()
>    assert parent_pid in ps.pids()
>    assert child_pid in ps.pids()
>    assert child_pid in ps.children_of(parent_pid)
>
>    with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp:
>      fp.write('go away!')
>
>    while tm.task_state() is not TaskState.SUCCESS:
>      time.sleep(0.1)
>
>    state = tm.get_state()
>    assert state.processes['process'][0].state == ProcessState.SUCCESS
>
>    ps.collect_all()
>    assert parent_pid not in ps.pids()
>  >   assert child_pid not in ps.pids()
>  E   assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
>  E    +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>>()
>  E    +    where <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs object at 0x7f0c798b1990>.pids
>  
>  
> src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
>  -- Captured stderr call --
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>   generated xml file: /home/je

[jira] [Created] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed

2016-11-04 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1809:


 Summary: Investigate flaky test 
TestRunnerKillProcessGroup.test_pg_is_killed
 Key: AURORA-1809
 URL: https://issues.apache.org/jira/browse/AURORA-1809
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


If you run it as part of the full test suite it fails like this:
{noformat}
  FAILURES 
 __ TestRunnerKillProcessGroup.test_pg_is_killed __
 
 self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>
 
 def test_pg_is_killed(self):
   runner = self.start_runner()
   tm = TaskMonitor(runner.tempdir, runner.task_id)
   self.wait_until_running(tm)
   process_state, run_number = tm.get_active_processes()[0]
   assert process_state.process == 'process'
   assert run_number == 0

   child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt')
   while not os.path.exists(child_pidfile):
     time.sleep(0.1)
   parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt')
   while not os.path.exists(parent_pidfile):
     time.sleep(0.1)
   with open(child_pidfile) as fp:
     child_pid = int(fp.read().rstrip())
   with open(parent_pidfile) as fp:
     parent_pid = int(fp.read().rstrip())

   ps = ProcessProviderFactory.get()
   ps.collect_all()
   assert parent_pid in ps.pids()
   assert child_pid in ps.pids()
   assert child_pid in ps.children_of(parent_pid)

   with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp:
     fp.write('go away!')

   while tm.task_state() is not TaskState.SUCCESS:
     time.sleep(0.1)

   state = tm.get_state()
   assert state.processes['process'][0].state == ProcessState.SUCCESS

   ps.collect_all()
   assert parent_pid not in ps.pids()
 >   assert child_pid not in ps.pids()
 E   assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
 E    +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>>()
 E    +    where <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs object at 0x7f0c798b1990>.pids
 
 
src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
 -- Captured stderr call --
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
  generated xml file: /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml

  1 failed, 719 passed, 6 skipped, 1 warnings in 206.00 seconds
 
FAILURE
{noformat}
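The failure pattern suggests a race rather than a leaked process: the task has already reached SUCCESS, but the killed child (pid 30475 above) has not yet disappeared from the process snapshot when the final assertion runs. Below is a minimal sketch of how the assertion could be made tolerant of that race; the helper name and timeout are made up for illustration and this is not necessarily the right fix (if the child genuinely outlives teardown, compare AURORA-1808, polling would only hide the bug):

{code}
import time


def wait_for_pid_exit(ps, pid, timeout=10.0, interval=0.1):
  """Poll the process provider until `pid` disappears or `timeout` elapses."""
  deadline = time.time() + timeout
  while time.time() < deadline:
    ps.collect_all()  # refresh the snapshot of live pids
    if pid not in ps.pids():
      return True
    time.sleep(interval)
  return False


# In the test, replace the single post-kill snapshot
#   ps.collect_all()
#   assert child_pid not in ps.pids()
# with a bounded poll:
#   assert wait_for_pid_exit(ps, parent_pid)
#   assert wait_for_pid_exit(ps, child_pid)
{code}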


If you run the test as a one-off you see this:
{noformat}
00:45:32 00:00 [main]
   (To run a reporting server: ./pants server)
00:45:32 00:00   [setup]
00:45:32 00:00 [parse]fatal: Not a git repository (or any of the parent directories): .git

   

[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes

2016-11-04 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637620#comment-15637620
 ] 

Zameer Manji commented on AURORA-1808:
--

https://github.com/apache/aurora/commit/5410c229f30d6d8e331cdddc5c84b9b2b5313c01

> Thermos executor should send SIGTERM to daemonized processes 
> -
>
> Key: AURORA-1808
> URL: https://issues.apache.org/jira/browse/AURORA-1808
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Thermos loses track of double-forking processes, meaning that on task teardown 
> the daemonized process will not receive a signal to shut down cleanly.
> This can be a serious issue if one is running two processes: 
> 1. nginx, which daemonizes and accepts HTTP requests.
> 2. A backend process that receives traffic from nginx over a local socket. 
> On task shutdown thermos will send SIGTERM to 2 but not 1, so nginx keeps 
> accepting traffic even though the backend is dead. If thermos could also 
> send SIGTERM to 1, the task would tear down cleanly.
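> A minimal sketch of the general idea, signalling the coordinator's whole process 
> group instead of only the tracked child pids. This is illustrative only: the 
> function name is made up, the authoritative change is the commit linked above, 
> and it reaches double-forked children only if they stayed in the original 
> process group rather than calling setsid():
> {code}
> import errno
> import os
> import signal
>
>
> def signal_process_group(coordinator_pid, sig=signal.SIGTERM):
>   """Send `sig` to the coordinator's entire process group."""
>   try:
>     pgid = os.getpgid(coordinator_pid)
>   except OSError:
>     return  # coordinator already exited
>   try:
>     os.killpg(pgid, sig)
>   except OSError as e:
>     if e.errno != errno.ESRCH:  # ignore "no such process group"
>       raise
> {code}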



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1780) Offers with unknown resources types to Aurora crash the scheduler

2016-11-04 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636746#comment-15636746
 ] 

Joshua Cohen commented on AURORA-1780:
--

+1 sounds like the most reasonable course of action.

> Offers with unknown resources types to Aurora crash the scheduler
> -
>
> Key: AURORA-1780
> URL: https://issues.apache.org/jira/browse/AURORA-1780
> Project: Aurora
>  Issue Type: Bug
> Environment: vagrant
>Reporter: Renan DelValle
>Assignee: Renan DelValle
>
> Taking offers from Agents which have resources that are not known to Aurora 
> causes the Scheduler to crash.
> Steps to reproduce:
> {code}
> vagrant up
> sudo service mesos-slave stop
> echo 
> "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:2;gpus(*):4;test:200"
>  | sudo tee /etc/mesos-slave/resources
> sudo rm -f /var/lib/mesos/meta/slaves/latest
> sudo service mesos-slave start
> {code}
> Wait a few moments for the offer to be made to Aurora:
> {code}
> I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0"
> I0922 02:42:30.585597  2999 log.cpp:577] Attempting to append 109 bytes to the log
> I0922 02:42:30.585654  2999 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 4
> I0922 02:42:30.585747  2999 replica.cpp:537] Replica received write request for position 4 from (10)@192.168.33.7:8083
> I0922 02:42:30.586858  2999 leveldb.cpp:341] Persisting action (125 bytes) to leveldb took 1.086601ms
> I0922 02:42:30.586897  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587020  2999 replica.cpp:691] Replica received learned notice for position 4 from @0.0.0.0:0
> I0922 02:42:30.587785  2999 leveldb.cpp:341] Persisting action (127 bytes) to leveldb took 746999ns
> I0922 02:42:30.587805  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587811  2999 replica.cpp:697] Replica learned APPEND action at position 4
> I0922 02:42:30.601 [SchedulerImpl-0, OfferManager$OfferManagerImpl:185] Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction.
> Sep 22, 2016 2:42:38 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING state.
> java.lang.NullPointerException: Unknown Mesos resource: name: "test"
> type: SCALAR
> scalar {
>   value: 200.0
> }
> role: "*"
>   at java.util.Objects.requireNonNull(Objects.java:228)
>   at org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355)
>   at org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52)
>   at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>   at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>   at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>   at java.util.Iterator.forEachRemaining(Iterator.java:115)
>   at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>   at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>   at org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274)
>   at org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239)
>   at org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153)
>   at org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168)
>   at org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130)
>   at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>   at com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.ru

[jira] [Resolved] (AURORA-1785) Populate curator latches with scheduler information

2016-11-04 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1785.
-
Resolution: Fixed

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
> Fix For: 0.17.0
>
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches, like mesos does.
> Being able to do this is invaluable for debugging issues, for example if an 
> extra master is added to the cluster.
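> Since the point is operator inspection, here is a short sketch of how one could 
> dump whatever data the latch nodes carry once they are populated. It uses the 
> kazoo client, and the ensemble address and leader path below are placeholders, 
> not Aurora's actual defaults:
> {code}
> from kazoo.client import KazooClient
>
> # Placeholders: point these at the real ZooKeeper ensemble and Aurora leader path.
> ZK_HOSTS = 'localhost:2181'
> LEADER_PATH = '/aurora/scheduler/leader'
>
> zk = KazooClient(hosts=ZK_HOSTS)
> zk.start()
> try:
>   for child in sorted(zk.get_children(LEADER_PATH)):
>     data, _stat = zk.get('%s/%s' % (LEADER_PATH, child))
>     # With populated latches every node would carry scheduler info;
>     # today only the leader's member_... node has data.
>     print('%s -> %r' % (child, data))
> finally:
>   zk.stop()
> {code}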



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information

2016-11-04 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636194#comment-15636194
 ] 

Stephan Erb commented on AURORA-1785:
-

RB https://reviews.apache.org/r/52665/ has landed.

Thanks!

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
> Fix For: 0.17.0
>
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches, like mesos does.
> Being able to do this is invaluable for debugging issues, for example if an 
> extra master is added to the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)