[jira] [Updated] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
[ https://issues.apache.org/jira/browse/AURORA-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zameer Manji updated AURORA-1809:
---------------------------------
    Fix Version/s: 0.17.0

> Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
> --------------------------------------------------------------------
>
>                 Key: AURORA-1809
>                 URL: https://issues.apache.org/jira/browse/AURORA-1809
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>             Fix For: 0.17.0
>
>
> If you run it as part of the full test suite it fails like this:
> {noformat}
> FAILURES
> __ TestRunnerKillProcessGroup.test_pg_is_killed __
>
> self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>
>
>     def test_pg_is_killed(self):
>       runner = self.start_runner()
>       tm = TaskMonitor(runner.tempdir, runner.task_id)
>       self.wait_until_running(tm)
>       process_state, run_number = tm.get_active_processes()[0]
>       assert process_state.process == 'process'
>       assert run_number == 0
>
>       child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt')
>       while not os.path.exists(child_pidfile):
>         time.sleep(0.1)
>       parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt')
>       while not os.path.exists(parent_pidfile):
>         time.sleep(0.1)
>       with open(child_pidfile) as fp:
>         child_pid = int(fp.read().rstrip())
>       with open(parent_pidfile) as fp:
>         parent_pid = int(fp.read().rstrip())
>
>       ps = ProcessProviderFactory.get()
>       ps.collect_all()
>       assert parent_pid in ps.pids()
>       assert child_pid in ps.pids()
>       assert child_pid in ps.children_of(parent_pid)
>
>       with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp:
>         fp.write('go away!')
>
>       while tm.task_state() is not TaskState.SUCCESS:
>         time.sleep(0.1)
>
>       state = tm.get_state()
>       assert state.processes['process'][0].state == ProcessState.SUCCESS
>
>       ps.collect_all()
>       assert parent_pid not in ps.pids()
> >     assert child_pid not in ps.pids()
> E     assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
> E      +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>>()
> E      +    where <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs object at 0x7f0c798b1990>.pids
>
> src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
> -- Captured stderr call --
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> generated xml file: /home/je
[jira] [Created] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
Zameer Manji created AURORA-1809:
------------------------------------

             Summary: Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
                 Key: AURORA-1809
                 URL: https://issues.apache.org/jira/browse/AURORA-1809
             Project: Aurora
          Issue Type: Bug
            Reporter: Zameer Manji


If you run it as part of the full test suite it fails like this:

{noformat}
FAILURES
__ TestRunnerKillProcessGroup.test_pg_is_killed __

self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>

    def test_pg_is_killed(self):
      runner = self.start_runner()
      tm = TaskMonitor(runner.tempdir, runner.task_id)
      self.wait_until_running(tm)
      process_state, run_number = tm.get_active_processes()[0]
      assert process_state.process == 'process'
      assert run_number == 0

      child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt')
      while not os.path.exists(child_pidfile):
        time.sleep(0.1)
      parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt')
      while not os.path.exists(parent_pidfile):
        time.sleep(0.1)
      with open(child_pidfile) as fp:
        child_pid = int(fp.read().rstrip())
      with open(parent_pidfile) as fp:
        parent_pid = int(fp.read().rstrip())

      ps = ProcessProviderFactory.get()
      ps.collect_all()
      assert parent_pid in ps.pids()
      assert child_pid in ps.pids()
      assert child_pid in ps.children_of(parent_pid)

      with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp:
        fp.write('go away!')

      while tm.task_state() is not TaskState.SUCCESS:
        time.sleep(0.1)

      state = tm.get_state()
      assert state.processes['process'][0].state == ProcessState.SUCCESS

      ps.collect_all()
      assert parent_pid not in ps.pids()
>     assert child_pid not in ps.pids()
E     assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
E      +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>>()
E      +    where <bound method ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs object at 0x7f0c798b1990>.pids

src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
-- Captured stderr call --
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
generated xml file: /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml

1 failed, 719 passed, 6 skipped, 1 warnings in 206.00 seconds
FAILURE
{noformat}

If you run the test as a one-off you see this:

{noformat}
00:45:32 00:00 [main]
               (To run a reporting server: ./pants server)
00:45:32 00:00   [setup]
00:45:32 00:00     [parse]fatal: Not a git repository (or any of the parent directories): .git
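The failing assertion races with process-group teardown: the task can reach TaskState.SUCCESS while the killed child is still visible in /proc for a short window, so `child_pid in ps.pids()` can momentarily hold. A minimal sketch of the kind of bounded wait that would deflake the check, built only from the calls the test already uses (the helper name and the timeout value are illustrative, not from the Thermos code):

{code}
import time


def wait_until_gone(ps, pid, timeout=10.0, interval=0.1):
  """Poll the procfs snapshot until `pid` disappears, or fail after `timeout` seconds.

  Hypothetical test helper: `ps` is the object returned by
  ProcessProviderFactory.get() in the test above.
  """
  deadline = time.time() + timeout
  while time.time() < deadline:
    ps.collect_all()              # refresh the view of /proc
    if pid not in ps.pids():
      return
    time.sleep(interval)
  raise AssertionError('pid %d still visible after %.1fs' % (pid, timeout))
{code}

In the test, the two bare `assert ... not in ps.pids()` checks would become `wait_until_gone(ps, parent_pid)` and `wait_until_gone(ps, child_pid)`, trading an instant assertion for a bounded retry.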
[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes
[ https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637620#comment-15637620 ]

Zameer Manji commented on AURORA-1808:
---------------------------------------

https://github.com/apache/aurora/commit/5410c229f30d6d8e331cdddc5c84b9b2b5313c01

> Thermos executor should send SIGTERM to daemonized processes
> -------------------------------------------------------------
>
>                 Key: AURORA-1808
>                 URL: https://issues.apache.org/jira/browse/AURORA-1808
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Zameer Manji
>
> Thermos loses track of double-forking processes, which means that on task teardown
> the daemonized process never receives a signal to shut down cleanly.
> This can be a serious issue if one is running two processes:
> 1. nginx, which daemonizes and accepts HTTP requests.
> 2. A backend process that receives traffic from nginx over a local socket.
> On task shutdown Thermos will send SIGTERM to 2 and not 1, so nginx keeps
> accepting traffic even though the backend is dead. If Thermos could also send
> SIGTERM to 1, the task would tear down cleanly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
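The linked commit is not reproduced here, but the general Python-level mechanism for reaching a daemonized (double-forked) child is to signal its process group rather than the single pid the runner tracked. A hedged sketch of that idea only, not the actual Thermos change:

{code}
import os
import signal


def terminate_process_group(pid):
  """Send SIGTERM to the process group containing `pid`.

  Illustrative only: this assumes the task's processes still share a process
  group, so a double-forked daemon is reached even after its parent/child
  link to the runner was lost. A daemon that calls setsid() would need to be
  found by other means (e.g. the pidfiles the test above writes).
  """
  try:
    pgid = os.getpgid(pid)
  except OSError:
    return  # the process already exited
  os.killpg(pgid, signal.SIGTERM)
{code}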
[jira] [Commented] (AURORA-1780) Offers with unknown resource types to Aurora crash the scheduler
[ https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636746#comment-15636746 ]

Joshua Cohen commented on AURORA-1780:
---------------------------------------

+1, sounds like the most reasonable course of action.

> Offers with unknown resource types to Aurora crash the scheduler
> -----------------------------------------------------------------
>
>                 Key: AURORA-1780
>                 URL: https://issues.apache.org/jira/browse/AURORA-1780
>             Project: Aurora
>          Issue Type: Bug
>         Environment: vagrant
>            Reporter: Renan DelValle
>            Assignee: Renan DelValle
>
> Taking offers from agents that advertise resources unknown to Aurora causes the scheduler to crash.
> Steps to reproduce:
> {code}
> vagrant up
> sudo service mesos-slave stop
> echo "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:2;gpus(*):4;test:200" | sudo tee /etc/mesos-slave/resources
> sudo rm -f /var/lib/mesos/meta/slaves/latest
> sudo service mesos-slave start
> {code}
> Wait a few moments for the offer to be made to Aurora:
> {code}
> I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0"
> I0922 02:42:30.585597  2999 log.cpp:577] Attempting to append 109 bytes to the log
> I0922 02:42:30.585654  2999 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 4
> I0922 02:42:30.585747  2999 replica.cpp:537] Replica received write request for position 4 from (10)@192.168.33.7:8083
> I0922 02:42:30.586858  2999 leveldb.cpp:341] Persisting action (125 bytes) to leveldb took 1.086601ms
> I0922 02:42:30.586897  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587020  2999 replica.cpp:691] Replica received learned notice for position 4 from @0.0.0.0:0
> I0922 02:42:30.587785  2999 leveldb.cpp:341] Persisting action (127 bytes) to leveldb took 746999ns
> I0922 02:42:30.587805  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587811  2999 replica.cpp:697] Replica learned APPEND action at position 4
> I0922 02:42:30.601 [SchedulerImpl-0, OfferManager$OfferManagerImpl:185] Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction.
> Sep 22, 2016 2:42:38 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING state.
> java.lang.NullPointerException: Unknown Mesos resource: name: "test"
> type: SCALAR
> scalar {
>   value: 200.0
> }
> role: "*"
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355)
>     at org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52)
>     at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>     at java.util.Iterator.forEachRemaining(Iterator.java:115)
>     at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>     at org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274)
>     at org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239)
>     at org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153)
>     at org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168)
>     at org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130)
>     at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>     at com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.ru
[jira] [Resolved] (AURORA-1785) Populate curator latches with scheduler information
[ https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb resolved AURORA-1785.
---------------------------------
    Resolution: Fixed

> Populate curator latches with scheduler information
> ----------------------------------------------------
>
>                 Key: AURORA-1785
>                 URL: https://issues.apache.org/jira/browse/AURORA-1785
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Zameer Manji
>            Assignee: Jing Chen
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.17.0
>
>
> If you look at the Mesos ZK node for leader election you see something like this:
> {noformat}
> u'json.info_000104',
> u'json.info_000102',
> u'json.info_000101',
> u'json.info_98',
> u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for leadership.
> It is a JSON-serialized {{MasterInfo}} protobuf. This means an operator can
> inspect who is contending for leadership by checking the content of the nodes.
> When you check the Aurora ZK node you see something like this:
> {noformat}
> u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
> u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
> u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
> u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
> u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
> u'member_000781'
> {noformat}
> Only the leader node contains information; the curator latches contain none.
> It is not possible to figure out which machines are contending for leadership
> purely from ZK.
> I think we should attach data to the latches like Mesos does.
> Being able to do this is invaluable for debugging issues if an extra master is
> added to the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information
[ https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636194#comment-15636194 ]

Stephan Erb commented on AURORA-1785:
--------------------------------------

RB https://reviews.apache.org/r/52665/ has landed. Thanks!

> Populate curator latches with scheduler information
> ----------------------------------------------------
>
>                 Key: AURORA-1785
>                 URL: https://issues.apache.org/jira/browse/AURORA-1785
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Zameer Manji
>            Assignee: Jing Chen
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.17.0
>
>
> If you look at the Mesos ZK node for leader election you see something like this:
> {noformat}
> u'json.info_000104',
> u'json.info_000102',
> u'json.info_000101',
> u'json.info_98',
> u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for leadership.
> It is a JSON-serialized {{MasterInfo}} protobuf. This means an operator can
> inspect who is contending for leadership by checking the content of the nodes.
> When you check the Aurora ZK node you see something like this:
> {noformat}
> u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
> u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
> u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
> u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
> u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
> u'member_000781'
> {noformat}
> Only the leader node contains information; the curator latches contain none.
> It is not possible to figure out which machines are contending for leadership
> purely from ZK.
> I think we should attach data to the latches like Mesos does.
> Being able to do this is invaluable for debugging issues if an extra master is
> added to the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
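The gap described in the ticket is easy to see from an operator's shell: the Mesos json.info_* znodes carry a MasterInfo payload, while the Aurora latch znodes come back empty. A small inspection sketch using kazoo; the ZK connection string and the latch path are placeholders, not values taken from the ticket:

{code}
from kazoo.client import KazooClient

# Placeholder ensemble and znode path; substitute your cluster's ZooKeeper
# hosts and the scheduler's leader-election path.
ZK_HOSTS = 'localhost:2181'
LATCH_PATH = '/aurora/scheduler'

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
  for child in sorted(zk.get_children(LATCH_PATH)):
    data, stat = zk.get('%s/%s' % (LATCH_PATH, child))
    # Before AURORA-1785, only the member_... leader node has a payload;
    # the _c_...-latch-... contenders carry no data.
    print('%s -> %r' % (child, data))
finally:
  zk.stop()
{code}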