[GitHub] vinodkone merged pull request #294: FaultDomain, conventions for additional hierarchy.

2019-02-15 Thread GitBox
vinodkone merged pull request #294: FaultDomain, conventions for additional 
hierarchy.
URL: https://github.com/apache/mesos/pull/294
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Re: Review Request 69957: Updated master operation handling to account for new agent capability.

2019-02-15 Thread Gastón Kleiman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69957/#review212883
---


Ship it!




Ship It!

- Gastón Kleiman


On Feb. 13, 2019, 8:09 p.m., Greg Mann wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69957/
> ---
> 
> (Updated Feb. 13, 2019, 8:09 p.m.)
> 
> 
> Review request for mesos and Gastón Kleiman.
> 
> 
> Bugs: MESOS-9535
> https://issues.apache.org/jira/browse/MESOS-9535
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Since the new AGENT_OPERATION_FEEDBACK capability has been
> made required for agent startup, we need to update the way
> the master handles validation of incoming operations and
> acknowledgement calls to account for cases where each of
> the AGENT_OPERATION_FEEDBACK and RESOURCE_PROVIDER
> capabilities are enabled/disabled.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp cf2210ec26642028d5e4fb7fc1841eb0a1ed3396 
> 
> 
> Diff: https://reviews.apache.org/r/69957/diff/4/
> 
> 
> Testing
> ---
> 
> Testing details at the end of this chain.
> 
> 
> Thanks,
> 
> Greg Mann
> 
>



Re: Review Request 69986: Added a log line to isolator for simplified debugging.

2019-02-15 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69986/#review212882
---




src/slave/containerizer/mesos/isolators/filesystem/linux.cpp
Lines 884-886 (patched)


Hey Alex, IIUC this log will only be printed out if there are more than one 
executors sharing the same persistent volume. We can keep this log though, it 
may not be too meaningful once we have the GidManager (merge next week).

> Shall we add a log line for the case when the task is running under a 
different user than its default executor?

Per your comment, nested container will not exercise this function. So not 
related.


- Gilbert Song


On Feb. 15, 2019, 12:54 a.m., Alexander Rukletsov wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69986/
> ---
> 
> (Updated Feb. 15, 2019, 12:54 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a log line to isolator for simplified debugging.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/filesystem/linux.cpp 
> 2a9ea448d7f963f86e8b2909d83e82b498e4104c 
> 
> 
> Diff: https://reviews.apache.org/r/69986/diff/1/
> 
> 
> Testing
> ---
> 
> None: not a functional change.
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>



Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

2019-02-15 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212881
---




src/slave/slave.cpp
Line 4711 (original), 4735 (patched)


By moving this line into the `.then()` continuation, we will no longer 
remove the operation if there is an error when calling 
`operationStatusUpdateManager.acknowledgement()`; is that what we want? Perhaps 
we should add a `removeOperation()` call in the `err()` helper as well?


- Greg Mann


On Feb. 13, 2019, 11:15 p.m., Gastón Kleiman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> ---
> 
> (Updated Feb. 13, 2019, 11:15 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
> https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/2/
> 
> 
> Testing
> ---
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>



Re: Review Request 69994: Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.

2019-02-15 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69994/#review212879
---



Patch looks great!

Reviews applied: [69972, 69994]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers 
--disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; 
./support/docker-build.sh

- Mesos Reviewbot


On Feb. 15, 2019, 9:08 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69994/
> ---
> 
> (Updated Feb. 15, 2019, 9:08 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.
> 
> 
> Diffs
> -
> 
>   src/tests/containerizer/docker_volume_isolator_tests.cpp 
> 1503290bf62d5718df0a530a660b519649e76789 
> 
> 
> Diff: https://reviews.apache.org/r/69994/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> This test will fail without the patch https://reviews.apache.org/r/69972
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69981: Fixed a flaky test `MasterQuotaTest.RemoveSingleQuota`.

2019-02-15 Thread Meng Zhu


> On Feb. 14, 2019, 9:12 a.m., Benjamin Mahler wrote:
> > src/tests/master_quota_tests.cpp
> > Lines 524-525 (patched)
> > 
> >
> > Hm.. we're actually waiting for the 200 OK, is this needed because the 
> > http handler doesn't wait for the allocator to finish processing the update?
> > 
> > Probably we should spell it out here and below so it's clear

Updated the comment saying we are waiting specifically for the metrics update 
to finish, which is not on the path for 200 OK.


- Meng


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69981/#review212833
---


On Feb. 13, 2019, 4:27 p.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69981/
> ---
> 
> (Updated Feb. 13, 2019, 4:27 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9143
> https://issues.apache.org/jira/browse/MESOS-9143
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The test is flaky due to a race between metrics update
> and metrics query.
> 
> This patch adds clock settle to ensure quota update and
> removal are fully processed (including metrics updates) before
> continuing with the metrics query.
> 
> 
> Diffs
> -
> 
>   src/tests/master_quota_tests.cpp 354a9e988c3e4cdf94c99170d2abb7cb17f9e11a 
> 
> 
> Diff: https://reviews.apache.org/r/69981/diff/2/
> 
> 
> Testing
> ---
> 
> Test ran continuously without failure
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 69615: Disable containerizer ptrace attach.

2019-02-15 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69615/#review212878
---



FAIL: Failed to apply the current review.

Failed command: `python.exe .\support\apply-reviews.py -n -r 69615`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2894/mesos-review-69615

Relevant logs:

- 
[apply-review-69615.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2894/mesos-review-69615/logs/apply-review-69615.log):

```
error: patch failed: src/slave/containerizer/mesos/containerizer.cpp:1944
error: src/slave/containerizer/mesos/containerizer.cpp: patch does not apply
```

- Mesos Reviewbot Windows


On Feb. 8, 2019, 9:09 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69615/
> ---
> 
> (Updated Feb. 8, 2019, 9:09 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9349
> https://issues.apache.org/jira/browse/MESOS-9349
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Use `prctl(PR_SET_DUMPABLE)` to disable the ability to attach to
> the containerizer process(es) on Linux systems. This prevents
> unprivileged containerized processes from reading information
> about the containerizer process(es) from `/proc`. This gives an
> additional layer of protection against leaking information to
> untrusted container processes.
> 
> 
> Diffs
> -
> 
>   docs/configuration/agent.md a4015c409d00a4c117df6c869d5ba5072bcfe58e 
>   src/launcher/default_executor.cpp 5837cfa4deba557cae43112092ff24b97137951f 
>   src/launcher/executor.cpp f962e800f23d5582b1bc04a263253893492a5054 
>   src/slave/containerizer/mesos/containerizer.cpp 
> 35f51ad33da53b3e6a8eec275fbf3e77782b0fba 
>   src/slave/containerizer/mesos/launch.hpp 
> 0a6394d56321948ad760ac69c05456319a254842 
>   src/slave/containerizer/mesos/launch.cpp 
> 7f401cdf481123b8c6cc500ac02bb7daf2613d2c 
>   src/slave/flags.hpp 7346ba5b711a8353a4bc1d7dcd2f6184b777ddd0 
>   src/slave/flags.cpp 066b84f528b4c888dde399e0b5d5fe5531934de6 
>   src/slave/slave.cpp 0182dd2ca326723e96eef8c072696ad3c873de0b 
>   src/tests/containerizer/mesos_containerizer_tests.cpp 
> 449928c10b897061642af8ad267f8b70695940e6 
>   src/tests/slave_tests.cpp 22a0295086ae4f4ec26df00a0e077eecfa27f1fb 
> 
> 
> Diff: https://reviews.apache.org/r/69615/diff/3/
> 
> 
> Testing
> ---
> 
> make check (Fedora 29)
> 
> 
> Thanks,
> 
> James Peach
> 
>



Re: Review Request 69909: Tested unreachable task behavior on agent GC.

2019-02-15 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69909/#review212877
---



PASS: Mesos patch 69909 was successfully built and tested.

Reviews applied: `['69907', '69908', '69909']`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2893/mesos-review-69909

- Mesos Reviewbot Windows


On Feb. 15, 2019, 6:03 p.m., Vinod Kone wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69909/
> ---
> 
> (Updated Feb. 15, 2019, 6:03 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman and Greg Mann.
> 
> 
> Bugs: MESOS-8887
> https://issues.apache.org/jira/browse/MESOS-8887
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Updated `PartitionTest, RegistryGcByCount` test. This test fails
> without the previous patch.
> 
> 
> Diffs
> -
> 
>   src/tests/partition_tests.cpp 15af47b7d4328861a7523e669c2afc6a6574166b 
> 
> 
> Diff: https://reviews.apache.org/r/69909/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>



Re: Review Request 69909: Tested unreachable task behavior on agent GC.

2019-02-15 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69909/
---

(Updated Feb. 15, 2019, 6:03 p.m.)


Review request for mesos, Gastón Kleiman and Greg Mann.


Changes
---

Addressed Gaston's comments. NNFR.


Bugs: MESOS-8887
https://issues.apache.org/jira/browse/MESOS-8887


Repository: mesos


Description
---

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.


Diffs (updated)
-

  src/tests/partition_tests.cpp 15af47b7d4328861a7523e669c2afc6a6574166b 


Diff: https://reviews.apache.org/r/69909/diff/2/

Changes: https://reviews.apache.org/r/69909/diff/1-2/


Testing
---

make check


Thanks,

Vinod Kone



Re: Review Request 69907: Fixed variable names in `Master::_doRegistryGC()`.

2019-02-15 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69907/
---

(Updated Feb. 15, 2019, 6:02 p.m.)


Review request for mesos, Gastón Kleiman and Greg Mann.


Changes
---

Addressed gaston's comments. NNFR.


Repository: mesos


Description
---

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.


Diffs (updated)
-

  src/master/master.cpp 8e23785006b4f4962770d8bba163f977b8b33414 


Diff: https://reviews.apache.org/r/69907/diff/2/

Changes: https://reviews.apache.org/r/69907/diff/1-2/


Testing
---

make check


Thanks,

Vinod Kone



[GitHub] jdef commented on a change in pull request #294: FaultDomain, conventions for additional hierarchy.

2019-02-15 Thread GitBox
jdef commented on a change in pull request #294: FaultDomain, conventions for 
additional hierarchy.
URL: https://github.com/apache/mesos/pull/294#discussion_r257272993
 
 

 ##
 File path: include/mesos/mesos.proto
 ##
 @@ -827,6 +827,18 @@ message ExecutorInfo {
  * in the same zone, in different zones in the same region, or in
  * different regions. Note that all masters in a given Mesos cluster
  * must be in the same region.
+ *
+ * Complex deployments may have additional levels of hierarchy: for example,
+ * multiple racks might be grouped together into "halls" and multiple DCs in
+ * the same geographical vicinity might be grouped together. As a convention,
+ * the recommended way to represent additional levels of hierarchy is via dot-
 
 Review comment:
   @vinodkone did you have any other thoughts on this matter?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Re: Review Request 69994: Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.

2019-02-15 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69994/#review212869
---




src/tests/containerizer/docker_volume_isolator_tests.cpp
Lines 691 (patched)


The comment from the previous patch that describes the root cause of empty 
files might be helpful to understand why we need this test. This comment can be 
copy-pasted here.



src/tests/containerizer/docker_volume_isolator_tests.cpp
Lines 745-748 (patched)


If we need "options", we could pass `{{"iops", "150"}}` directly to the 
first invokation of `createDockerVolume`?



src/tests/containerizer/docker_volume_isolator_tests.cpp
Lines 851 (patched)


Consider `Owned mockClient`?


- Andrei Budnik


On Feb. 15, 2019, 9:08 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69994/
> ---
> 
> (Updated Feb. 15, 2019, 9:08 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.
> 
> 
> Diffs
> -
> 
>   src/tests/containerizer/docker_volume_isolator_tests.cpp 
> 1503290bf62d5718df0a530a660b519649e76789 
> 
> 
> Diff: https://reviews.apache.org/r/69994/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> This test will fail without the patch https://reviews.apache.org/r/69972
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-15 Thread Andrei Budnik


> On Feb. 13, 2019, 1:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > 
> >
> > Given that `state::checkpoint` is **atomic**, we can not end up in the 
> > state where the file is empty because the agent did not finish writing to 
> > it.
> > 
> > However, an empty file might occur in case of hard reboot of the 
> > agent's host. This happens because page cache is dumped every 20 seconds by 
> > default in Linux. There is a chance that the file is created, but data has 
> > not yet synced on disk.
> > 
> > As we have agreed with Gilbert, we need to ignore empty files **only** 
> > in case of orphan containers.
> 
> Qian Zhang wrote:
> > Given that state::checkpoint is atomic, we can not end up in the state 
> where the file is empty because the agent did not finish writing to it.
> > However, an empty file might occur in case of hard reboot of the 
> agent's host. This happens because page cache is dumped every 20 seconds by 
> default in Linux. There is a chance that the file is created, but data has 
> not yet synced on disk.
> 
> Agree. And I see the comment `// This could happen if the slave died 
> after opening the file for writing but before it checkpointed anything.` in a 
> couple of places in Mesos code (e.g., `slave/state.cpp`, 
> `metadata_manager.cpp`), I think those comments need to be updated as well.
> 
> > As we have agreed with Gilbert, we need to ignore empty files only in 
> case of orphan containers.
> 
> Can you please elaborate a bit? Why do we want to treat orphan containers 
> and recoverable containers differently? How will we handle the recoverable 
> containers in this case?

> I think those comments need to be updated as well.

We can update these comments in a separate patch later.

> Why do we want to treat orphan containers and recoverable containers 
> differently? How will we handle the recoverable containers in this case?

I think that `recoverable` containers could not have empty state files by 
construction as checkpointing state is atomic. If this invariant does not 
satisfy, then we definitely have a bug in our code which we need to fix.
In the case of hard reboots, this invariant might be broken, but all containers 
are `orphan`.

If the reasoning above looks acceptable, then we might want to recover after 
broken invariant for `orphan` containers, while keeping this error for 
`non-orphan` containers as an assertion (for us, developers) that the invariant 
could not be broken in normal circumstances.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69994: Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.

2019-02-15 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69994/#review212866
---



PASS: Mesos patch 69994 was successfully built and tested.

Reviews applied: `['69972', '69994']`

All the build artifacts available at: 
http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2892/mesos-review-69994

- Mesos Reviewbot Windows


On Feb. 15, 2019, 10:08 a.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69994/
> ---
> 
> (Updated Feb. 15, 2019, 10:08 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.
> 
> 
> Diffs
> -
> 
>   src/tests/containerizer/docker_volume_isolator_tests.cpp 
> 1503290bf62d5718df0a530a660b519649e76789 
> 
> 
> Diff: https://reviews.apache.org/r/69994/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> This test will fail without the patch https://reviews.apache.org/r/69972
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69994: Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.

2019-02-15 Thread Qian Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69994/
---

(Updated Feb. 15, 2019, 5:08 p.m.)


Review request for mesos, Andrei Budnik and Gilbert Song.


Bugs: MESOS-9507
https://issues.apache.org/jira/browse/MESOS-9507


Repository: mesos


Description
---

Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.


Diffs
-

  src/tests/containerizer/docker_volume_isolator_tests.cpp 
1503290bf62d5718df0a530a660b519649e76789 


Diff: https://reviews.apache.org/r/69994/diff/1/


Testing (updated)
---

sudo make check

This test will fail without the patch https://reviews.apache.org/r/69972


Thanks,

Qian Zhang



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-15 Thread Qian Zhang


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > Thanks for the patch!
> > I think we should implement a test for this. Otherwise, it would be very 
> > dangerous to _refactor_ this part of code in the future.
> > If you have no chance to implement a test for this now, please feel free to 
> > file a ticket.

Sure, I posted a unit test here: https://reviews.apache.org/r/69994/


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

2019-02-15 Thread Qian Zhang


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > 
> >
> > Given that `state::checkpoint` is **atomic**, we can not end up in the 
> > state where the file is empty because the agent did not finish writing to 
> > it.
> > 
> > However, an empty file might occur in case of hard reboot of the 
> > agent's host. This happens because page cache is dumped every 20 seconds by 
> > default in Linux. There is a chance that the file is created, but data has 
> > not yet synced on disk.
> > 
> > As we have agreed with Gilbert, we need to ignore empty files **only** 
> > in case of orphan containers.

> Given that state::checkpoint is atomic, we can not end up in the state where 
> the file is empty because the agent did not finish writing to it.
> However, an empty file might occur in case of hard reboot of the agent's 
> host. This happens because page cache is dumped every 20 seconds by default 
> in Linux. There is a chance that the file is created, but data has not yet 
> synced on disk.

Agree. And I see the comment `// This could happen if the slave died after 
opening the file for writing but before it checkpointed anything.` in a couple 
of places in Mesos code (e.g., `slave/state.cpp`, `metadata_manager.cpp`), I 
think those comments need to be updated as well.

> As we have agreed with Gilbert, we need to ignore empty files only in case of 
> orphan containers.

Can you please elaborate a bit? Why do we want to treat orphan containers and 
recoverable containers differently? How will we handle the recoverable 
containers in this case?


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
---


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> ---
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
> https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>



Re: Review Request 69986: Added a log line to isolator for simplified debugging.

2019-02-15 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69986/
---

(Updated Feb. 15, 2019, 8:54 a.m.)


Review request for mesos, Andrei Budnik and Gilbert Song.


Summary (updated)
-

Added a log line to isolator for simplified debugging.


Repository: mesos


Description (updated)
---

Added a log line to isolator for simplified debugging.


Diffs
-

  src/slave/containerizer/mesos/isolators/filesystem/linux.cpp 
2a9ea448d7f963f86e8b2909d83e82b498e4104c 


Diff: https://reviews.apache.org/r/69986/diff/1/


Testing
---

None: not a functional change.


Thanks,

Alexander Rukletsov



Review Request 69994: Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.

2019-02-15 Thread Qian Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69994/
---

Review request for mesos, Andrei Budnik and Gilbert Song.


Bugs: MESOS-9507
https://issues.apache.org/jira/browse/MESOS-9507


Repository: mesos


Description
---

Added a test `ROOT_EmptyCheckpointFileSlaveRecovery`.


Diffs
-

  src/tests/containerizer/docker_volume_isolator_tests.cpp 
1503290bf62d5718df0a530a660b519649e76789 


Diff: https://reviews.apache.org/r/69994/diff/1/


Testing
---


Thanks,

Qian Zhang



Re: Review Request 69986: Added a log line for simplified debugging.

2019-02-15 Thread Till Toenshoff via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69986/#review212862
---


Ship it!




Ship It!

- Till Toenshoff


On Feb. 14, 2019, 3:24 p.m., Alexander Rukletsov wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69986/
> ---
> 
> (Updated Feb. 14, 2019, 3:24 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a log line for simplified debugging.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/filesystem/linux.cpp 
> 2a9ea448d7f963f86e8b2909d83e82b498e4104c 
> 
> 
> Diff: https://reviews.apache.org/r/69986/diff/1/
> 
> 
> Testing
> ---
> 
> None: not a functional change.
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>