[jira] [Updated] (MESOS-4515) ContainerLoggerTest.LOGROTATE_RotateInSandbox breaks when running on Centos6.

2016-01-26 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4515:
-
Labels: mesosphere module test  (was: module test)

> ContainerLoggerTest.LOGROTATE_RotateInSandbox breaks when running on Centos6.
> -
>
> Key: MESOS-4515
> URL: https://issues.apache.org/jira/browse/MESOS-4515
> Project: Mesos
>  Issue Type: Bug
> Environment: Centos6, gcc-4.9.3
>Reporter: Till Toenshoff
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: mesosphere, module, test
> Fix For: 0.27.0
>
>
> {noformat}
> [17:24:58][Step 7/7] logrotate: bad argument --version: unknown error
> [17:24:58][Step 7/7] F0126 17:24:57.913729  4503 
> container_logger_tests.cpp:380] CHECK_SOME(containerizer): Failed to create 
> container logger: Failed to create container logger module 
> 'org_apache_mesos_LogrotateContainerLogger': Error creating Module instance 
> for 'org_apache_mesos_LogrotateContainerLogger' 
> [17:24:58][Step 7/7] *** Check failure stack trace: ***
> [17:24:58][Step 7/7] @ 0x7f11ae0d2d40  google::LogMessage::Fail()
> [17:24:58][Step 7/7] @ 0x7f11ae0d2c9c  google::LogMessage::SendToLog()
> [17:24:58][Step 7/7] @ 0x7f11ae0d2692  google::LogMessage::Flush()
> [17:24:58][Step 7/7] @ 0x7f11ae0d544c  
> google::LogMessageFatal::~LogMessageFatal()
> [17:24:58][Step 7/7] @   0x983927  _CheckFatal::~_CheckFatal()
> [17:24:58][Step 7/7] @   0xa9a18b  
> mesos::internal::tests::ContainerLoggerTest_LOGROTATE_RotateInSandbox_Test::TestBody()
> [17:24:58][Step 7/7] @  0x1623a4e  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [17:24:58][Step 7/7] @  0x161eab2  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [17:24:58][Step 7/7] @  0x15ffdfd  testing::Test::Run()
> [17:24:58][Step 7/7] @  0x160058b  testing::TestInfo::Run()
> [17:24:58][Step 7/7] @  0x1600bc6  testing::TestCase::Run()
> [17:24:58][Step 7/7] @  0x1607515  
> testing::internal::UnitTestImpl::RunAllTests()
> [17:24:58][Step 7/7] @  0x16246dd  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [17:24:58][Step 7/7] @  0x161f608  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [17:24:58][Step 7/7] @  0x1606245  testing::UnitTest::Run()
> [17:24:58][Step 7/7] @   0xde36b6  RUN_ALL_TESTS()
> [17:24:58][Step 7/7] @   0xde32cc  main
> [17:24:58][Step 7/7] @ 0x7f11a8896d5d  __libc_start_main
> [17:24:58][Step 7/7] @   0x981fc9  (unknown)
> {noformat}





[jira] [Updated] (MESOS-3889) Modify Oversubscription documentation to explicitly forbid the QoS Controller from killing executors running on optimistically offered resources.

2016-01-19 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3889:
-
Description: 
The oversubscription documentation currently assumes that oversubscribed 
resources ({{USAGE_SLACK}}) are the only type of revocable resources.  
Optimistic offers will add a second type of revocable resource 
({{ALLOCATION_SLACK}}) that should not be acted upon by oversubscription 
components.

For example, the [oversubscription 
doc|http://mesos.apache.org/documentation/latest/oversubscription/] says the 
following:
{quote}
NOTE: If any resource used by a task or executor is revocable, the whole 
container is treated as a revocable container and can therefore be killed or 
throttled by the QoS Controller.
{quote}
which we may amend to something like:
{quote}
NOTE: If any resource used by a task or executor is revocable usage slack, the 
whole container is treated as an oversubscribed container and can therefore be 
killed or throttled by the QoS Controller.
{quote}

> Modify Oversubscription documentation to explicitly forbid the QoS Controller 
> from killing executors running on optimistically offered resources.
> -
>
> Key: MESOS-3889
> URL: https://issues.apache.org/jira/browse/MESOS-3889
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> The oversubscription documentation currently assumes that oversubscribed 
> resources ({{USAGE_SLACK}}) are the only type of revocable resources.  
> Optimistic offers will add a second type of revocable resource 
> ({{ALLOCATION_SLACK}}) that should not be acted upon by oversubscription 
> components.
> For example, the [oversubscription 
> doc|http://mesos.apache.org/documentation/latest/oversubscription/] says the 
> following:
> {quote}
> NOTE: If any resource used by a task or executor is revocable, the whole 
> container is treated as a revocable container and can therefore be killed or 
> throttled by the QoS Controller.
> {quote}
> which we may amend to something like:
> {quote}
> NOTE: If any resource used by a task or executor is revocable usage slack, 
> the whole container is treated as an oversubscribed container and can 
> therefore be killed or throttled by the QoS Controller.
> {quote}





[jira] [Commented] (MESOS-4111) Provide a means for libprocess users to exit while ensuring messages are flushed.

2016-01-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107194#comment-15107194
 ] 

Joseph Wu commented on MESOS-4111:
--

{{process::finalize}} only waits for the event queue of each process to 
drain.  (It does this by putting a {{TerminateEvent}} at the back of the 
queue.)

Writes to a socket (or any FD) do not have events, so you'd need to augment 
{{process::finalize}} to clean up and flush sockets too.  This 
[patch|https://reviews.apache.org/r/40266] is part of a chain that does 
something similar.
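
To make the gap concrete, here is a minimal sketch of both sides; the 
Future-returning {{send}} is hypothetical (approach (1) in the description 
below), not an existing libprocess API:
{code}
// Today's workaround (see MESOS-243, MESOS-4106): sleep after send()
// and hope the kernel flushes the socket before the process exits.
send(master, message);
os::sleep(Seconds(1));

// Hypothetical approach (1): a send() that returns a Future which is
// satisfied once the message has been flushed to the socket.
//
//   send(master, message)
//     .await(Seconds(5));  // A timer still bounds termination.
{code}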

> Provide a means for libprocess users to exit while ensuring messages are 
> flushed.
> -
>
> Key: MESOS-4111
> URL: https://issues.apache.org/jira/browse/MESOS-4111
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Priority: Minor
>
> Currently after a {{send}} there is no way to ensure that the message is 
> flushed on the socket before terminating. We work around this by inserting 
> {{os::sleep}} calls (see MESOS-243, MESOS-4106).
> There are a number of approaches to this:
> (1) Return a Future from send that notifies when the message is flushed from 
> the system.
> (2) Call process::finalize before exiting. This would require that 
> process::finalize flushes all of the outstanding data on any active sockets, 
> which may block.
> Regardless of the approach, there needs to be a timer if we want to guarantee 
> termination.





[jira] [Commented] (MESOS-4384) Documentation cannot link to external URLs that end in .md

2016-01-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102015#comment-15102015
 ] 

Joseph Wu commented on MESOS-4384:
--

Note: I modified the regex here:
https://reviews.apache.org/r/42172/
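
For reference, a quick Ruby sketch of why the old pattern was too general, 
plus one possible narrowing (illustrative only; the actual change is in the 
review above):
{code}
# The old rewrite also mangles absolute URLs:
'[label](https://test.com/foo.md)'.gsub(/\((.*)(\.md)\)/, '(/documentation/latest/\1/)')
# => "[label](/documentation/latest/https://test.com/foo/)"

# Illustrative narrowing: skip any link target containing a URL scheme.
'[label](https://test.com/foo.md)'.gsub(/\(([^:)]+)(\.md)\)/, '(/documentation/latest/\1/)')
# => "[label](https://test.com/foo.md)"  (external link left untouched)
{code}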

> Documentation cannot link to external URLs that end in .md
> --
>
> Key: MESOS-4384
> URL: https://issues.apache.org/jira/browse/MESOS-4384
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Joerg Schad
>Priority: Minor
>  Labels: documentation, mesosphere
>
> Per [~joerg84]: "In fact it seems that all links ending with .md are 
> interpreted as relative links on the webpage, i.e. [label](https://test.com/foo.md) 
> is rendered into a link to https://test.com/foo/ with the link text "label"."
> Currently the rakefile rewrites all links with this too-general regex:
> {code}
> f.read.gsub(/\((.*)(\.md)\)/, '(/documentation/latest/\1/)')
> {code}





[jira] [Comment Edited] (MESOS-4136) Add a ContainerLogger module that restrains log sizes

2016-01-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074359#comment-15074359
 ] 

Joseph Wu edited comment on MESOS-4136 at 1/15/16 7:34 PM:
---

|| Review || Summary ||
| https://reviews.apache.org/r/42052/ | Cleanup/Refactor of {{Subprocess::IO}} |
| https://reviews.apache.org/r/41779/
https://reviews.apache.org/r/41780/ | Add non-dup option to 
{{Subprocess::IO::FD}} |
| https://reviews.apache.org/r/42358/ | Refactor {{SandboxContainerLogger}} |
| https://reviews.apache.org/r/41781/ | Add rotating logger test |
| https://reviews.apache.org/r/41782/ | Makefile and test config changes |
| https://reviews.apache.org/r/41783/ | Implement the rotating logger |


was (Author: kaysoky):
|| Review || Summary ||
| https://reviews.apache.org/r/42052/ | Cleanup/Refactor of {{Subprocess::IO}} |
| https://reviews.apache.org/r/41779/
https://reviews.apache.org/r/41780/ | Add non-dup option to 
{{Subprocess::IO::FD}} |
| https://reviews.apache.org/r/41781/ | Add rotating logger test |
| https://reviews.apache.org/r/41782/ | Makefile and test config changes |
| https://reviews.apache.org/r/41783/ | Implement the rotating logger |

> Add a ContainerLogger module that restrains log sizes
> -
>
> Key: MESOS-4136
> URL: https://issues.apache.org/jira/browse/MESOS-4136
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> One of the major problems this logger module aims to solve is overflowing 
> executor/task log files.  Log files are simply written to disk, and are not 
> managed other than via occasional garbage collection by the agent process 
> (and this only deals with terminated executors).
> We should add a {{ContainerLogger}} module that truncates logs as they reach 
> a configurable maximum size.  Additionally, we should determine if the web 
> UI's {{pailer}} needs to be changed to deal with logs that are not 
> append-only.
> This will be a non-default module which will also serve as an example for how 
> to implement the module.





[jira] [Commented] (MESOS-4412) MesosZookeeperTest doesn't allow multiple masters

2016-01-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102827#comment-15102827
 ] 

Joseph Wu commented on MESOS-4412:
--

Thanks for confirming this!  (See [MESOS-2976].)

Could you post the code you tried to use to start a second master?  (Possibly 
as a review.)

> MesosZookeeperTest doesn't allow multiple masters
> -
>
> Key: MESOS-4412
> URL: https://issues.apache.org/jira/browse/MESOS-4412
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
>Reporter: Dario Rexin
>
> In order to test certain behavior of non-leading nodes - e.g. redirecting to 
> the leading master when sending http api requests to a non-leading node - it 
> would be helpful to be able to spin up multiple masters in the test. The 
> ZooKeeperTest class should allow to do this, but fails when more than one 
> master is being started. The test will run into a timeout when the second 
> master is being started and exit with an error.





[jira] [Comment Edited] (MESOS-4136) Add a ContainerLogger module that restrains log sizes

2016-01-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074359#comment-15074359
 ] 

Joseph Wu edited comment on MESOS-4136 at 1/16/16 12:30 AM:


|| Review || Summary ||
| https://reviews.apache.org/r/42052/ | Cleanup/Refactor of {{Subprocess::IO}} |
| https://reviews.apache.org/r/41779/
https://reviews.apache.org/r/41780/ | Add non-dup option to 
{{Subprocess::IO::FD}} |
| https://reviews.apache.org/r/42358/ | Refactor {{SandboxContainerLogger}} |
| https://reviews.apache.org/r/42374/ | Add {{LOGROTATE}} test filter |
| https://reviews.apache.org/r/41781/ | Add rotating logger test |
| https://reviews.apache.org/r/41782/ | Makefile and test config changes |
| https://reviews.apache.org/r/41783/ | Implement the rotating logger |


was (Author: kaysoky):
|| Review || Summary ||
| https://reviews.apache.org/r/42052/ | Cleanup/Refactor of {{Subprocess::IO}} |
| https://reviews.apache.org/r/41779/
https://reviews.apache.org/r/41780/ | Add non-dup option to 
{{Subprocess::IO::FD}} |
| https://reviews.apache.org/r/42358/ | Refactor {{SandboxContainerLogger}} |
| https://reviews.apache.org/r/41781/ | Add rotating logger test |
| https://reviews.apache.org/r/41782/ | Makefile and test config changes |
| https://reviews.apache.org/r/41783/ | Implement the rotating logger |

> Add a ContainerLogger module that restrains log sizes
> -
>
> Key: MESOS-4136
> URL: https://issues.apache.org/jira/browse/MESOS-4136
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> One of the major problems this logger module aims to solve is overflowing 
> executor/task log files.  Log files are simply written to disk, and are not 
> managed other than via occasional garbage collection by the agent process 
> (and this only deals with terminated executors).
> We should add a {{ContainerLogger}} module that truncates logs as they reach 
> a configurable maximum size.  Additionally, we should determine if the web 
> UI's {{pailer}} needs to be changed to deal with logs that are not 
> append-only.
> This will be a non-default module which will also serve as an example for how 
> to implement the module.





[jira] [Updated] (MESOS-4383) Support docker runtime configuration env var from image.

2016-01-14 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4383:
-
Summary: Support docker runtime configuration env var from image.  (was: 
Support docer runtime configuration env var from image.)

> Support docker runtime configuration env var from image.
> 
>
> Key: MESOS-4383
> URL: https://issues.apache.org/jira/browse/MESOS-4383
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: mesosphere, unified-containerizer-mvp
>
> We need to support env var configuration returned from docker image in mesos 
> containerizer.





[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-14 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099057#comment-15099057
 ] 

Joseph Wu commented on MESOS-4301:
--

Minimal patch just to silence the log line:
https://reviews.apache.org/r/42324/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Created] (MESOS-4385) Offers and InverseOffers cannot be accepted in the same ACCEPT call

2016-01-14 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4385:


 Summary: Offers and InverseOffers cannot be accepted in the same 
ACCEPT call
 Key: MESOS-4385
 URL: https://issues.apache.org/jira/browse/MESOS-4385
 Project: Mesos
  Issue Type: Bug
  Components: framework, master
Affects Versions: 0.25.0
Reporter: Joseph Wu
Assignee: Joseph Wu


*Problem*
* In {{Master::accept}}, {{validation::offer::validate}} returns an error when 
an {{InverseOffer}} is included in the list of {{OfferIDs}} in an {{ACCEPT}} 
call.
* If an {{Offer}} is part of the same {{ACCEPT}}, the master sees 
{{error.isSome()}} and returns a {{TASK_LOST}} for normal offers.  
(https://github.com/apache/mesos/blob/fafbdca610d0a150b9fa9cb62d1c63cb7a6fdaf3/src/master/master.cpp#L3117)

Here's a regression test:
https://reviews.apache.org/r/42092/
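
A toy illustration of the failure mode (hypothetical types, not the actual 
master code): validating the {{OfferID}} list as a single unit means one 
{{InverseOffer}} fails the whole call.
{code}
#include <iostream>
#include <string>
#include <vector>

struct OfferID { std::string value; bool inverse; };

// Batch validation, mirroring the current code path: any inverse offer
// produces an error, and that error triggers TASK_LOST for every task.
bool validate(const std::vector<OfferID>& ids) {
  for (const OfferID& id : ids) {
    if (id.inverse) {
      return false;  // One bad ID poisons the entire ACCEPT.
    }
  }
  return true;
}

int main() {
  // O1 is a perfectly valid regular offer; O2 is an inverse offer.
  std::vector<OfferID> ids = {{"O1", false}, {"O2", true}};

  if (!validate(ids)) {
    std::cout << "ACCEPT rejected; tasks on O1 are also TASK_LOST" << std::endl;
  }
  return 0;
}
{code}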

*Proposal*
The question is whether we want to allow the mixing of {{Offers}} and 
{{InverseOffers}}.

Arguments for mixing:
* The design/structure of the maintenance originally intended to overload 
{{ACCEPT}} and {{DECLINE}} to take inverse offers.
* Enforcing non-mixing may require breaking changes to {{scheduler.proto}}.

Arguments against mixing:
* Some semantics are difficult to explain.  What does it mean to supply 
{{InverseOffers}} with {{Offer::Operations}}?  What about {{DECLINE}} with 
{{Offers}} and {{InverseOffers}}, including a "reason"?
* What happens if we presumably add a third type of offer?
* Does it make sense to return {{TASK_LOST}} for valid normal offers if 
{{InverseOffers}} are invalid?





[jira] [Comment Edited] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-14 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090173#comment-15090173
 ] 

Joseph Wu edited comment on MESOS-4301 at 1/14/16 9:36 PM:
---

Review to:
* Fix the logging.
* Fix the bug found above.
* Refactor {{Master::accept}} to read more sequentially.

https://reviews.apache.org/r/42086/ (discarded)


was (Author: kaysoky):
Review to:
* Fix the logging.
* Fix the bug found above.
* Refactor {{Master::accept}} to read more sequentially.

https://reviews.apache.org/r/42086/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Comment Edited] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-14 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090140#comment-15090140
 ] 

Joseph Wu edited comment on MESOS-4301 at 1/14/16 9:36 PM:
---

While fixing this log line, found another bug.

Essentially:
# {{validation::offer::validate}} returns an error when an {{InverseOffer}} is 
accepted.
# If an {{Offer}} is part of the same {{Call::ACCEPT}}, the master sees 
{{error.isSome()}} and returns a {{TASK_LOST}} for normal offers.  
(https://github.com/apache/mesos/blob/fafbdca610d0a150b9fa9cb62d1c63cb7a6fdaf3/src/master/master.cpp#L3117)

Regression test:
https://reviews.apache.org/r/42092/ (discarded)


was (Author: kaysoky):
While fixing this log line, found another bug.

Essentially:
# {{validation::offer::validate}} returns an error when an {{InverseOffer}} is 
accepted.
# If an {{Offer}} is part of the same {{Call::ACCEPT}}, the master sees 
{{error.isSome()}} and returns a {{TASK_LOST}} for normal offers.  
(https://github.com/apache/mesos/blob/fafbdca610d0a150b9fa9cb62d1c63cb7a6fdaf3/src/master/master.cpp#L3117)

Regression test:
https://reviews.apache.org/r/42092/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Comment Edited] (MESOS-4136) Add a ContainerLogger module that restrains log sizes

2016-01-13 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074359#comment-15074359
 ] 

Joseph Wu edited comment on MESOS-4136 at 1/13/16 8:19 PM:
---

|| Review || Summary ||
| https://reviews.apache.org/r/42052/
https://reviews.apache.org/r/42059/ | Type-ification of {{Subprocess::IO}} |
| https://reviews.apache.org/r/41779/
https://reviews.apache.org/r/41780/ | Add non-dup option to 
{{Subprocess::IO::FD}} |
| https://reviews.apache.org/r/41781/ | Add rotating logger test |
| https://reviews.apache.org/r/41782/ | Makefile and test config changes |
| https://reviews.apache.org/r/41783/ | Implement the rotating logger |


was (Author: kaysoky):
|| Review || Summary ||
| https://reviews.apache.org/r/41779/
https://reviews.apache.org/r/41780/ | Change {{Subprocess::IO::FD}} |
| https://reviews.apache.org/r/41781/ | Add rotating logger test | 
| https://reviews.apache.org/r/41782/ | Makefile and test config changes |
| https://reviews.apache.org/r/41783/ | Implement the rotating logger |

> Add a ContainerLogger module that restrains log sizes
> -
>
> Key: MESOS-4136
> URL: https://issues.apache.org/jira/browse/MESOS-4136
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> One of the major problems this logger module aims to solve is overflowing 
> executor/task log files.  Log files are simply written to disk, and are not 
> managed other than via occasional garbage collection by the agent process 
> (and this only deals with terminated executors).
> We should add a {{ContainerLogger}} module that truncates logs as they reach 
> a configurable maximum size.  Additionally, we should determine if the web 
> UI's {{pailer}} needs to be changed to deal with logs that are not 
> append-only.
> This will be a non-default module which will also serve as an example for how 
> to implement the module.





[jira] [Updated] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-12 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4301:
-
Target Version/s: 0.27.0

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Updated] (MESOS-3820) Test-only libprocess reinitialization

2016-01-12 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3820:
-
Epic Name: libprocess-finalize

> Test-only libprocess reinitialization
> -
>
> Key: MESOS-3820
> URL: https://issues.apache.org/jira/browse/MESOS-3820
> Project: Mesos
>  Issue Type: Epic
>  Components: libprocess, test
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> *Background*
> Libprocess initialization includes the spawning of a variety of global 
> processes and the creation of the server socket which listens for incoming 
> requests.  Some properties of the server socket are configured via 
> environment variables, such as the IP and port or the SSL configuration.
> In the case of tests, libprocess is initialized once per test binary.  This 
> means that testing different configurations (SSL in particular) is cumbersome 
> as a separate process would be needed for every test case.
> *Proposal*
> # Add some optional code between some tests like:
> {code}
> // Cleanup all of libprocess's state, as if we're starting anew.
> process::finalize(); 
> // For tests that need to test SSL connections with the Master:
> openssl::reinitialize();
> process::initialize();
> {code}
> See [MESOS-3863] for more on {{process::finalize}}.





[jira] [Updated] (MESOS-3820) Test-only libprocess reinitialization

2016-01-12 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3820:
-
Description: 
*Background*
Libprocess initialization includes the spawning of a variety of global 
processes and the creation of the server socket which listens for incoming 
requests.  Some properties of the server socket are configured via environment 
variables, such as the IP and port or the SSL configuration.

In the case of tests, libprocess is initialized once per test binary.  This 
means that testing different configurations (SSL in particular) is cumbersome 
as a separate process would be needed for every test case.

*Proposal*
# Add some optional code between some tests like:
{code}
// Cleanup all of libprocess's state, as if we're starting anew.
process::finalize(); 

// For tests that need to test SSL connections with the Master:
openssl::reinitialize();

process::initialize();
{code}
See [MESOS-3863] for more on {{process::finalize}}.

  was:
*Background*
Libprocess initialization includes the spawning of a variety of global 
processes and the creation of the server socket which listens for incoming 
requests.  Some properties of the server socket are configured via environment 
variables, such as the IP and port or the SSL configuration.

In the case of tests, libprocess is initialized once per test binary.  This 
means that testing different configurations (SSL in particular) is cumbersome 
as a separate process would be needed for every test case.

*Proposal* (Still under investigation)
# Investigate using {{process::finalize}} to completely clean up libprocess.  
See [MESOS-3863].
# Add a test-only {{process::reinitialize}} function, which should be roughly 
equivalent to a first-time run of {{process::initialize}}.

-*Proposal to swap out server socket*- (Does not work)
# Follow the [example of the SSL 
library|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/openssl.cpp#L280]
 and allow tests to declare an internal function for re-initializing a portion 
of libprocess.
# Move the [existing creation of the server 
socket|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L852-L856]
 into a {{reinitialize_server_socket}} function.
# Add any necessary cleanup for swapping server sockets.
# Consider whether any additional locking is required in the 
{{reinitialize_server_socket}} function.

 Issue Type: Epic  (was: Story)
Summary: Test-only libprocess reinitialization  (was: Refactor 
libprocess initialization to allow for test-only reinitialization of the server 
socket)

> Test-only libprocess reinitialization
> -
>
> Key: MESOS-3820
> URL: https://issues.apache.org/jira/browse/MESOS-3820
> Project: Mesos
>  Issue Type: Epic
>  Components: libprocess, test
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> *Background*
> Libprocess initialization includes the spawning of a variety of global 
> processes and the creation of the server socket which listens for incoming 
> requests.  Some properties of the server socket are configured via 
> environment variables, such as the IP and port or the SSL configuration.
> In the case of tests, libprocess is initialized once per test binary.  This 
> means that testing different configurations (SSL in particular) is cumbersome 
> as a separate process would be needed for every test case.
> *Proposal*
> # Add some optional code between some tests like:
> {code}
> // Cleanup all of libprocess's state, as if we're starting anew.
> process::finalize(); 
> // For tests that need to test SSL connections with the Master:
> openssl::reinitialize();
> process::initialize();
> {code}
> See [MESOS-3863] for more on {{process::finalize}}.





[jira] [Commented] (MESOS-4136) Add a ContainerLogger module that restrains log sizes

2016-01-12 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095438#comment-15095438
 ] 

Joseph Wu commented on MESOS-4136:
--

This should be able to make it before the end of the week.

> Add a ContainerLogger module that restrains log sizes
> -
>
> Key: MESOS-4136
> URL: https://issues.apache.org/jira/browse/MESOS-4136
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> One of the major problems this logger module aims to solve is overflowing 
> executor/task log files.  Log files are simply written to disk, and are not 
> managed other than via occasional garbage collection by the agent process 
> (and this only deals with terminated executors).
> We should add a {{ContainerLogger}} module that truncates logs as they reach 
> a configurable maximum size.  Additionally, we should determine if the web 
> UI's {{pailer}} needs to be changed to deal with logs that are not 
> append-only.
> This will be a non-default module which will also serve as an example for how 
> to implement the module.





[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092373#comment-15092373
 ] 

Joseph Wu commented on MESOS-4306:
--

In the case of a random failure, even the master does not know whether the 
machine is gone temporarily (e.g. flaky network) or permanently (e.g. the 
machine exploded).

> AGENT_DEAD Message
> --
>
> Key: MESOS-4306
> URL: https://issues.apache.org/jira/browse/MESOS-4306
> Project: Mesos
>  Issue Type: Task
>Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  In the case where an Agent hosting significant data 
> (multi terabyte) the framework may be willing to wait a significant amount of 
> time before repairing its replication factor (for example).  Explicit human 
> provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.





[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090140#comment-15090140
 ] 

Joseph Wu commented on MESOS-4301:
--

While fixing this log line, found another bug.

Essentially:
# {{validation::offer::validate}} returns an error when an {{InverseOffer}} is 
accepted.
# If an {{Offer}} is part of the same {{Call::ACCEPT}}, the master sees 
{{error.isSome()}} and returns a {{TASK_LOST}} for normal offers.  
(https://github.com/apache/mesos/blob/fafbdca610d0a150b9fa9cb62d1c63cb7a6fdaf3/src/master/master.cpp#L3117)

Regression test:
https://reviews.apache.org/r/42092/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090173#comment-15090173
 ] 

Joseph Wu commented on MESOS-4301:
--

Review to fix the logging and the regression test above:
https://reviews.apache.org/r/42086/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Comment Edited] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090173#comment-15090173
 ] 

Joseph Wu edited comment on MESOS-4301 at 1/9/16 1:57 AM:
--

Review to:
* Fix the logging.
* Fix the bug found above.
* Refactor {{Master::accept}} to read more sequentially.

https://reviews.apache.org/r/42086/


was (Author: kaysoky):
Review to fix the logging and the regression test above:
https://reviews.apache.org/r/42086/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088385#comment-15088385
 ] 

Joseph Wu commented on MESOS-4306:
--

The {{/maintenance/status}} endpoint only returns the machine's state (i.e. "the 
machine is DOWN").  But you can {{GET /maintenance/schedule}} to check if the 
duration is infinite :)
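
Roughly, that check could look like this (response shape abridged from 
{{maintenance.proto}}; host and values are placeholders, and an absent 
{{duration}} is what makes a window effectively infinite):
{code}
$ curl http://master:5050/master/maintenance/schedule
{
  "windows": [{
    "machine_ids": [{"hostname": "agent1", "ip": "10.0.0.1"}],
    "unavailability": {"start": {"nanoseconds": 1452000000000000000}}
  }]
}
{code}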

> AGENT_DEAD Message
> --
>
> Key: MESOS-4306
> URL: https://issues.apache.org/jira/browse/MESOS-4306
> Project: Mesos
>  Issue Type: Task
>Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  In the case where an Agent hosting significant data 
> (multi terabyte) the framework may be willing to wait a significant amount of 
> time before repairing its replication factor (for example).  Explicit human 
> provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.





[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088571#comment-15088571
 ] 

Joseph Wu commented on MESOS-4306:
--

For random outages, the {{/maintenance/status}} output won't change, since only 
the operator can trigger these state transitions.

When the framework goes to check the machine's status, the machine will either:
# Not show up, if it hasn't been scheduled for maintenance
# Show up as {{DRAINING}}, if it has been scheduled for maintenance, but not 
taken down by the operator yet.

> AGENT_DEAD Message
> --
>
> Key: MESOS-4306
> URL: https://issues.apache.org/jira/browse/MESOS-4306
> Project: Mesos
>  Issue Type: Task
>Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  In the case where an Agent hosting significant data 
> (multi terabyte) the framework may be willing to wait a significant amount of 
> time before repairing its replication factor (for example).  Explicit human 
> provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.





[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088014#comment-15088014
 ] 

Joseph Wu commented on MESOS-4306:
--

I don't think you need another message in this case.

With maintenance (in 0.25), an operator can set an unavailability period of 
infinity to denote the same semantics as {{AGENT_DEAD}} (or rather, 
{{AGENT_TO_BE_KILLED}}?).  The framework would be notified of this in advance 
via inverse offers.

When the agent actually gets terminated (by the operator), the framework will 
see a {{SLAVE_LOST}} (in HTTP API-land, {{Event::FAILURE}}).

Would it help to add maintenance info to {{Event::FAILURE}} too?  i.e. In case 
a machine is taken down before any inverse offers get sent.

> AGENT_DEAD Message
> --
>
> Key: MESOS-4306
> URL: https://issues.apache.org/jira/browse/MESOS-4306
> Project: Mesos
>  Issue Type: Task
>Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  In the case where an Agent hosting significant data 
> (multi terabyte) the framework may be willing to wait a significant amount of 
> time before repairing its replication factor (for example).  Explicit human 
> provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.





[jira] [Updated] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.

2016-01-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4302:
-
Assignee: Alexander Rukletsov  (was: Guangya Liu)

> Offer filter timeouts are ignored if the allocator is slow or backlogged.
> -
>
> Key: MESOS-4302
> URL: https://issues.apache.org/jira/browse/MESOS-4302
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Alexander Rukletsov
>Priority: Critical
>  Labels: mesosphere
>
> Currently, when the allocator recovers resources from an offer, it creates a 
> filter timeout based on the time at which the call is processed.
> This means that if it takes longer than the filter duration for the allocator 
> to perform an allocation for the relevant agent, then the filter is never 
> applied.
> This leads to pathological behavior: if the framework sets a filter duration 
> that is smaller than the wall clock time it takes for us to perform the next 
> allocation, then the filters will have no effect. This can mean that low 
> share frameworks may continue receiving offers that they have no intent to 
> use, without other frameworks ever receiving these offers.
> The workaround for this is for frameworks to set high filter durations, and 
> possibly reviving offers when they need more resources, however, we should 
> fix this issue in the allocator. (i.e. derive the timeout deadlines and 
> expiry based on allocation times).
> This seems to warrant cherry-picking into bug fix releases.





[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088268#comment-15088268
 ] 

Joseph Wu commented on MESOS-4306:
--

Yes, this is possible.  All agents which are taken down for maintenance are 
effectively blacklisted.  If they attempt to register, they will be told to 
shut down. 

As long as the framework has access to the maintenance endpoints, it can call
{code}
GET /master/maintenance/status
{code}

This will contain a list of machines that are {{DOWN}} (temporarily or 
permanently).
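
As a rough sketch, the response shape (abridged from {{maintenance.proto}}'s 
{{ClusterStatus}}; host and values are placeholders) looks like:
{code}
$ curl http://master:5050/master/maintenance/status
{
  "draining_machines": [{"id": {"hostname": "agent1"}}],
  "down_machines": [{"hostname": "agent2", "ip": "10.0.0.2"}]
}
{code}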

> AGENT_DEAD Message
> --
>
> Key: MESOS-4306
> URL: https://issues.apache.org/jira/browse/MESOS-4306
> Project: Mesos
>  Issue Type: Task
>Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  In the case where an Agent hosting significant data 
> (multi terabyte) the framework may be willing to wait a significant amount of 
> time before repairing its replication factor (for example).  Explicit human 
> provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.





[jira] [Created] (MESOS-4300) Add AuthN and AuthZ to maintenance endpoints.

2016-01-06 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4300:


 Summary: Add AuthN and AuthZ to maintenance endpoints.
 Key: MESOS-4300
 URL: https://issues.apache.org/jira/browse/MESOS-4300
 Project: Mesos
  Issue Type: Task
  Components: master, security
Affects Versions: 0.25.0
Reporter: Joseph Wu


Maintenance endpoints are currently only restricted by firewall settings.  They 
should also support authentication/authorization like other HTTP endpoints.
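
For illustration, the end state would mirror other authenticated endpoints, 
roughly like this (HTTP basic credentials; the principal/secret and 
schedule.json are placeholders):
{code}
$ curl -u "principal:secret" \
    -X POST http://master:5050/master/maintenance/schedule \
    -H "Content-Type: application/json" \
    -d @schedule.json
{code}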





[jira] [Updated] (MESOS-4300) Add AuthN and AuthZ to maintenance endpoints.

2016-01-06 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4300:
-
Labels: authentication authorization maintenance mesosphere  (was: )

> Add AuthN and AuthZ to maintenance endpoints.
> -
>
> Key: MESOS-4300
> URL: https://issues.apache.org/jira/browse/MESOS-4300
> Project: Mesos
>  Issue Type: Task
>  Components: master, security
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>  Labels: authentication, authorization, maintenance, mesosphere
>
> Maintenance endpoints are currently only restricted by firewall settings.  
> They should also support authentication/authorization like other HTTP 
> endpoints.





[jira] [Updated] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-06 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4301:
-
Description: 
Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
this in the master logs:
{code}
W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers '[ 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
{code}

Inverse offers should not trigger this warning.

  was:
Whenever a scheduler accepts an inverse offer, we will a line like this in the 
master logs:
{code}
W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers '[ 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
{code}

Inverse offers should not trigger this warning.


> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Created] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-06 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4301:


 Summary: Accepting an inverse offer prints misleading logs
 Key: MESOS-4301
 URL: https://issues.apache.org/jira/browse/MESOS-4301
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.25.0
Reporter: Joseph Wu
Assignee: Joseph Wu


Whenever a scheduler accepts an inverse offer, we will a line like this in the 
master logs:
{code}
W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers '[ 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
{code}

Inverse offers should not trigger this warning.





[jira] [Assigned] (MESOS-4059) Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters

2016-01-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-4059:


Assignee: Joseph Wu

> Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters
> -
>
> Key: MESOS-4059
> URL: https://issues.apache.org/jira/browse/MESOS-4059
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Per comments in MESOS-3916, the fix for that issue decreased the degree of 
> flakiness, but it seems that some intermittent test failures do occur -- 
> should be investigated.
> *Flakiness in task acknowledgment*
> {code}
> I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
> update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
> update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
> expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING 
> (UUID: 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
> acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
> {code}
> This is a race between [launching and acknowledging two 
> tasks|https://github.com/apache/mesos/blob/75cb89fa961b249c9ab7fa0f45dfa9d415a5/src/tests/master_maintenance_tests.cpp#L1486-L1517].
>   The status updates for each task are not necessarily received in the same 
> order in which the tasks were launched.
> *Flakiness in first inverse offer filter*
> See [this comment in 
> MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
>  for the explanation.  The related logs are above the comment.





[jira] [Updated] (MESOS-4059) Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters

2016-01-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4059:
-
  Sprint: Mesosphere Sprint 26
Story Points: 1

> Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters
> -
>
> Key: MESOS-4059
> URL: https://issues.apache.org/jira/browse/MESOS-4059
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Per comments in MESOS-3916, the fix for that issue decreased the degree of 
> flakiness, but it seems that some intermittent test failures do occur -- 
> should be investigated.
> *Flakiness in task acknowledgment*
> {code}
> I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
> update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
> update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
> expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING 
> (UUID: 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
> acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
> {code}
> This is a race between [launching and acknowledging two 
> tasks|https://github.com/apache/mesos/blob/75cb89fa961b249c9ab7fa0f45dfa9d415a5/src/tests/master_maintenance_tests.cpp#L1486-L1517].
>   The status updates for each task are not necessarily received in the same 
> order in which the tasks were launched.
> *Flakiness in first inverse offer filter*
> See [this comment in 
> MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
>  for the explanation.  The related logs are above the comment.





[jira] [Updated] (MESOS-4150) Implement container logger module metadata recovery

2016-01-04 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4150:
-
Sprint: Mesosphere Sprint 26

> Implement container logger module metadata recovery
> ---
>
> Key: MESOS-4150
> URL: https://issues.apache.org/jira/browse/MESOS-4150
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> The {{ContainerLoggers}} are intended to be isolated from agent failover, in 
> the same way that executors do not crash when the agent process crashes.
> For default {{ContainerLogger}} s, like the {{SandboxContainerLogger}} and 
> the (tentatively named) {{TruncatingSandboxContainerLogger}}, the log files 
> are exposed during agent recovery regardless.
> For non-default {{ContainerLogger}} s, the recovery of executor metadata may 
> be necessary to rebuild endpoints that expose the logs.  This can be 
> implemented as part of {{Containerizer::recover}}.





[jira] [Updated] (MESOS-4206) Write new log-related documentation

2016-01-04 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4206:
-
Sprint: Mesosphere Sprint 26

> Write new log-related documentation
> ---
>
> Key: MESOS-4206
> URL: https://issues.apache.org/jira/browse/MESOS-4206
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Joseph Wu
>  Labels: documentation, logging, mesosphere
>
> This should include:
> * Default logging behavior for master, agent, framework, executor, task.
> * Master/agent:
> ** A summary of log-related flags.
> ** {{glog}} specific options.
> * Separation of master/agent logs from container logs.
> * The {{ContainerLogger}} module.





[jira] [Commented] (MESOS-4287) Extract stout

2016-01-04 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081851#comment-15081851
 ] 

Joseph Wu commented on MESOS-4287:
--

Perhaps you're looking for this:
https://github.com/3rdparty/stout

> Extract stout
> -
>
> Key: MESOS-4287
> URL: https://issues.apache.org/jira/browse/MESOS-4287
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Axel Etcheverry
>Priority: Minor
>
> Is it possible to extract the stout library?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4088) Modularize existing plain-file logging for executor/task logs launched with the Mesos Containerizer

2015-12-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049749#comment-15049749
 ] 

Joseph Wu edited comment on MESOS-4088 at 12/21/15 7:38 PM:


|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Disallow {{ContainerLogger}} + 
{{ExternalContainerizer}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ContainerLogger}} in {{MesosContainerizer}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |


was (Author: kaysoky):
|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Disallow {{ContainerLogger}} + 
{{ExternalContainerizer}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ContainerLogger}} in {{MesosContainerizer}} |
| https://reviews.apache.org/r/41168/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |

> Modularize existing plain-file logging for executor/task logs launched with 
> the Mesos Containerizer
> ---
>
> Key: MESOS-4088
> URL: https://issues.apache.org/jira/browse/MESOS-4088
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Once a module for executor/task output logging has been introduced, the 
> default module will mirror the existing behavior.  Executor/task 
> stdout/stderr is piped into files within the executor's sandbox directory.
> The files are exposed in the web UI, via the {{/files}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4234) Add tests for running Docker in Docker

2015-12-21 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4234:


 Summary: Add tests for running Docker in Docker
 Key: MESOS-4234
 URL: https://issues.apache.org/jira/browse/MESOS-4234
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, docker, test
Reporter: Joseph Wu


When the Mesos agent is itself running in a Docker container, the Docker 
containerizer will spawn executors as additional containers (rather than as 
subprocesses).  This prevents executors from dying if the agent dies.

There are currently no automated tests for this code path, largely because 
the test setup is expensive.  A test would need to:
* Create a Docker image.
* Compile Mesos inside the Docker image.
* Spin up agents inside Docker containers.
* Clean up properly.

These tests are currently done manually 
([example|https://reviews.apache.org/r/41560/]).  Similar tests should be 
codified and added as regression tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067252#comment-15067252
 ] 

Joseph Wu commented on MESOS-4137:
--

Change how we pipe output from Docker, which also removes the necessity of 
loading a {{ContainerLogger}} into the {{mesos-docker-executor}}:
https://reviews.apache.org/r/41560/

> Modularize plain-file logging for executor/task logs launched with the Docker 
> Containerizer
> ---
>
> Key: MESOS-4137
> URL: https://issues.apache.org/jira/browse/MESOS-4137
> Project: Mesos
>  Issue Type: Task
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Adding a hook inside the Docker containerizer is slightly more involved than 
> the Mesos containerizer.
> Docker executors/tasks perform plain-file logging in different places 
> depending on whether the agent is in a Docker container itself
> || Agent || Code ||
> | Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
> | In container | {{Docker::run}} in a {{mesos-docker-executor}} process |
> This means a {{ContainerLogger}} will need to be loaded or hooked into the 
> {{mesos-docker-executor}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057100#comment-15057100
 ] 

Joseph Wu edited comment on MESOS-4137 at 12/22/15 1:36 AM:


|| Reviews || Summary ||
| https://reviews.apache.org/r/41560/ | Change piping inside 
{{mesos-docker-executor}} |
| https://reviews.apache.org/r/41294/ | Add {{ContainerLogger}} to 
{{DockerContainerizer}} |
| https://reviews.apache.org/r/41370/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41378/ | Update {{DockerContainerizer}} tests |


was (Author: kaysoky):
|| Reviews || Summary ||
| https://reviews.apache.org/r/41294/ | Add {{ContainerLogger}} to 
{{DockerContainerizer}} |
| https://reviews.apache.org/r/41369/ | Add {{ContainerLogger}} to 
{{mesos-docker-executor}} |
| https://reviews.apache.org/r/41370/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41378/ | Update {{DockerContainerizer}} tests |

> Modularize plain-file logging for executor/task logs launched with the Docker 
> Containerizer
> ---
>
> Key: MESOS-4137
> URL: https://issues.apache.org/jira/browse/MESOS-4137
> Project: Mesos
>  Issue Type: Task
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Adding a hook inside the Docker containerizer is slightly more involved than 
> the Mesos containerizer.
> Docker executors/tasks perform plain-file logging in different places 
> depending on whether the agent is in a Docker container itself
> || Agent || Code ||
> | Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
> | In container | {{Docker::run}} in a {{mesos-docker-executor}} process |
> This means a {{ContainerLogger}} will need to be loaded or hooked into the 
> {{mesos-docker-executor}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-21 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4137:
-
Description: 
Adding a hook inside the Docker containerizer is slightly more involved than 
the Mesos containerizer.

Docker executors/tasks perform plain-file logging in different places depending 
on whether the agent is in a Docker container itself
|| Agent || Code ||
| Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
| In container | {{Docker::run}} in a {{mesos-docker-executor}} process |

This means a {{ContainerLogger}} will need to be loaded or hooked into the 
{{mesos-docker-executor}}.  Or we will need to change how piping is done in 
{{mesos-docker-executor}}.

  was:
Adding a hook inside the Docker containerizer is slightly more involved than 
the Mesos containerizer.

Docker executors/tasks perform plain-file logging in different places depending 
on whether the agent is in a Docker container itself
|| Agent || Code ||
| Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
| In container | {{Docker::run}} in a {{mesos-docker-executor}} process |

This means a {{ContainerLogger}} will need to be loaded or hooked into the 
{{mesos-docker-executor}}.


> Modularize plain-file logging for executor/task logs launched with the Docker 
> Containerizer
> ---
>
> Key: MESOS-4137
> URL: https://issues.apache.org/jira/browse/MESOS-4137
> Project: Mesos
>  Issue Type: Task
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Adding a hook inside the Docker containerizer is slightly more involved than 
> the Mesos containerizer.
> Docker executors/tasks perform plain-file logging in different places 
> depending on whether the agent is in a Docker container itself
> || Agent || Code ||
> | Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
> | In container | {{Docker::run}} in a {{mesos-docker-executor}} process |
> This means a {{ContainerLogger}} will need to be loaded or hooked into the 
> {{mesos-docker-executor}}.  Or we will need to change how piping is done in 
> {{mesos-docker-executor}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-21 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4137:
-
Comment: was deleted

(was: Change how we pipe output from Docker, which also removes the necessity 
of loading a {{ContainerLogger}} into the {{mesos-docker-executor}}:
https://reviews.apache.org/r/41560/)

> Modularize plain-file logging for executor/task logs launched with the Docker 
> Containerizer
> ---
>
> Key: MESOS-4137
> URL: https://issues.apache.org/jira/browse/MESOS-4137
> Project: Mesos
>  Issue Type: Task
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Adding a hook inside the Docker containerizer is slightly more involved than 
> the Mesos containerizer.
> Docker executors/tasks perform plain-file logging in different places 
> depending on whether the agent is in a Docker container itself
> || Agent || Code ||
> | Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
> | In container | {{Docker::run}} in a {{mesos-docker-executor}} process |
> This means a {{ContainerLogger}} will need to be loaded or hooked into the 
> {{mesos-docker-executor}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4216) post-reviews.py should support multiple git worktrees

2015-12-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066729#comment-15066729
 ] 

Joseph Wu commented on MESOS-4216:
--

[~klueska], would these changes ([MESOS-4125]) fix this?  

> post-reviews.py should support multiple git worktrees
> -
>
> Key: MESOS-4216
> URL: https://issues.apache.org/jira/browse/MESOS-4216
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Reporter: Shuai Lin
>Priority: Trivial
>
> Git 2.5 adds a new feature, "multiple worktrees": one can check out 
> multiple worktrees of the same local git repo, which share the same .git 
> directory.
> For example, the following command:
> {code} 
> git worktree add -b new-branch ../mesos-new-branch master 
> {code}
> would create a new folder {{mesos-new-branch}} in the parent folder of the 
> mesos source tree.
> See [this github 
> blog|https://github.com/blog/2042-git-2-5-including-multiple-worktrees-and-triangular-workflows]
>  for details.
> This feature is quite handy when developing mesos: you can avoid re-compiling 
> mesos (which costs a lot of time) when you need to temporarily switch to 
> another branch and switch back soon after: you just create a worktree and do 
> your work there.
> Currently the {{post-reviews.py}} script doesn't work well when not run in 
> the default worktree, because it looks for the {{.git}} folder. In a 
> non-default worktree, {{.git}} is a file whose content points to the real 
> {{.git}} dir.
> {code} 
> $ cd ~/dev/mesos-new-branch 
> $ cat .git 
> gitdir: ~/dev/mesos/.git/worktrees/mesos-new-branch 
> {code}
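
A fix would resolve the real git directory instead of assuming {{.git}} is a 
folder.  A minimal sketch of the resolution logic, in C++ for illustration 
only (the script itself is Python, and {{resolveGitDir}} is a hypothetical 
name):

{code}
#include <fstream>
#include <string>

#include <sys/stat.h>

// Sketch of the resolution logic (post-reviews.py is Python; C++ here
// just illustrates the algorithm).  In a linked worktree, `.git` is a
// regular file containing a single line such as:
//   gitdir: /home/user/dev/mesos/.git/worktrees/mesos-new-branch
std::string resolveGitDir(const std::string& checkout)
{
  const std::string dotGit = checkout + "/.git";

  struct stat s;
  if (::stat(dotGit.c_str(), &s) == 0 && S_ISDIR(s.st_mode)) {
    // Default worktree: `.git` is the directory itself.
    return dotGit;
  }

  std::ifstream file(dotGit.c_str());
  std::string line;
  const std::string prefix = "gitdir: ";

  if (std::getline(file, line) &&
      line.compare(0, prefix.size(), prefix) == 0) {
    // Linked worktree: follow the pointer to the real directory.
    return line.substr(prefix.size());
  }

  // Not a git checkout; let the caller report the error.
  return dotGit;
}
{code}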



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4150) Implement container logger module metadata recovery

2015-12-20 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4150:
-
Sprint:   (was: Mesosphere Sprint 24)

> Implement container logger module metadata recovery
> ---
>
> Key: MESOS-4150
> URL: https://issues.apache.org/jira/browse/MESOS-4150
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> The {{ContainerLoggers}} are intended to be isolated from agent failover, in 
> the same way that executors do not crash when the agent process crashes.
> For default {{ContainerLogger}} s, like the {{SandboxContainerLogger}} and 
> the (tentatively named) {{TruncatingSandboxContainerLogger}}, the log files 
> are exposed during agent recovery regardless.
> For non-default {{ContainerLogger}} s, the recovery of executor metadata may 
> be necessary to rebuild endpoints that expose the logs.  This can be 
> implemented as part of {{Containerizer::recover}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4206) Write new log-related documentation

2015-12-18 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4206:
-
Shepherd: Benjamin Hindman
Story Points: 3
 Description: 
This should include:
* Default logging behavior for master, agent, framework, executor, task.
* Master/agent:
** A summary of log-related flags.
** {{glog}} specific options.
* Separation of master/agent logs from container logs.
* The {{ContainerLogger}} module.

> Write new log-related documentation
> ---
>
> Key: MESOS-4206
> URL: https://issues.apache.org/jira/browse/MESOS-4206
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Joseph Wu
>  Labels: documentation, logging, mesosphere
>
> This should include:
> * Default logging behavior for master, agent, framework, executor, task.
> * Master/agent:
> ** A summary of log-related flags.
> ** {{glog}} specific options.
> * Separation of master/agent logs from container logs.
> * The {{ContainerLogger}} module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4203) Document that disk resource limits are not enforced by default

2015-12-18 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064986#comment-15064986
 ] 

Joseph Wu commented on MESOS-4203:
--

Note, this default is also mentioned here:
https://github.com/apache/mesos/blob/master/docs/sandbox.md#sandbox-size

> Document that disk resource limits are not enforced by default
> --
>
> Key: MESOS-4203
> URL: https://issues.apache.org/jira/browse/MESOS-4203
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation, isolation
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>  Labels: isolation, mesosphere, persistent-volumes
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4109) HTTPConnectionTest.ClosingResponse is flaky

2015-12-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4109:
-
Shepherd: Benjamin Mahler

> HTTPConnectionTest.ClosingResponse is flaky
> ---
>
> Key: MESOS-4109
> URL: https://issues.apache.org/jira/browse/MESOS-4109
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Affects Versions: 0.26.0
> Environment: ASF Ubuntu 14 
> {{--enable-ssl --enable-libevent}}
>Reporter: Joseph Wu
>Assignee: Benjamin Mahler
>Priority: Minor
>  Labels: flaky, flaky-test, mesosphere, newbie, test
> Fix For: 0.27.0
>
>
> Output of the test:
> {code}
> [ RUN  ] HTTPConnectionTest.ClosingResponse
> I1210 01:20:27.048532 26671 process.cpp:3077] Handling HTTP event for process 
> '(22)' with path: '/(22)/get'
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:919: Failure
> Actual function call count doesn't match EXPECT_CALL(*http.process, get(_))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] HTTPConnectionTest.ClosingResponse (43 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4194) MesosContainerizer* tests leak FDs (pipes)

2015-12-17 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4194:


 Summary: MesosContainerizer* tests leak FDs (pipes)
 Key: MESOS-4194
 URL: https://issues.apache.org/jira/browse/MESOS-4194
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.27.0
 Environment: OSX + clang
Reporter: Joseph Wu
Assignee: Jojy Varghese


If you run:
{{bin/mesos-tests.sh --gtest_filter="*MesosContainerizer*" --gtest_repeat=-1 
--gtest_break_on_failure}}

And then check:
{{lsof | grep mesos}}

The number of open pipes will grow linearly with the number of test repetitions.
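
One way to turn the manual {{lsof}} check into an automated regression guard 
is to count this process's open descriptors before and after an iteration.  A 
POSIX-only sketch; {{countOpenFds}} is a hypothetical helper, not an existing 
test utility:

{code}
#include <fcntl.h>
#include <unistd.h>

#include <iostream>

// Count the file descriptors currently open in this process by
// probing every possible fd (POSIX only).
int countOpenFds()
{
  int count = 0;
  const int max = ::getdtablesize();

  for (int fd = 0; fd < max; fd++) {
    if (::fcntl(fd, F_GETFD) != -1) {
      count++;
    }
  }

  return count;
}

int main()
{
  const int before = countOpenFds();

  // ... run one MesosContainerizer test iteration here ...

  const int after = countOpenFds();
  std::cout << "leaked fds: " << (after - before) << std::endl;

  return 0;
}
{code}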



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3153) Add tests for HTTPS SSL socket communication

2015-12-16 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060215#comment-15060215
 ] 

Joseph Wu commented on MESOS-3153:
--

We probably want some of these tests (in libprocess) at some point.  But this 
may be blocked by [MESOS-3820] (unless you want to re-write the client/server 
logic in the tests).

> Add tests for HTTPS SSL socket communication
> 
>
> Key: MESOS-3153
> URL: https://issues.apache.org/jira/browse/MESOS-3153
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jojy Varghese
>Assignee: Jojy Varghese
>Priority: Minor
>  Labels: mesosphere
>
> Unit tests are lacking for the following cases:
> 1. HTTPS Post with "None" payload. 
> 2. Verification of HTTPS payload on the SSL socket(maybe decode to a Request 
> object)
> 3. http -> ssl socket
> 4. https -> raw socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057100#comment-15057100
 ] 

Joseph Wu edited comment on MESOS-4137 at 12/15/15 5:22 PM:


|| Reviews || Summary ||
| https://reviews.apache.org/r/41294/ | Add {{ContainerLogger}} to 
{{DockerContainerizer}} |
| https://reviews.apache.org/r/41369/ | Add {{ContainerLogger}} to 
{{mesos-docker-executor}} |
| https://reviews.apache.org/r/41370/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41378/ | Update {{DockerContainerizer}} tests |


was (Author: kaysoky):
|| Reviews || Summary ||
| https://reviews.apache.org/r/41294/ | Add {{ContainerLogger}} to 
{{DockerContainerizer}} |
| https://reviews.apache.org/r/41369/ | Add {{ContainerLogger}} to 
{{mesos-docker-executor}} |
| https://reviews.apache.org/r/41370/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41378/ | Update {{DockerContainerizer}} tests |
| https://reviews.apache.org/r/41386/ | Add regression test |

> Modularize plain-file logging for executor/task logs launched with the Docker 
> Containerizer
> ---
>
> Key: MESOS-4137
> URL: https://issues.apache.org/jira/browse/MESOS-4137
> Project: Mesos
>  Issue Type: Task
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Adding a hook inside the Docker containerizer is slightly more involved than 
> the Mesos containerizer.
> Docker executors/tasks perform plain-file logging in different places 
> depending on whether the agent is in a Docker container itself
> || Agent || Code ||
> | Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
> | In container | {{Docker::run}} in a {{mesos-docker-executor}} process |
> This means a {{ContainerLogger}} will need to be loaded or hooked into the 
> {{mesos-docker-executor}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4088) Modularize existing plain-file logging for executor/task logs launched with the Mesos Containerizer

2015-12-14 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049749#comment-15049749
 ] 

Joseph Wu edited comment on MESOS-4088 at 12/15/15 1:01 AM:


|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Disallow {{ContainerLogger}} + 
{{ExternalContainerizer}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ContainerLogger}} in {{MesosContainerizer}} |
| https://reviews.apache.org/r/41168/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |


was (Author: kaysoky):
|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Add {{ContainerLogger}} to 
{{Containerizer::Create}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ContainerLogger}} in {{MesosContainerizer::_launch}} |
| https://reviews.apache.org/r/41168/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |

> Modularize existing plain-file logging for executor/task logs launched with 
> the Mesos Containerizer
> ---
>
> Key: MESOS-4088
> URL: https://issues.apache.org/jira/browse/MESOS-4088
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Once a module for executor/task output logging has been introduced, the 
> default module will mirror the existing behavior.  Executor/task 
> stdout/stderr is piped into files within the executor's sandbox directory.
> The files are exposed in the web UI, via the {{/files}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-14 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057100#comment-15057100
 ] 

Joseph Wu commented on MESOS-4137:
--

|| Reviews || Summary ||
| https://reviews.apache.org/r/41294/ | Add {{ContainerLogger}} to 
{{DockerContainerizer}} |
| https://reviews.apache.org/r/41369/ | Add {{ContainerLogger}} to 
{{mesos-docker-executor}} |
| https://reviews.apache.org/r/41370/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41378/ | Update {{DockerContainerizer}} tests |

> Modularize plain-file logging for executor/task logs launched with the Docker 
> Containerizer
> ---
>
> Key: MESOS-4137
> URL: https://issues.apache.org/jira/browse/MESOS-4137
> Project: Mesos
>  Issue Type: Task
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Adding a hook inside the Docker containerizer is slightly more involved than 
> the Mesos containerizer.
> Docker executors/tasks perform plain-file logging in different places 
> depending on whether the agent is in a Docker container itself
> || Agent || Code ||
> | Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
> | In container | {{Docker::run}} in a {{mesos-docker-executor}} process |
> This means a {{ContainerLogger}} will need to be loaded or hooked into the 
> {{mesos-docker-executor}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4150) Implement container logger module metadata recovery

2015-12-14 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4150:


 Summary: Implement container logger module metadata recovery
 Key: MESOS-4150
 URL: https://issues.apache.org/jira/browse/MESOS-4150
 Project: Mesos
  Issue Type: Task
  Components: modules
Reporter: Joseph Wu
Assignee: Joseph Wu


The {{ContainerLoggers}} are intended to be isolated from agent failover, in 
the same way that executors do not crash when the agent process crashes.

For default {{ContainerLogger}} s, like the {{SandboxContainerLogger}} and the 
(tentatively named) {{TruncatingSandboxContainerLogger}}, the log files are 
exposed during agent recovery regardless.

For non-default {{ContainerLogger}} s, the recovery of executor metadata may be 
necessary to rebuild endpoints that expose the logs.  This can be implemented 
as part of {{Containerizer::recover}}.
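
As a rough illustration of that idea (hypothetical names throughout, not a 
committed API), {{Containerizer::recover}} could hand each checkpointed 
executor back to the logger and only finish once every logger has re-attached:

{code}
#include <list>
#include <string>

#include <mesos/mesos.hpp>

#include <process/collect.hpp>
#include <process/future.hpp>

#include <stout/foreach.hpp>
#include <stout/nothing.hpp>

// Hypothetical logger interface; only the recovery hook is shown.
class ContainerLogger
{
public:
  virtual ~ContainerLogger() {}

  // Re-attach to a checkpointed executor's logs after agent failover.
  virtual process::Future<Nothing> recover(
      const mesos::ExecutorInfo& executorInfo,
      const std::string& sandboxDirectory) = 0;
};

// What Containerizer::recover can reconstruct for each executor.
struct RecoveredExecutor
{
  mesos::ExecutorInfo info;
  std::string sandboxDirectory;
};

// Hand every checkpointed executor back to the logger so it can
// rebuild any module-specific log-serving endpoints.
process::Future<Nothing> recoverLoggers(
    ContainerLogger* logger,
    const std::list<RecoveredExecutor>& executors)
{
  std::list<process::Future<Nothing>> recoveries;

  foreach (const RecoveredExecutor& executor, executors) {
    recoveries.push_back(
        logger->recover(executor.info, executor.sandboxDirectory));
  }

  // Agent recovery completes only after every logger re-attaches.
  return process::collect(recoveries)
    .then([](const std::list<Nothing>&) { return Nothing(); });
}
{code}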



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4088) Modularize existing plain-file logging for executor/task logs launched with the Mesos Containerizer

2015-12-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049749#comment-15049749
 ] 

Joseph Wu edited comment on MESOS-4088 at 12/12/15 12:04 AM:
-

|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Add {{ContainerLogger}} to 
{{Containerizer::Create}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ContainerLogger}} in {{MesosContainerizer::_launch}} |
| https://reviews.apache.org/r/41168/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |


was (Author: kaysoky):
|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Add {{ExecutorLogger}} to 
{{Containerizer::Create}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ExecutorLogger}} in {{MesosContainerizer::_launch}} |
| https://reviews.apache.org/r/41168/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |

> Modularize existing plain-file logging for executor/task logs launched with 
> the Mesos Containerizer
> ---
>
> Key: MESOS-4088
> URL: https://issues.apache.org/jira/browse/MESOS-4088
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Once a module for executor/task output logging has been introduced, the 
> default module will mirror the existing behavior.  Executor/task 
> stdout/stderr is piped into files within the executor's sandbox directory.
> The files are exposed in the web UI, via the {{/files}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4137) Modularize plain-file logging for executor/task logs launched with the Docker Containerizer

2015-12-11 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4137:


 Summary: Modularize plain-file logging for executor/task logs 
launched with the Docker Containerizer
 Key: MESOS-4137
 URL: https://issues.apache.org/jira/browse/MESOS-4137
 Project: Mesos
  Issue Type: Task
  Components: docker, modules
Reporter: Joseph Wu
Assignee: Joseph Wu


Adding a hook inside the Docker containerizer is slightly more involved than 
the Mesos containerizer.

Docker executors/tasks perform plain-file logging in different places depending 
on whether the agent is in a Docker container itself
|| Agent || Code ||
| Not in container | {{DockerContainerizerProcess::launchExecutorProcess}} |
| In container | {{Docker::run}} in a {{mesos-docker-executor}} process |

This means a {{ContainerLogger}} will need to be loaded or hooked into the 
{{mesos-docker-executor}}.
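
To make the second option concrete, here is a hedged sketch of the changed 
piping: the agent-side containerizer resolves the logger's output 
destinations *before* forking {{mesos-docker-executor}}, so the executor 
process never needs to load the module itself.  {{ContainerIO}} and the 
overall shape are assumptions, not the final design:

{code}
#include <unistd.h>

#include <string>

#include <process/subprocess.hpp>

#include <stout/try.hpp>

// Hypothetical pair of output destinations chosen by the logger.
struct ContainerIO
{
  process::Subprocess::IO out;
  process::Subprocess::IO err;
};

// The agent resolves `io` via the logger *before* forking, so the
// logger-chosen FDs (files, pipes, ...) are simply inherited by
// mesos-docker-executor and, transitively, by `docker run`.
Try<process::Subprocess> launchExecutorProcess(
    const std::string& command,
    const ContainerIO& io)
{
  return process::subprocess(
      command,
      process::Subprocess::FD(STDIN_FILENO),
      io.out,
      io.err);
}
{code}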



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4026) RegistryClientTest.SimpleRegistryPuller is flaky

2015-12-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053137#comment-15053137
 ] 

Joseph Wu commented on MESOS-4026:
--

Partially related: on some systems, the test will fail after 200 or so 
iterations due to too many open FDs:
https://reviews.apache.org/r/41234/

> RegistryClientTest.SimpleRegistryPuller is flaky
> 
>
> Key: MESOS-4026
> URL: https://issues.apache.org/jira/browse/MESOS-4026
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Jojy Varghese
>  Labels: containerizer, flaky-test, mesosphere
>
> From ASF CI:
> https://builds.apache.org/job/Mesos/1289/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/console
> {code}
> [ RUN  ] RegistryClientTest.SimpleRegistryPuller
> I1127 02:51:40.235900   362 registry_client.cpp:511] Response status for url 
> 'https://localhost:57828/v2/library/busybox/manifests/latest': 401 
> Unauthorized
> I1127 02:51:40.249766   360 registry_client.cpp:511] Response status for url 
> 'https://localhost:57828/v2/library/busybox/manifests/latest': 200 OK
> I1127 02:51:40.251137   361 registry_puller.cpp:195] Downloading layer 
> '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' for image 
> 'busybox:latest'
> I1127 02:51:40.258514   354 registry_client.cpp:511] Response status for url 
> 'https://localhost:57828/v2/library/busybox/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4':
>  307 Temporary Redirect
> I1127 02:51:40.264171   367 libevent_ssl_socket.cpp:1023] Socket error: 
> Connection reset by peer
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:1210: Failure
> (socket).failure(): Failed accept: connection error: Connection reset by peer
> [  FAILED  ] RegistryClientTest.SimpleRegistryPuller (349 ms)
> {code}
> Logs from a previous run that passed:
> {code}
> [ RUN  ] RegistryClientTest.SimpleRegistryPuller
> I1126 18:49:05.306396   349 registry_client.cpp:511] Response status for url 
> 'https://localhost:53492/v2/library/busybox/manifests/latest': 401 
> Unauthorized
> I1126 18:49:05.321362   347 registry_client.cpp:511] Response status for url 
> 'https://localhost:53492/v2/library/busybox/manifests/latest': 200 OK
> I1126 18:49:05.322720   352 registry_puller.cpp:195] Downloading layer 
> '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' for image 
> 'busybox:latest'
> I1126 18:49:05.331317   350 registry_client.cpp:511] Response status for url 
> 'https://localhost:53492/v2/library/busybox/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4':
>  307 Temporary Redirect
> I1126 18:49:05.370625   352 registry_client.cpp:511] Response status for url 
> 'https://127.0.0.1:53492/': 200 OK
> I1126 18:49:05.372102   355 registry_puller.cpp:294] Untarring layer 
> '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' downloaded 
> from registry to directory 'output_dir'
> [   OK ] RegistryClientTest.SimpleRegistryPuller (353 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3892) Add a helper function to the Agent to retrieve the list of executors that are using optimistically offered, revocable resources.

2015-12-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053148#comment-15053148
 ] 

Joseph Wu commented on MESOS-3892:
--

Yes, that's reasonable (and what we discussed in the [work group 
meeting|https://docs.google.com/document/d/1CKMelV6xD_HOsqwbqH3PM24P7ypS_G4oz_MDNxE85D8/edit#bookmark=id.xlfbqnql7ngq]).

Can you update the relevant JIRAs accordingly (rename, update descriptions, 
etc)?

> Add a helper function to the Agent to retrieve the list of executors that are 
> using optimistically offered, revocable resources.
> 
>
> Key: MESOS-3892
> URL: https://issues.apache.org/jira/browse/MESOS-3892
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> {noformat}
> class Slave {
>   ...
>   // How the master currently keeps track of executors.
>   hashmap<FrameworkID, hashmap<ExecutorID, ExecutorInfo>> executors;
>   ...
>   // Returns the list of executors that are using optimistically-
>   // offered, revocable resources.
>   list<ExecutorInfo> getEvictableExecutors() { ... }
>   ...
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4127) Ensure `Content-Type` field is set for some responses

2015-12-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4127:
-
Summary: Ensure `Content-Type` field is set for some responses  (was: 
Ensure `Conten-Type` field is set for some responses)

> Ensure `Content-Type` field is set for some responses
> -
>
> Key: MESOS-4127
> URL: https://issues.apache.org/jira/browse/MESOS-4127
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>  Labels: http, mesosphere, newbie++, tech-debt
>
> As pointed out by [~anandmazumdar] in https://reviews.apache.org/r/40905/, we 
> should make sure we set the {{Content-Type}} field for some responses.
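
For reference, a minimal sketch of setting the header explicitly on a 
libprocess response ({{handler}} and {{body}} are illustrative names):

{code}
#include <string>

#include <process/http.hpp>

// Explicitly set the Content-Type header on a libprocess response;
// `body` is whatever JSON the endpoint handler produced.
process::http::Response handler(const std::string& body)
{
  process::http::OK response(body);
  response.headers["Content-Type"] = "application/json";
  return response;
}
{code}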



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4109) HTTPConnectionTest.ClosingResponse is flaky

2015-12-09 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4109:


 Summary: HTTPConnectionTest.ClosingResponse is flaky
 Key: MESOS-4109
 URL: https://issues.apache.org/jira/browse/MESOS-4109
 Project: Mesos
  Issue Type: Bug
  Components: libprocess, test
Affects Versions: 0.26.0
 Environment: ASF Ubuntu 14 
{{--enable-ssl --enable-libevent}}
Reporter: Joseph Wu
Priority: Minor


Output of the test:
{code}
[ RUN  ] HTTPConnectionTest.ClosingResponse
I1210 01:20:27.048532 26671 process.cpp:3077] Handling HTTP event for process 
'(22)' with path: '/(22)/get'
../../../3rdparty/libprocess/src/tests/http_tests.cpp:919: Failure
Actual function call count doesn't match EXPECT_CALL(*http.process, get(_))...
 Expected: to be called twice
   Actual: called once - unsatisfied and active
[  FAILED  ] HTTPConnectionTest.ClosingResponse (43 ms)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4088) Modularize existing plain-file logging for executor/task logs

2015-12-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049749#comment-15049749
 ] 

Joseph Wu commented on MESOS-4088:
--

|| Reviews || Summary ||
| https://reviews.apache.org/r/41166/ | Add {{ExecutorLogger}} to 
{{Containerizer::Create}} |
| https://reviews.apache.org/r/41167/ | Initialize and call the 
{{ExecutorLogger}} in {{MesosContainerizer::_launch}} |
| https://reviews.apache.org/r/41168/ | Update {{MesosTest}} |
| https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests |

> Modularize existing plain-file logging for executor/task logs
> -
>
> Key: MESOS-4088
> URL: https://issues.apache.org/jira/browse/MESOS-4088
> Project: Mesos
>  Issue Type: Task
>  Components: modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Once a module for executor/task output logging has been introduced, the 
> default module will mirror the existing behavior.  Executor/task 
> stdout/stderr is piped into files within the executor's sandbox directory.
> The files are exposed in the web UI, via the {{/files}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4087) Introduce a module for logging executor/task output

2015-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15045985#comment-15045985
 ] 

Joseph Wu edited comment on MESOS-4087 at 12/8/15 11:23 PM:


Reviews:
|| Review || Summary||
| https://reviews.apache.org/r/41055/ 
  https://reviews.apache.org/r/41057/ | Refactoring |
| https://reviews.apache.org/r/41002/ | Module interface |
| https://reviews.apache.org/r/41003/ | Default module implementation |
| https://reviews.apache.org/r/41004/ | Modularification |
| https://reviews.apache.org/r/41061/ | New agent flags |
| https://reviews.apache.org/r/4/ | Regression test |


was (Author: kaysoky):
Reviews (WIP):
https://reviews.apache.org/r/41055/
https://reviews.apache.org/r/41057/
https://reviews.apache.org/r/41002/
https://reviews.apache.org/r/41003/
https://reviews.apache.org/r/41004/

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2782) Document the sandbox

2015-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047500#comment-15047500
 ] 

Joseph Wu commented on MESOS-2782:
--

The tests for sandbox expectations already exist:
* {{PathsTest.Executor}}
* {{GarbageCollectorIntegrationTest.ExitedExecutor}}
* {{GarbageCollectorIntegrationTest.DiskUsage}}
* {{SlaveRecoveryTest.GCExecutor}}
* Indirectly tested by {{FilesTest.*}} and {{FetcherTest.*}}

> Document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. From an 
> application- or framework-developer perspective, they need to know
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2782) Document the sandbox

2015-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047500#comment-15047500
 ] 

Joseph Wu edited comment on MESOS-2782 at 12/8/15 9:25 PM:
---

The tests for sandbox expectations already exist:
* {{PathsTest.Executor}}
* {{GarbageCollectorIntegrationTest.ExitedExecutor}}
* {{GarbageCollectorIntegrationTest.DiskUsage}}
* {{SlaveRecoveryTest.GCExecutor}}
* Indirectly tested by {{FilesTest.\*}} and {{FetcherTest.\*}}


was (Author: kaysoky):
The tests for sandbox expectations already exist:
* {{PathsTest.Executor}}
* {{GarbageCollectorIntegrationTest.ExitedExecutor}}
* {{GarbageCollectorIntegrationTest.DiskUsage}}
* {{SlaveRecoveryTest.GCExecutor}}
* Indirectly tested by {{FilesTest.*}} and {{FetcherTest.*}}

> Document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. From an 
> application- or framework-developer perspective, they need to know
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2782) document the sandbox

2015-12-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-2782:


Assignee: Joseph Wu

> document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. From an 
> application- or framework-developer perspective, they need to know
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2782) document the sandbox

2015-12-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-2782:
-
Sprint: Mesosphere Sprint 24

> document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. From an 
> application- or framework-developer perspective, they need to know
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4086) Containerizer logging modularization

2015-12-07 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4086:


 Summary: Containerizer logging modularization
 Key: MESOS-4086
 URL: https://issues.apache.org/jira/browse/MESOS-4086
 Project: Mesos
  Issue Type: Epic
  Components: containerization, modules
Reporter: Joseph Wu
Assignee: Joseph Wu


Executors and tasks are configured (via the various containerizers) to write 
their output (stdout/stderr) to files ("stdout" and "stderr") on an agent's 
disk.

Unlike Master/Agent logs, executor/task logs are not attached to any formal 
logging system, like {{glog}}.  As such, there is significant scope for 
improvement.

By introducing a module for logging, we can provide a common/programmatic way 
to access and manage executor/task logs.  Modules could implement additional 
sinks for logs, such as:
* to the sandbox (the status quo),
* to syslog,
* to journald

This would also provide the hooks to deal with logging-related problems, such 
as:
* the (current) lack of log rotation,
* searching through executor/task logs (i.e. via aggregation)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2153) Add support for systemd journal for logging

2015-12-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-2153:


Assignee: Joseph Wu

> Add support for systemd journal for logging
> ---
>
> Key: MESOS-2153
> URL: https://issues.apache.org/jira/browse/MESOS-2153
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Alexander Rukletsov
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: mesosphere
>
> We should be able to redirect master and slave logs to systemd journal on the 
> systems where it's available.
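
For reference, a minimal sketch of the journal's native C API, which a 
glog-to-journald bridge could call (requires linking against libsystemd; 
availability would have to be detected at configure time, since not every 
supported system runs systemd):

{code}
#include <syslog.h>

#include <systemd/sd-journal.h>

int main()
{
  // Structured logging straight into the journal; a glog sink could
  // forward each log line this way.
  sd_journal_print(LOG_INFO, "mesos-master: redirected log line");
  return 0;
}
{code}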



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4087) Introduce a module for logging executor/task output

2015-12-07 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4087:


 Summary: Introduce a module for logging executor/task output
 Key: MESOS-4087
 URL: https://issues.apache.org/jira/browse/MESOS-4087
 Project: Mesos
  Issue Type: Task
  Components: containerization, modules
Reporter: Joseph Wu
Assignee: Joseph Wu


Existing executor/task logs are logged to files in their sandbox directory, 
with some nuances based on which containerizer is used (see background section 
in linked document).

A logger for executor/task logs has the following requirements:
* The logger is given a command to run and must handle the stdout/stderr of the 
command.
* The handling of stdout/stderr must be resilient across agent failover.  
Logging should not stop if the agent fails.
* Logs should be readable, presumably via the web UI, or via some other 
module-specific UI.
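
To make the requirements concrete, here is a rough sketch of what such a 
module interface could look like.  All names here ({{ContainerIO}}, 
{{initialize}}, {{recover}}, {{prepare}}) are illustrative assumptions, not 
the final API:

{code}
#include <string>

#include <mesos/mesos.hpp>

#include <process/future.hpp>
#include <process/subprocess.hpp>

#include <stout/nothing.hpp>
#include <stout/try.hpp>

// Hypothetical output destinations handed back to the containerizer.
struct ContainerIO
{
  process::Subprocess::IO out;
  process::Subprocess::IO err;
};

class ContainerLogger
{
public:
  virtual ~ContainerLogger() {}

  // One-time setup when the agent loads the module.
  virtual Try<Nothing> initialize() = 0;

  // Re-attach to a running executor after agent failover, so logging
  // (and any log-serving endpoints) continue uninterrupted.
  virtual process::Future<Nothing> recover(
      const mesos::ExecutorInfo& executorInfo,
      const std::string& sandboxDirectory) = 0;

  // Called before launching an executor: decide where the command's
  // stdout/stderr should go (a file, a pipe, a syslog shim, ...).
  virtual process::Future<ContainerIO> prepare(
      const mesos::ExecutorInfo& executorInfo,
      const std::string& sandboxDirectory) = 0;
};
{code}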



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4088) Modularize existing plain-file logging for executor/task logs

2015-12-07 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4088:


 Summary: Modularize existing plain-file logging for executor/task 
logs
 Key: MESOS-4088
 URL: https://issues.apache.org/jira/browse/MESOS-4088
 Project: Mesos
  Issue Type: Task
  Components: modules
Reporter: Joseph Wu
Assignee: Joseph Wu


Once a module for executor/task output logging has been introduced, the default 
module will mirror the existing behavior.  Executor/task stdout/stderr is piped 
into files within the executor's sandbox directory.

The files are exposed in the web UI, via the {{/files}} endpoint.
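
A minimal sketch of how the default module could mirror this behavior, 
reusing the hypothetical {{ContainerIO}} struct and {{prepare}} hook from the 
sketch above: both streams simply point at plain files in the sandbox.

{code}
#include <string>

#include <process/future.hpp>
#include <process/subprocess.hpp>

#include <stout/path.hpp>

// Default behavior: pipe both streams into plain files inside the
// sandbox; the /files endpoint (and hence the web UI) already serves
// these well-known file names.
process::Future<ContainerIO> prepareSandboxLogger(
    const std::string& sandboxDirectory)
{
  return ContainerIO{
      process::Subprocess::PATH(path::join(sandboxDirectory, "stdout")),
      process::Subprocess::PATH(path::join(sandboxDirectory, "stderr"))};
}
{code}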



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2015-12-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15045985#comment-15045985
 ] 

Joseph Wu commented on MESOS-4087:
--

Reviews (WIP):
https://reviews.apache.org/r/41055/
https://reviews.apache.org/r/41057/
https://reviews.apache.org/r/41002/
https://reviews.apache.org/r/41003/
https://reviews.apache.org/r/41004/

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2782) Document the sandbox

2015-12-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-2782:
-
Summary: Document the sandbox  (was: document the sandbox)

I'm going to fold this into the work for the [Logger module|MESOS-4086].

We're going to want:
* A markdown doc.
* Some tests that exercise the default logging behavior (for the sandbox).

> Document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. From an 
> application- or framework-developer perspective, they need to know
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041794#comment-15041794
 ] 

Joseph Wu commented on MESOS-4067:
--

What about pausing the clock and manually controlling when the 
{{allocation_interval}} passes?
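
For reference, a minimal sketch of that pattern using the libprocess test 
clock (fixture and test-body details elided; {{CreateMasterFlags}} and the 
flag name come from the Mesos test helpers):

{code}
#include <gtest/gtest.h>

#include <process/clock.hpp>
#include <process/gtest.hpp>

using process::Clock;

TEST_F(ReservationTest, DISABLED_ACLMultipleOperationsDeterministic)
{
  master::Flags masterFlags = CreateMasterFlags();

  Clock::pause();

  // ... start the master/agent and perform the first operation ...

  // Deterministically trigger exactly one allocation cycle instead of
  // waiting wall-clock time for `allocation_interval` to elapse.
  Clock::advance(masterFlags.allocation_interval);
  Clock::settle();

  // ... assert on the resulting offer, perform the next operation ...

  Clock::resume();
}
{code}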

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-04 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3586:
-
Shepherd: Bernd Mathiske

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware 
> and software configurations. 
> After performing {{../configure}}, {{make}}, and {{make check}}, some servers 
> completed successfully and others failed on the test {{[ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}.
> Is there something I should check in this test? 
> {code}
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky

2015-12-04 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4047:
-
Shepherd: Bernd Mathiske

> MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
> ---
>
> Key: MESOS-4047
> URL: https://issues.apache.org/jira/browse/MESOS-4047
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
> Environment: Ubuntu 14, gcc 4.8.4
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> {code:title=Output from passed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:14.319327  5062 exec.cpp:134] Version: 0.27.0
> I1202 11:09:14.17  5079 exec.cpp:208] Executor registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Registered executor on ubuntu
> Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 5085
> I1202 11:09:14.391739  5077 exec.cpp:254] Received reconnect request from 
> slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> I1202 11:09:14.398598  5082 exec.cpp:231] Executor re-registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Re-registered executor on ubuntu
> Shutting down
> Sending SIGTERM to process tree at pid 5085
> Killing the following process trees:
> [ 
> -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done 
>  \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp 
> ]
> [   OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms)
> {code}
> {code:title=Output from failed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:15.509950  5109 exec.cpp:134] Version: 0.27.0
> I1202 11:09:15.568183  5123 exec.cpp:208] Executor registered on slave 
> 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> Registered executor on ubuntu
> Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6
> Forked command at 5132
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> I1202 11:09:15.665498  5129 exec.cpp:254] Received reconnect request from 
> slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> I1202 11:09:15.670995  5123 exec.cpp:381] Executor asked to shutdown
> Shutting down
> Sending SIGTERM to process tree at pid 5132
> ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure
> (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913
> *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are 
> using GNU date ***
> {code}
> Notice that in the failed test, the executor is asked to shutdown when it 
> tries to reconnect to the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4059) Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters

2015-12-03 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4059:
-
Assignee: Neil Conway

> Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters
> -
>
> Key: MESOS-4059
> URL: https://issues.apache.org/jira/browse/MESOS-4059
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Per comments in MESOS-3916, the fix for that issue decreased the degree of 
> flakiness, but it seems that some intermittent test failures do occur -- 
> should be investigated.
> *Flakiness in task acknowledgment*
> {code}
> I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
> update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
> update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
> expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING 
> (UUID: 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
> acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
> {code}
> This is a race between [launching and acknowledging two 
> tasks|https://github.com/apache/mesos/blob/75cb89fa961b249c9ab7fa0f45dfa9d415a5/src/tests/master_maintenance_tests.cpp#L1486-L1517].
>   The status updates for each task are not necessarily received in the same 
> order in which the tasks were launched.
> *Flakiness in first inverse offer filter*
> See [this comment in 
> MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
>  for the explanation.  The related logs are above the comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4055) SSL-related tests fail reliably in optimized build

2015-12-03 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-4055:


Assignee: Joseph Wu

> SSL-related tests fail reliably in optimized build
> -
>
> Key: MESOS-4055
> URL: https://issues.apache.org/jira/browse/MESOS-4055
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>
> Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimize
> {code}
> most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
> tests is
> {code}
> SSL.Disabled
> SSLTest.BasicSameProcess
> SSLTest.SSLSocket
> SSLTest.NonSSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.RequireBadCA
> SSLTest.VerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.RequireCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ValidDowngrade
> SSLTest.NoValidDowngrade
> SSLTest.ValidDowngradeEachProtocol
> SSLTest.NoValidDowngradeEachProtocol
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> {code}
> The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
> {code}
> [ RUN  ] SSLTest.SSLSocket
> *** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
> using GNU date ***
> PC: @   0x4418f4 Try<>::~Try()
> *** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
> 5950694; stack trace: ***
> @ 0x7fe601a9a340 (unknown)
> @   0x4418f4 Try<>::~Try()
> @   0x5a843c SSLTest::setup_server()
> @   0x595162 SSLTest_SSLSocket_Test::TestBody()
> @   0x5f2428 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ec880 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5cd0ff testing::Test::Run()
> @   0x5cd882 testing::TestInfo::Run()
> @   0x5cdec8 testing::TestCase::Run()
> @   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
> @   0x5f3203 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ed5f4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5d33ac testing::UnitTest::Run()
> @   0x40fd70 main
> @ 0x7fe600024ec5 (unknown)
> @   0x413eb1 (unknown)
> Segmentation fault
> {code}
> Even though we do not typically release optimized builds we should still look 
> into these as optimizations tend to expose fragile constructs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4059) Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters

2015-12-03 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4059:
-
Description: 
Per comments in MESOS-3916, the fix for that issue decreased the degree of 
flakiness, but it seems that some intermittent test failures still occur -- 
these should be investigated.

*Flakiness in task acknowledgment*
{code}
I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-
W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING (UUID: 
82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-
E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
{code}

*Flakiness in first inverse offer filter*
See [this comment in 
MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
 for the explanation.  The related logs are above the comment.

  was:Per comments in MESOS-3916, the fix for that issue decreased the degree 
of flakiness, but it seems that some intermittent test failures still occur -- 
these should be investigated.


> Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters
> -
>
> Key: MESOS-4059
> URL: https://issues.apache.org/jira/browse/MESOS-4059
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Per comments in MESOS-3916, the fix for that issue decreased the degree of 
> flakiness, but it seems that some intermittent test failures still occur -- 
> these should be investigated.
> *Flakiness in task acknowledgment*
> {code}
> I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
> update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
> update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
> expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING 
> (UUID: 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
> acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
> {code}
> *Flakiness in first inverse offer filter*
> See [this comment in 
> MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
>  for the explanation.  The related logs are above the comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4055) SSL-related tests fail reliably in optimized build

2015-12-03 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038440#comment-15038440
 ] 

Joseph Wu commented on MESOS-4055:
--

Note: The flag should be {{--enable-optimize}}.  Looks like {{configure}} will 
silently ignore incorrect flags (and build with {{-O0}}).

> SSL-related tests fail reliably in optimized build
> -
>
> Key: MESOS-4055
> URL: https://issues.apache.org/jira/browse/MESOS-4055
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>
> Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimize
> {code}
> most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
> tests is
> {code}
> SSL.Disabled
> SSLTest.BasicSameProcess
> SSLTest.SSLSocket
> SSLTest.NonSSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.RequireBadCA
> SSLTest.VerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.RequireCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ValidDowngrade
> SSLTest.NoValidDowngrade
> SSLTest.ValidDowngradeEachProtocol
> SSLTest.NoValidDowngradeEachProtocol
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> {code}
> The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
> {code}
> [ RUN  ] SSLTest.SSLSocket
> *** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
> using GNU date ***
> PC: @   0x4418f4 Try<>::~Try()
> *** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
> 5950694; stack trace: ***
> @ 0x7fe601a9a340 (unknown)
> @   0x4418f4 Try<>::~Try()
> @   0x5a843c SSLTest::setup_server()
> @   0x595162 SSLTest_SSLSocket_Test::TestBody()
> @   0x5f2428 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ec880 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5cd0ff testing::Test::Run()
> @   0x5cd882 testing::TestInfo::Run()
> @   0x5cdec8 testing::TestCase::Run()
> @   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
> @   0x5f3203 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ed5f4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5d33ac testing::UnitTest::Run()
> @   0x40fd70 main
> @ 0x7fe600024ec5 (unknown)
> @   0x413eb1 (unknown)
> Segmentation fault
> {code}
> Even though we do not typically release optimized builds we should still look 
> into these as optimizations tend to expose fragile constructs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4055) SSL-related tests fail reliably in optimized build

2015-12-03 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038531#comment-15038531
 ] 

Joseph Wu commented on MESOS-4055:
--

I didn't encounter any problems on Ubuntu 14 (OpenSSL v1.0.1f,  Libevent 
v2.0.21, gcc 4.8.4).

Can you try again?  (Maybe {{make clean}} beforehand?)

> SSL-related tests fail reliably in optimized build
> -
>
> Key: MESOS-4055
> URL: https://issues.apache.org/jira/browse/MESOS-4055
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>
> Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimize
> {code}
> most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
> tests is
> {code}
> SSL.Disabled
> SSLTest.BasicSameProcess
> SSLTest.SSLSocket
> SSLTest.NonSSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.RequireBadCA
> SSLTest.VerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.RequireCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ValidDowngrade
> SSLTest.NoValidDowngrade
> SSLTest.ValidDowngradeEachProtocol
> SSLTest.NoValidDowngradeEachProtocol
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> {code}
> The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
> {code}
> [ RUN  ] SSLTest.SSLSocket
> *** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
> using GNU date ***
> PC: @   0x4418f4 Try<>::~Try()
> *** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
> 5950694; stack trace: ***
> @ 0x7fe601a9a340 (unknown)
> @   0x4418f4 Try<>::~Try()
> @   0x5a843c SSLTest::setup_server()
> @   0x595162 SSLTest_SSLSocket_Test::TestBody()
> @   0x5f2428 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ec880 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5cd0ff testing::Test::Run()
> @   0x5cd882 testing::TestInfo::Run()
> @   0x5cdec8 testing::TestCase::Run()
> @   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
> @   0x5f3203 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ed5f4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5d33ac testing::UnitTest::Run()
> @   0x40fd70 main
> @ 0x7fe600024ec5 (unknown)
> @   0x413eb1 (unknown)
> Segmentation fault
> {code}
> Even though we do not typically release optimized builds we should still look 
> into these as optimizations tend to expose fragile constructs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4055) SSL-related tests fail reliably in optimized build

2015-12-03 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4055:
-
Description: 
Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
{code}
% ../configure --enable-ssl --enable-libevent --enable-optimize
{code}

most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
tests is
{code}
SSL.Disabled
SSLTest.BasicSameProcess
SSLTest.SSLSocket
SSLTest.NonSSLSocket
SSLTest.NoVerifyBadCA
SSLTest.RequireBadCA
SSLTest.VerifyBadCA
SSLTest.VerifyCertificate
SSLTest.RequireCertificate
SSLTest.ProtocolMismatch
SSLTest.ValidDowngrade
SSLTest.NoValidDowngrade
SSLTest.ValidDowngradeEachProtocol
SSLTest.NoValidDowngradeEachProtocol
SSLTest.PeerAddress
SSLTest.HTTPSGet
SSLTest.HTTPSPost
{code}

The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
{code}
[ RUN  ] SSLTest.SSLSocket
*** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
using GNU date ***
PC: @   0x4418f4 Try<>::~Try()
*** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
5950694; stack trace: ***
@ 0x7fe601a9a340 (unknown)
@   0x4418f4 Try<>::~Try()
@   0x5a843c SSLTest::setup_server()
@   0x595162 SSLTest_SSLSocket_Test::TestBody()
@   0x5f2428 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@   0x5ec880 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@   0x5cd0ff testing::Test::Run()
@   0x5cd882 testing::TestInfo::Run()
@   0x5cdec8 testing::TestCase::Run()
@   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
@   0x5f3203 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@   0x5ed5f4 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@   0x5d33ac testing::UnitTest::Run()
@   0x40fd70 main
@ 0x7fe600024ec5 (unknown)
@   0x413eb1 (unknown)
Segmentation fault
{code}

Even though we do not typically release optimized builds we should still look 
into these as optimizations tend to expose fragile constructs.


  was:
Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
{code}
% ../configure --enable-ssl --enable-libevent --enable-optimized
{code}

most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
tests is
{code}
SSL.Disabled
SSLTest.BasicSameProcess
SSLTest.SSLSocket
SSLTest.NonSSLSocket
SSLTest.NoVerifyBadCA
SSLTest.RequireBadCA
SSLTest.VerifyBadCA
SSLTest.VerifyCertificate
SSLTest.RequireCertificate
SSLTest.ProtocolMismatch
SSLTest.ValidDowngrade
SSLtest.NoValidDowngrade
SSLTest.NoValidDowngrade
SSLTest.ValidDowngradeEachProtocol
SSLTest.NoValidDowngradeEachProtocol
SSLTest.PeerAddress
SSLTest.HTTPSGet
SSLTest.HTTPSPost
{code}

The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
{code}
[ RUN  ] SSLTest.SSLSocket
*** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
using GNU date ***
PC: @   0x4418f4 Try<>::~Try()
*** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
5950694; stack trace: ***
@ 0x7fe601a9a340 (unknown)
@   0x4418f4 Try<>::~Try()
@   0x5a843c SSLTest::setup_server()
@   0x595162 SSLTest_SSLSocket_Test::TestBody()
@   0x5f2428 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@   0x5ec880 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@   0x5cd0ff testing::Test::Run()
@   0x5cd882 testing::TestInfo::Run()
@   0x5cdec8 testing::TestCase::Run()
@   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
@   0x5f3203 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@   0x5ed5f4 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@   0x5d33ac testing::UnitTest::Run()
@   0x40fd70 main
@ 0x7fe600024ec5 (unknown)
@   0x413eb1 (unknown)
Segmentation fault
{code}

Even though we do not typically release optimized builds we should still look 
into these as optimizations tend to expose fragile constructs.



> SSL-related tests fail reliably in optimized build
> -
>
> Key: MESOS-4055
> URL: https://issues.apache.org/jira/browse/MESOS-4055
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>
> Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimize
> {code}
> most SSL-related tests fail reliably with SIGSEGV. The full 

[jira] [Updated] (MESOS-4059) Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters

2015-12-03 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4059:
-
Description: 
Per comments in MESOS-3916, the fix for that issue decreased the degree of 
flakiness, but it seems that some intermittent test failures still occur -- 
these should be investigated.

*Flakiness in task acknowledgment*
{code}
I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-
W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING (UUID: 
82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-
E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
{code}

This is a race between [launching and acknowledging two 
tasks|https://github.com/apache/mesos/blob/75cb89fa961b249c9ab7fa0f45dfa9d415a5/src/tests/master_maintenance_tests.cpp#L1486-L1517].
  The status updates for each task are not necessarily received in the same 
order in which the tasks were launched.
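
A rough sketch of an order-independent way to capture both updates
(illustrative only: {{sched}}, {{driver}}, {{offer}}, and the two tasks are
placeholders for the test's actual setup, while {{FutureArg}} and
{{AWAIT_READY}} are the usual libprocess test helpers):
{code}
// The two TASK_RUNNING updates may arrive in either order, so capture
// them without assuming which task each one belongs to.
Future<TaskStatus> status1;
Future<TaskStatus> status2;
EXPECT_CALL(sched, statusUpdate(&driver, _))
  .WillOnce(FutureArg<1>(&status1))
  .WillOnce(FutureArg<1>(&status2));

driver.launchTasks(offer.id(), {task1, task2});

AWAIT_READY(status1);
AWAIT_READY(status2);

// Match each received update to its task by ID before acknowledging it,
// rather than assuming status1 corresponds to task1.
{code}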

*Flakiness in first inverse offer filter*
See [this comment in 
MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
 for the explanation.  The related logs are above the comment.

  was:
Per comments in MESOS-3916, the fix for that issue decreased the degree of 
flakiness, but it seems that some intermittent test failures still occur -- 
these should be investigated.

*Flakiness in task acknowledgment*
{code}
I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-
W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING (UUID: 
82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-
E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
{code}

*Flakiness in first inverse offer filter*
See [this comment in 
MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
 for the explanation.  The related logs are above the comment.


> Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters
> -
>
> Key: MESOS-4059
> URL: https://issues.apache.org/jira/browse/MESOS-4059
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Per comments in MESOS-3916, the fix for that issue decreased the degree of 
> flakiness, but it seems that some intermittent test failures still occur -- 
> these should be investigated.
> *Flakiness in task acknowledgment*
> {code}
> I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
> update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
> update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
> expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING 
> (UUID: 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
> acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
> {code}
> This is a race between [launching and acknowledging two 
> 

[jira] [Commented] (MESOS-4055) SSL-related tests fail reliably in optimized build

2015-12-03 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038467#comment-15038467
 ] 

Joseph Wu commented on MESOS-4055:
--

No problems with {{SSLTests}} on OSX (Yosemite) + clang with {{-O2}}.

> SSL-related tests fail reliably in optimized build
> -
>
> Key: MESOS-4055
> URL: https://issues.apache.org/jira/browse/MESOS-4055
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>
> Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimize
> {code}
> most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
> tests is
> {code}
> SSL.Disabled
> SSLTest.BasicSameProcess
> SSLTest.SSLSocket
> SSLTest.NonSSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.RequireBadCA
> SSLTest.VerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.RequireCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ValidDowngrade
> SSLtest.NoValidDowngrade
> SSLTest.NoValidDowngrade
> SSLTest.ValidDowngradeEachProtocol
> SSLTest.NoValidDowngradeEachProtocol
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> {code}
> The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
> {code}
> [ RUN  ] SSLTest.SSLSocket
> *** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
> using GNU date ***
> PC: @   0x4418f4 Try<>::~Try()
> *** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
> 5950694; stack trace: ***
> @ 0x7fe601a9a340 (unknown)
> @   0x4418f4 Try<>::~Try()
> @   0x5a843c SSLTest::setup_server()
> @   0x595162 SSLTest_SSLSocket_Test::TestBody()
> @   0x5f2428 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ec880 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5cd0ff testing::Test::Run()
> @   0x5cd882 testing::TestInfo::Run()
> @   0x5cdec8 testing::TestCase::Run()
> @   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
> @   0x5f3203 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ed5f4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5d33ac testing::UnitTest::Run()
> @   0x40fd70 main
> @ 0x7fe600024ec5 (unknown)
> @   0x413eb1 (unknown)
> Segmentation fault
> {code}
> Even though we do not typically release optimized builds we should still look 
> into these as optimizations tend to expose fragile constructs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3183) Documentation images do not load

2015-12-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3183:
-
Description: 
Any images which are referenced from the generated docs ({{docs/*.md}}) do not 
show up on the website.  For example:
* [Architecture|http://mesos.apache.org/documentation/latest/architecture/]
* [External 
Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/]
* [Fetcher Cache 
Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/]
* [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/]   
* 
[Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/]


  was:
Any images which are referenced from the generated docs ({{docs/*.md}}) do not 
show up on the website.  For example:
* [External 
Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/]
* [Fetcher Cache 
Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/]
* [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/]   
* 
[Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/]



> Documentation images do not load
> 
>
> Key: MESOS-3183
> URL: https://issues.apache.org/jira/browse/MESOS-3183
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.24.0
>Reporter: James Mulcahy
>Priority: Minor
>  Labels: mesosphere
> Attachments: rake.patch
>
>
> Any images which are referenced from the generated docs ({{docs/*.md}}) do 
> not show up on the website.  For example:
> * [Architecture|http://mesos.apache.org/documentation/latest/architecture/]
> * [External 
> Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/]
> * [Fetcher Cache 
> Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/]
> * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] 
> * 
> [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky

2015-12-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-4047:


Assignee: Joseph Wu

> MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
> ---
>
> Key: MESOS-4047
> URL: https://issues.apache.org/jira/browse/MESOS-4047
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
> Environment: Ubuntu 14, gcc 4.8.4
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> {code:title=Output from passed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:14.319327  5062 exec.cpp:134] Version: 0.27.0
> I1202 11:09:14.17  5079 exec.cpp:208] Executor registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Registered executor on ubuntu
> Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 5085
> I1202 11:09:14.391739  5077 exec.cpp:254] Received reconnect request from 
> slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> I1202 11:09:14.398598  5082 exec.cpp:231] Executor re-registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Re-registered executor on ubuntu
> Shutting down
> Sending SIGTERM to process tree at pid 5085
> Killing the following process trees:
> [ 
> -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done 
>  \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp 
> ]
> [   OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms)
> {code}
> {code:title=Output from failed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:15.509950  5109 exec.cpp:134] Version: 0.27.0
> I1202 11:09:15.568183  5123 exec.cpp:208] Executor registered on slave 
> 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> Registered executor on ubuntu
> Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6
> Forked command at 5132
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> I1202 11:09:15.665498  5129 exec.cpp:254] Received reconnect request from 
> slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> I1202 11:09:15.670995  5123 exec.cpp:381] Executor asked to shutdown
> Shutting down
> Sending SIGTERM to process tree at pid 5132
> ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure
> (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913
> *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are 
> using GNU date ***
> {code}
> Notice that in the failed test, the executor is asked to shutdown when it 
> tries to reconnect to the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3586:
-
Description: 
I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
software configurations.

After performing {{../configure}}, {{make}}, and {{make check}}, some servers 
completed successfully and others failed on the test {{[ RUN  ] 
MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}.

Is there something I should check in this test? 

{code}
PERFORMED MAKE CHECK NODE-001
[ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
20151005-143735-2393768202-35106-27900-S0
Registered executor on svdidac038.techlabs.accenture.com
Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
Forked command at 38510
sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'


PERFORMED MAKE CHECK NODE-002
[ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
20151005-143857-2360213770-50427-26325-S0
Registered executor on svdidac039.techlabs.accenture.com
Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
Forked command at 37028
../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
Expected: (usage.get().mem_medium_pressure_counter()) >= 
(usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
2015-10-05 
14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)
{code}

  was:
I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
software configurations.

After performing ../configure, make, and make check, some servers completed 
successfully and others failed on the test [ RUN  ] 
MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.

Is there something I should check in this test? 

PERFORMED MAKE CHECK NODE-001
[ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
20151005-143735-2393768202-35106-27900-S0
Registered executor on svdidac038.techlabs.accenture.com
Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
Forked command at 38510
sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'


PERFORMED MAKE CHECK NODE-002
[ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
20151005-143857-2360213770-50427-26325-S0
Registered executor on svdidac039.techlabs.accenture.com
Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
Forked command at 37028
../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
Expected: (usage.get().mem_medium_pressure_counter()) >= 
(usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
2015-10-05 
14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)



> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing {{../configure}}, {{make}}, and {{make check}}, some servers 
> completed successfully and others failed on the test {{[ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}.
> Is there something I should check in this test? 
> {code}
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on 

[jira] [Created] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky

2015-12-02 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4047:


 Summary: MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is 
flaky
 Key: MESOS-4047
 URL: https://issues.apache.org/jira/browse/MESOS-4047
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.26.0
 Environment: Ubuntu 14, gcc 4.8.4
Reporter: Joseph Wu


{code:title=Output from passed test}
[--] 1 test from MemoryPressureMesosTest
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s
[ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
I1202 11:09:14.319327  5062 exec.cpp:134] Version: 0.27.0
I1202 11:09:14.17  5079 exec.cpp:208] Executor registered on slave 
bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
Registered executor on ubuntu
Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162
sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
Forked command at 5085
I1202 11:09:14.391739  5077 exec.cpp:254] Received reconnect request from slave 
bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
I1202 11:09:14.398598  5082 exec.cpp:231] Executor re-registered on slave 
bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
Re-registered executor on ubuntu
Shutting down
Sending SIGTERM to process tree at pid 5085
Killing the following process trees:
[ 
-+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done 
 \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp 
]
[   OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms)
{code}

{code:title=Output from failed test}
[--] 1 test from MemoryPressureMesosTest
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s
[ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
I1202 11:09:15.509950  5109 exec.cpp:134] Version: 0.27.0
I1202 11:09:15.568183  5123 exec.cpp:208] Executor registered on slave 
88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
Registered executor on ubuntu
Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6
Forked command at 5132
sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
I1202 11:09:15.665498  5129 exec.cpp:254] Received reconnect request from slave 
88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
I1202 11:09:15.670995  5123 exec.cpp:381] Executor asked to shutdown
Shutting down
Sending SIGTERM to process tree at pid 5132
../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure
(usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913
*** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are 
using GNU date ***
{code}

Notice that in the failed test, the executor is asked to shutdown when it tries 
to reconnect to the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky

2015-12-02 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036465#comment-15036465
 ] 

Joseph Wu commented on MESOS-4047:
--

Note: {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} has similar 
logic for restarting the agent, re-registering an executor, and [calling 
{{MesosContainerizer::usage}}|https://github.com/apache/mesos/blob/master/src/tests/slave_recovery_tests.cpp#L3267].
  But this test is stable.  

The flaky test waits on:
{code}
  Future<Nothing> _recover = FUTURE_DISPATCH(_, &Slave::_recover);

  Future<SlaveReregisteredMessage> slaveReregisteredMessage =
    FUTURE_PROTOBUF(SlaveReregisteredMessage(), _, _);
{code}

Whereas the stable test waits on:
{code}
  // Set up so we can wait until the new slave updates the container's
  // resources (this occurs after the executor has re-registered).
  Future<Nothing> update =
    FUTURE_DISPATCH(_, &MesosContainerizerProcess::update);
{code}
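
A hedged sketch of how the flaky test could adopt the stable test's
synchronization (assuming, as in {{slave_recovery_tests.cpp}}, that
{{MesosContainerizerProcess::update}} is dispatched once the re-registered
executor's resources are updated; {{containerizer}} and {{containerId}}
follow the existing test's conventions and the agent restart is elided):
{code}
// Wait for the containerizer to update the container's resources, not
// just for the SlaveReregisteredMessage; only then is usage() safe.
Future<Nothing> update =
  FUTURE_DISPATCH(_, &MesosContainerizerProcess::update);

// ... restart the agent and let the executor re-register ...

AWAIT_READY(update);

Future<ResourceStatistics> usage = containerizer->usage(containerId);
AWAIT_READY(usage);
{code}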

> MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
> ---
>
> Key: MESOS-4047
> URL: https://issues.apache.org/jira/browse/MESOS-4047
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
> Environment: Ubuntu 14, gcc 4.8.4
>Reporter: Joseph Wu
>  Labels: flaky, flaky-test
>
> {code:title=Output from passed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:14.319327  5062 exec.cpp:134] Version: 0.27.0
> I1202 11:09:14.17  5079 exec.cpp:208] Executor registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Registered executor on ubuntu
> Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 5085
> I1202 11:09:14.391739  5077 exec.cpp:254] Received reconnect request from 
> slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> I1202 11:09:14.398598  5082 exec.cpp:231] Executor re-registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Re-registered executor on ubuntu
> Shutting down
> Sending SIGTERM to process tree at pid 5085
> Killing the following process trees:
> [ 
> -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done 
>  \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp 
> ]
> [   OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms)
> {code}
> {code:title=Output from failed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:15.509950  5109 exec.cpp:134] Version: 0.27.0
> I1202 11:09:15.568183  5123 exec.cpp:208] Executor registered on slave 
> 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> Registered executor on ubuntu
> Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6
> Forked command at 5132
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> I1202 11:09:15.665498  5129 exec.cpp:254] Received reconnect request from 
> slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> I1202 11:09:15.670995  5123 exec.cpp:381] Executor asked to shutdown
> Shutting down
> Sending SIGTERM to process tree at pid 5132
> ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure
> (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913
> *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are 
> using GNU date ***
> {code}
> Notice that in the failed test, the executor is asked to shutdown when it 
> tries to reconnect to the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-01 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034948#comment-15034948
 ] 

Joseph Wu commented on MESOS-3586:
--

This race _almost_ seems unavoidable (at least, given the test as currently 
written), and I don't think the sleep duration is really a problem.

*Background*
Both tests are essentially hammering away at memory, resulting in "memory 
pressure".  Depending on the load (low, medium, critical), this triggers some 
cgroup status events.  By definition, the "low" pressure event is always 
triggered whenever there is any pressure at all:
{quote}
Application will be notified through eventfd when memory pressure is at
the specific level (or higher).
{quote}
[Reference section "11. Memory 
Pressure"|https://www.kernel.org/doc/Documentation/cgroups/memory.txt]

In the tests, we check this by expecting "number of low pressure events" >= 
"number of medium pressure events" >= "number of critical pressure events".

*Problem*
There's no guarantee of the order of notification.  When we read from our 
memory pressure counters, there might be some events in-flight that haven't 
been processed yet.  Therefore, we occasionally see our expectations betrayed.

*???*
The memory pressure event counts should be eventually consistent with our 
expectations.  So the test should probably:
* Stop the memory-hammering task at some point.
* Wait for all pressure events to be processed.
* Then check the counters (a rough sketch follows below).
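
A rough sketch of that ordering (illustrative only, not necessarily the
actual patch; {{task}}, {{statusKilled}}, {{containerizer}}, and
{{containerId}} follow the existing test's conventions):
{code}
// 1. Stop the memory-hammering task so no new pressure events are produced.
driver.killTask(task.task_id());
AWAIT_READY(statusKilled);

// 2. Read the counters only once the task is gone; by then any in-flight
//    pressure events should have been processed and counted.
Future<ResourceStatistics> usage = containerizer->usage(containerId);
AWAIT_READY(usage);

EXPECT_GE(usage.get().mem_low_pressure_counter(),
          usage.get().mem_medium_pressure_counter());
EXPECT_GE(usage.get().mem_medium_pressure_counter(),
          usage.get().mem_critical_pressure_counter());
{code}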

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>  Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing ../configure, make, and make check, some servers 
> completed successfully and others failed on the test [ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.
> Is there something I should check in this test? 
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-01 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3586:
-
Affects Version/s: 0.26.0
  Environment: 
Ubuntu 14.04, 3.13.0-32 generic
Debian 8, gcc 4.9.2

  was:Ubuntu 14.04, 3.13.0-32 generic

  Summary: MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
CGROUPS_ROOT_SlaveRecovery are flaky  (was: Installing Mesos 0.24.0 on multiple 
systems. Failed test on MemoryPressureMesosTest.CGROUPS_ROOT_Statistics)

The {{CGROUPS_ROOT_Statistics}} and {{CGROUPS_ROOT_SlaveRecovery}} tests are 
both similarly flaky.

The tests also fail on Debian 8 with the same error.

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing ../configure, make, and make check, some servers 
> completed successfully and others failed on the test [ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.
> Is there something I should check in this test? 
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-01 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-3586:


Assignee: Joseph Wu

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing ../configure, make, and make check, some servers 
> completed successfully and others failed on the test [ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.
> Is there something I should check in this test? 
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-01 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035017#comment-15035017
 ] 

Joseph Wu commented on MESOS-3586:
--

Review: https://reviews.apache.org/r/40849/

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing ../configure, make, and make check, some servers 
> completed successfully and others failed on the test [ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.
> Is there something I should check in this test? 
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3975) SSL build of mesos causes flaky testsuite.

2015-11-30 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032088#comment-15032088
 ] 

Joseph Wu commented on MESOS-3975:
--

In that case, I posted a fix a week ago:
https://reviews.apache.org/r/40453/
https://reviews.apache.org/r/40454/

> SSL build of mesos causes flaky testsuite.
> --
>
> Key: MESOS-3975
> URL: https://issues.apache.org/jira/browse/MESOS-3975
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
> Environment: CentOS 7.1, Kernel 3.10.0-229.20.1.el7.x86_64, gcc 
> 4.8.3, Docker 1.9
>Reporter: Till Toenshoff
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> When running the tests of an SSL build of Mesos on CentOS 7.1, I see spurious 
> test failures that are, so far, not reproducible.
> The following tests did fail for me in complete runs but did seem fine when 
> running them individually, in repetition.  
> {noformat}
> DockerTest.ROOT_DOCKER_CheckPortResource
> {noformat}
> {noformat}
> ContainerizerTest.ROOT_CGROUPS_BalloonFramework
> {noformat}
> {noformat}
> [ RUN  ] 
> LinuxFilesystemIsolatorTest.ROOT_ChangeRootFilesystemCommandExecutor
> 2015-11-20 
> 19:08:38,826:21380(0x7fa10d5f2700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:53444] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> + /home/vagrant/mesos/build/src/mesos-containerizer mount --help=false 
> --operation=make-rslave --path=/
> + grep -E 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_ChangeRootFilesystemCommandExecutor_Tz7P8c/.+
>  /proc/self/mountinfo
> + grep -v 2b98025c-74f1-41d2-b35a-ce2cdfae347e
> + cut '-d ' -f5
> + xargs --no-run-if-empty umount -l
> + mount -n --rbind 
> /tmp/LinuxFilesystemIsolatorTest_ROOT_ChangeRootFilesystemCommandExecutor_Tz7P8c/provisioner/containers/2b98025c-74f1-41d2-b35a-ce2cdfae347e/backends/copy/rootfses/bed11080-474b-4c69-8e7f-0ab85e895b0d
>  
> /tmp/LinuxFilesystemIsolatorTest_ROOT_ChangeRootFilesystemCommandExecutor_Tz7P8c/slaves/830e842e-c36a-4e4c-bff4-5b9568d7df12-S0/frameworks/830e842e-c36a-4e4c-bff4-5b9568d7df12-/executors/c735be54-c47f-4645-bfc1-2f4647e2cddb/runs/2b98025c-74f1-41d2-b35a-ce2cdfae347e/.rootfs
> Could not load cert file
> ../../src/tests/containerizer/filesystem_isolator_tests.cpp:354: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_FAILED
> Expected: TASK_RUNNING
> 2015-11-20 
> 19:08:42,164:21380(0x7fa10d5f2700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:53444] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> 2015-11-20 
> 19:08:45,501:21380(0x7fa10d5f2700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:53444] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> 2015-11-20 
> 19:08:48,837:21380(0x7fa10d5f2700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:53444] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> 2015-11-20 
> 19:08:52,174:21380(0x7fa10d5f2700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:53444] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> ../../src/tests/containerizer/filesystem_isolator_tests.cpp:355: Failure
> Failed to wait 15secs for statusFinished
> ../../src/tests/containerizer/filesystem_isolator_tests.cpp:349: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(&driver, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> 2015-11-20 
> 19:08:55,511:21380(0x7fa10d5f2700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:53444] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> *** Aborted at 1448046536 (unix time) try "date -d @1448046536" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGSEGV (@0x0) received by PID 21380 (TID 0x7fa1549e68c0) from PID 0; 
> stack trace: ***
> @ 0x7fa141796fbb (unknown)
> @ 0x7fa14179b341 (unknown)
> @ 0x7fa14f096130 (unknown)
> {noformat}
> Vagrantfile generator:
> {noformat}
> cat << EOF > Vagrantfile
> # -*- mode: ruby -*-
> # vi: set ft=ruby :
> Vagrant.configure(2) do |config|
>   # Disable shared folder to prevent certain kernel module dependencies.
>   config.vm.synced_folder ".", "/vagrant", disabled: true
>   config.vm.hostname = "centos71"
>   config.vm.box = "bento/centos-7.1"
>   config.vm.provider "virtualbox" do |vb|
> vb.memory = 16384
> vb.cpus = 8
>   end
>   config.vm.provider "vmware_fusion" do |vb|
> vb.memory = 9216
> vb.cpus = 4
>   end
>   config.vm.provision "shell", inline: <<-SHELL
>  sudo yum -y update systemd
>  sudo yum install -y tar wget
>   

[jira] [Comment Edited] (MESOS-3753) Test the HTTP Scheduler library with SSL enabled

2015-11-30 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012106#comment-15012106
 ] 

Joseph Wu edited comment on MESOS-3753 at 11/30/15 6:09 PM:


A little test cleanup:
https://reviews.apache.org/r/40453/
https://reviews.apache.org/r/40454/

Edit: These reviews will be tracked by [MESOS-3975].


was (Author: kaysoky):
A little test cleanup:
https://reviews.apache.org/r/40453/
https://reviews.apache.org/r/40454/

> Test the HTTP Scheduler library with SSL enabled
> 
>
> Key: MESOS-3753
> URL: https://issues.apache.org/jira/browse/MESOS-3753
> Project: Mesos
>  Issue Type: Story
>  Components: framework, HTTP API, test
>Reporter: Joseph Wu
>Assignee: Anand Mazumdar
>  Labels: mesosphere, security
>
> Currently, the HTTP Scheduler library does not support SSL-enabled Mesos.  
> (You can manually test this by spinning up an SSL-enabled master and attempt 
> to run the event-call framework example against it.)
> We need to add tests that check the HTTP Scheduler library against 
> SSL-enabled Mesos:
> * with downgrade support,
> * with required framework/client-side certifications,
> * with/without verification of certificates (master-side),
> * with/without verification of certificates (framework-side),
> * with a custom certificate authority (CA)
> These options should be controlled by the same environment variables found on 
> the [SSL user doc|http://mesos.apache.org/documentation/latest/ssl/].
> Note: This issue will be broken down into smaller sub-issues as bugs/problems 
> are discovered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3916) MasterMaintenanceTest.InverseOffersFilters is flaky

2015-11-25 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027478#comment-15027478
 ] 

Joseph Wu commented on MESOS-3916:
--

That's a very odd failure.

The first batch of inverse offers are both received by the master:
{code:title=inverseOffer2}
I1125 10:05:53.152995 29359 master.cpp:3316] Processing DECLINE call for 
offers: [ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O3 ] for framework 
932f7d7b-f2d4-42c7-9391-222c19b9d35b- (default)
{code}

Note: This message shows up regardless, since {{Master::GetOffer}} does not 
search for inverse offers.  We might want to silence this incorrect warning.
{code:title=inverseOffer1}
W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers '[ 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
{code}

Somehow, the allocation was not triggered by the subsequent clock advancement 
in the test.  I'm guessing (see the sketch below):
# The clock was settled while the ACCEPT call was still in flight.
# The clock was then advanced before the ACCEPT call reached the master.  
[This comment seems 
relevant|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2845-L2856]
# The allocation went ahead, meaning the inverse offer had not yet been 
filtered down to 0 seconds.
# The clock stays paused, so we never allocate again -> the test times out.
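
A minimal sketch of that interleaving (illustrative only, not the actual test 
body; it assumes the test drives a {{MesosSchedulerDriver}} with the clock 
paused, and names like {{driver}}, {{inverseOffer}}, and {{masterFlags}} are 
placeholders):
{code}
// Clock is paused for the whole test.
Clock::pause();

// (1) The scheduler sends the ACCEPT call; it is now an in-flight message.
driver.acceptInverseOffers({inverseOffer.id()}, filters);

// (2) Clock::settle() only waits until no more events are enqueued; if the
//     ACCEPT message has not reached the master yet, this can return early.
Clock::settle();

// (3) Advancing now fires the allocation before the master has applied the
//     0-second inverse offer filter from the ACCEPT call...
Clock::advance(masterFlags.allocation_interval);

// (4) ...and with the clock still paused, no further allocation happens, so
//     the test times out waiting for the next inverse offer.
{code}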

> MasterMaintenanceTest.InverseOffersFilters is flaky
> ---
>
> Key: MESOS-3916
> URL: https://issues.apache.org/jira/browse/MESOS-3916
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu Wily 64 bit
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: flaky-test, maintenance, mesosphere
> Attachments: wily_maintenance_test_verbose.txt
>
>
> Verbose Logs:
> {code}
> [ RUN  ] MasterMaintenanceTest.InverseOffersFilters
> I1113 16:43:58.486469  8728 leveldb.cpp:176] Opened db in 2.360405ms
> I1113 16:43:58.486935  8728 leveldb.cpp:183] Compacted db in 407105ns
> I1113 16:43:58.486995  8728 leveldb.cpp:198] Created db iterator in 16221ns
> I1113 16:43:58.487030  8728 leveldb.cpp:204] Seeked to beginning of db in 
> 10935ns
> I1113 16:43:58.487046  8728 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 999ns
> I1113 16:43:58.487090  8728 replica.cpp:780] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1113 16:43:58.487735  8747 recover.cpp:449] Starting replica recovery
> I1113 16:43:58.488047  8747 recover.cpp:475] Replica is in EMPTY status
> I1113 16:43:58.488977  8745 replica.cpp:676] Replica in EMPTY status received 
> a broadcasted recover request from (58)@10.0.2.15:45384
> I1113 16:43:58.489452  8746 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1113 16:43:58.489712  8747 recover.cpp:566] Updating replica status to 
> STARTING
> I1113 16:43:58.490706  8742 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 745443ns
> I1113 16:43:58.490739  8742 replica.cpp:323] Persisted replica status to 
> STARTING
> I1113 16:43:58.490859  8742 recover.cpp:475] Replica is in STARTING status
> I1113 16:43:58.491786  8747 replica.cpp:676] Replica in STARTING status 
> received a broadcasted recover request from (59)@10.0.2.15:45384
> I1113 16:43:58.492542  8749 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1113 16:43:58.493221  8743 recover.cpp:566] Updating replica status to VOTING
> I1113 16:43:58.493710  8743 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 331874ns
> I1113 16:43:58.493767  8743 replica.cpp:323] Persisted replica status to 
> VOTING
> I1113 16:43:58.493868  8743 recover.cpp:580] Successfully joined the Paxos 
> group
> I1113 16:43:58.494119  8743 recover.cpp:464] Recover process terminated
> I1113 16:43:58.504369  8749 master.cpp:367] Master 
> d59449fc-5462-43c5-b935-e05563fdd4b6 (vagrant-ubuntu-wily-64) started on 
> 10.0.2.15:45384
> I1113 16:43:58.504438  8749 master.cpp:369] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ZB7csS/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> 

[jira] [Commented] (MESOS-3969) Failing 'make distcheck' on Debian 8, somehow SSL-related.

2015-11-24 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025147#comment-15025147
 ] 

Joseph Wu commented on MESOS-3969:
--

Thanks!

I'll test this on a few systems to double-check we don't have any odd 
{{distcheck}} dependency on this particular {{pip}} version.

> Failing 'make distcheck' on Debian 8, somehow SSL-related.
> --
>
> Key: MESOS-3969
> URL: https://issues.apache.org/jira/browse/MESOS-3969
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
> Environment: Debian 8, gcc 4.9.2, Docker 1.9.0, vagrant, libvirt
> Vagrantfile see MESOS-3957
>Reporter: Bernd Mathiske
>Assignee: Joseph Wu
>  Labels: build, build-failure, mesosphere
>
> As non-root: make distcheck.
> {noformat}
> /bin/mkdir -p '/home/vagrant/mesos/build/mesos-0.26.0/_inst/bin'
> /bin/bash ../libtool --mode=install /usr/bin/install -c mesos-local mesos-log 
> mesos mesos-execute mesos-resolve 
> '/home/vagrant/mesos/build/mesos-0.26.0/_inst/bin'
> libtool: install: /usr/bin/install -c .libs/mesos-local 
> /home/vagrant/mesos/build/mesos-0.26.0/_inst/bin/mesos-local
> libtool: install: /usr/bin/install -c .libs/mesos-log 
> /home/vagrant/mesos/build/mesos-0.26.0/_inst/bin/mesos-log
> libtool: install: /usr/bin/install -c .libs/mesos 
> /home/vagrant/mesos/build/mesos-0.26.0/_inst/bin/mesos
> libtool: install: /usr/bin/install -c .libs/mesos-execute 
> /home/vagrant/mesos/build/mesos-0.26.0/_inst/bin/mesos-execute
> libtool: install: /usr/bin/install -c .libs/mesos-resolve 
> /home/vagrant/mesos/build/mesos-0.26.0/_inst/bin/mesos-resolve
> Traceback (most recent call last):
> File "<string>", line 1, in <module>
> File 
> "/home/vagrant/mesos/build/mesos-0.26.0/_build/3rdparty/pip-1.5.6/pip/__init__.py",
>  line 11, in <module>
> from pip.vcs import git, mercurial, subversion, bazaar # noqa
> File 
> "/home/vagrant/mesos/build/mesos-0.26.0/_build/3rdparty/pip-1.5.6/pip/vcs/mercurial.py",
>  line 9, in <module>
> from pip.download import path_to_url
> File 
> "/home/vagrant/mesos/build/mesos-0.26.0/_build/3rdparty/pip-1.5.6/pip/download.py",
>  line 22, in <module>
> from pip._vendor import requests, six
> File 
> "/home/vagrant/mesos/build/mesos-0.26.0/_build/3rdparty/pip-1.5.6/pip/_vendor/requests/__init__.py",
>  line 53, in <module>
> from .packages.urllib3.contrib import pyopenssl
> File 
> "/home/vagrant/mesos/build/mesos-0.26.0/_build/3rdparty/pip-1.5.6/pip/_vendor/requests/packages/urllib3/contrib/pyopenssl.py",
>  line 70, in <module>
> ssl.PROTOCOL_SSLv3: OpenSSL.SSL.SSLv3_METHOD,
> AttributeError: 'module' object has no attribute 'PROTOCOL_SSLv3'
> Traceback (most recent call last):
> File "<string>", line 1, in <module>
> File "/home/vagrant/mesos/build/mesos-0.26.0/_build/3rd
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3975) SSL build of mesos causes flaky testsuite.

2015-11-23 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023253#comment-15023253
 ] 

Joseph Wu commented on MESOS-3975:
--

It might also be worthwhile to check if the tests fail without {{--enable-ssl}}.

> SSL build of mesos causes flaky testsuite.
> --
>
> Key: MESOS-3975
> URL: https://issues.apache.org/jira/browse/MESOS-3975
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
> Environment: CentOS 7.1, Kernel 3.10.0-229.20.1.el7.x86_64, gcc 
> 4.8.3, Docker 1.9
>Reporter: Till Toenshoff
>Assignee: Joris Van Remoortere
>  Labels: mesosphere

[jira] [Issue Comment Deleted] (MESOS-3975) SSL build of mesos causes flaky testsuite.

2015-11-23 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3975:
-
Comment: was deleted

(was: It might also be worthwhile to check if the tests fail without 
{{--enable-ssl}}.)

> SSL build of mesos causes flaky testsuite.
> --
>
> Key: MESOS-3975
> URL: https://issues.apache.org/jira/browse/MESOS-3975
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
> Environment: CentOS 7.1, Kernel 3.10.0-229.20.1.el7.x86_64, gcc 
> 4.8.3, Docker 1.9
>Reporter: Till Toenshoff
>Assignee: Joris Van Remoortere
>  Labels: mesosphere

[jira] [Commented] (MESOS-3975) SSL build of mesos causes flaky testsuite.

2015-11-23 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023254#comment-15023254
 ] 

Joseph Wu commented on MESOS-3975:
--

It might also be worthwhile to check if the tests fail without {{--enable-ssl}}.

> SSL build of mesos causes flaky testsuite.
> --
>
> Key: MESOS-3975
> URL: https://issues.apache.org/jira/browse/MESOS-3975
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
> Environment: CentOS 7.1, Kernel 3.10.0-229.20.1.el7.x86_64, gcc 
> 4.8.3, Docker 1.9
>Reporter: Till Toenshoff
>Assignee: Joris Van Remoortere
>  Labels: mesosphere

[jira] [Updated] (MESOS-3753) Test the HTTP Scheduler library with SSL enabled

2015-11-20 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3753:
-
Assignee: Anand Mazumdar  (was: Joseph Wu)

Assigning to [~anandmazumdar].  

Here is the basic pattern for testing with different SSL configurations:
https://reviews.apache.org/r/40513/

You should add more tests (and possibly fix the underlying library) as 
necessary.
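
For reference, the pattern boils down to pointing libprocess at test 
certificates via the environment before the master/scheduler under test is 
launched.  A rough sketch (variable names such as {{certDir}} are 
placeholders; the environment variables are the ones documented in the SSL 
user doc):
{code}
// Must happen before libprocess initializes its SSL sockets.
os::setenv("SSL_ENABLED", "true");
os::setenv("SSL_SUPPORT_DOWNGRADE", "true");           // Toggle per test case.
os::setenv("SSL_KEY_FILE", path::join(certDir, "key.pem"));
os::setenv("SSL_CERT_FILE", path::join(certDir, "cert.pem"));
os::setenv("SSL_VERIFY_CERT", "false");                // Toggle per test case.
{code}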

> Test the HTTP Scheduler library with SSL enabled
> 
>
> Key: MESOS-3753
> URL: https://issues.apache.org/jira/browse/MESOS-3753
> Project: Mesos
>  Issue Type: Story
>  Components: framework, HTTP API, test
>Reporter: Joseph Wu
>Assignee: Anand Mazumdar
>  Labels: mesosphere, security



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3976) HTTP Scheduler Library does not work with SSL enabled

2015-11-20 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-3976:


 Summary: HTTP Scheduler Library does not work with SSL enabled
 Key: MESOS-3976
 URL: https://issues.apache.org/jira/browse/MESOS-3976
 Project: Mesos
  Issue Type: Bug
  Components: framework, HTTP API
Reporter: Joseph Wu
Assignee: Anand Mazumdar


The HTTP scheduler library does not work against Mesos when SSL is enabled 
(without downgrade).

The fix should be simple (see the sketch below):
* The library should detect if SSL is enabled.
* If SSL is enabled, connections should be made with HTTPS instead of HTTP.
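
A minimal sketch of that check (illustrative; {{USE_SSL_SOCKET}} is 
libprocess's compile-time SSL guard, and the flags accessor used here is an 
assumption about where the runtime SSL state lives):
{code}
#ifdef USE_SSL_SOCKET
  // Built with SSL: pick the scheme based on whether SSL is enabled at runtime.
  const std::string scheme =
    process::network::openssl::flags().enabled ? "https" : "http";
#else
  const std::string scheme = "http";
#endif

  process::http::URL url(
      scheme, master.address.ip, master.address.port, "/api/v1/scheduler");
{code}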



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3794) Master should not store arbitrarily sized data in ExecutorInfo

2015-11-20 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3794:
-
Assignee: (was: Joseph Wu)

> Master should not store arbitrarily sized data in ExecutorInfo
> --
>
> Key: MESOS-3794
> URL: https://issues.apache.org/jira/browse/MESOS-3794
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> From a comment in [MESOS-3771]:
> Master should not be storing the {{data}} fields from {{ExecutorInfo}}.  We 
> currently [store the entire 
> object|https://github.com/apache/mesos/blob/master/src/master/master.hpp#L262-L271],
>  which means master would be at high risk of OOM-ing if a bunch of executors 
> were started with big {{data}} blobs.
> * Master should scrub out unneeded bloat from {{ExecutorInfo}} before storing 
> it.
> * We can use an alternate internal object, like we do for {{TaskInfo}} vs 
> {{Task}}; see 
> [this|https://github.com/apache/mesos/blob/master/src/messages/messages.proto#L39-L41].
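
A hypothetical scrub helper (not existing Mesos code) illustrating the first 
option above; {{clear_data()}} is the protobuf-generated clearer for the 
{{bytes data}} field:
{code}
// Copy the ExecutorInfo and drop fields the master never needs, so a large
// `data` blob is not held in memory for the executor's whole lifetime.
ExecutorInfo scrub(const ExecutorInfo& executorInfo)
{
  ExecutorInfo result = executorInfo;
  result.clear_data();
  return result;
}
{code}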



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

