[jira] [Updated] (MESOS-5425) Consider using IntervalSet for Port range resource math

2016-07-12 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5425:
-
Sprint: Mesosphere Sprint 39

> Consider using IntervalSet for Port range resource math
> ---
>
> Key: MESOS-5425
> URL: https://issues.apache.org/jira/browse/MESOS-5425
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Joseph Wu
>Assignee: Yanyan Hu
>  Labels: allocator, mesosphere
> Fix For: 1.1.0
>
> Attachments: graycol.gif
>
>
> Follow-up JIRA for comments raised in MESOS-3051 (see comments there).
> We should consider utilizing 
> [{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp]
>  in [Port range resource 
> math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143].
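As a rough sketch of the idea (illustrative only; the exact {{Bound}}/{{Interval}} spellings should be checked against interval.hpp), port ranges could be represented and subtracted along these lines:

{code}
// Illustrative sketch only; see stout/interval.hpp for the actual API.
#include <stdint.h>

#include <stout/interval.hpp>

int main()
{
  // Model an offered port range, e.g. ports(*):[31000-32000].
  IntervalSet<uint16_t> offered;
  offered += (Bound<uint16_t>::closed(31000), Bound<uint16_t>::closed(32000));

  // Remove the sub-range a task is using.
  IntervalSet<uint16_t> used;
  used += (Bound<uint16_t>::closed(31000), Bound<uint16_t>::closed(31005));

  offered -= used;

  // 31003 was consumed above, so this returns 0; no manual range math needed.
  return offered.contains(31003) ? 1 : 0;
}
{code}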



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5756) Cmake build system needs to regenerate protobufs when they are updated.

2016-07-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5756:
-
Fix Version/s: 1.1.0

> Cmake build system needs to regenerate protobufs when they are updated.
> ---
>
> Key: MESOS-5756
> URL: https://issues.apache.org/jira/browse/MESOS-5756
> Project: Mesos
>  Issue Type: Improvement
>  Components: build, cmake
>Reporter: Joseph Wu
>Assignee: Srinivas
>Priority: Minor
>  Labels: cmake, mesosphere, newbie
> Fix For: 1.1.0
>
>
> Generated header files, such as protobufs are currently generated all at once 
> in the CMake build system:
> https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/cmake/MesosConfigure.cmake#L77-L80
> This means that if a protobuf is changed, the CMake build system will not
> regenerate the protobufs unless you delete the generated {{/include}}
> directory.
> 
> This should be a trivial fix, as the CMake protobuf functions merely need to
> depend on the input file:
> * 
> https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/src/cmake/MesosProtobuf.cmake#L67
> * 
> https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/src/cmake/MesosProtobuf.cmake#L100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2567) Python binding method 'declineOffer(offerid, filters=None)' raises exception when 'filters=None' is assigned explicitly

2016-07-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371893#comment-15371893
 ] 

Joseph Wu commented on MESOS-2567:
--

The review was discarded because it had been neglected (for over a year!).

This problem may still exist as the offending code is still present, albeit in 
a different location now:
https://github.com/apache/mesos/blob/2ed457a063d06da63750d464807799dcd7f72350/src/python/scheduler/src/mesos/scheduler/mesos_scheduler_driver_impl.cpp#L498-L503

> Python binding method 'declineOffer(offerid, filters=None)' raises exception 
> when 'filters=None' is assigned explicitly
> ---
>
> Key: MESOS-2567
> URL: https://issues.apache.org/jira/browse/MESOS-2567
> Project: Mesos
>  Issue Type: Bug
>  Components: python api
>Reporter: Yan Xu
>Assignee: haosdent
>
> {code}
> def launchTasks(self, offerIds, tasks, filters=None):  # The method's signature.
> ...
> declineOffer(offerId)  # OK to call it this way
> declineOffer(offerId, filters=None)  # Error when calling it this way
> {code}
> The error is printed from here:
> https://github.com/apache/mesos/blob/04f8302c0cf81196e33ac538710dc5f48cd809d9/src/python/native/src/mesos/native/module.hpp#L66
> {code}
> if (obj == Py_None) {
>   std::cerr << "None object given where protobuf expected" << std::endl;
>   return false;
> }
> {code}
> I think this is because, when parsing the arguments, a missing optional
> argument is interpreted as NULL while an explicit 'None' becomes Py_None, and
> we don't check for the latter properly here: 
> https://github.com/apache/mesos/blob/04f8302c0cf81196e33ac538710dc5f48cd809d9/src/python/native/src/mesos/native/mesos_scheduler_driver_impl.cpp#L632
> {code}
>   if (filtersObj != NULL) {
> if (!readPythonProtobuf(filtersObj, &filters)) {
>   PyErr_Format(PyExc_Exception,
>"Could not deserialize Python Filters");
>   return NULL;
> }
>   }
> {code}
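A minimal sketch of the kind of fix suggested above (hypothetical, not an actual patch): treat an explicit {{None}} the same as an omitted argument before trying to deserialize the protobuf:

{code}
// Hypothetical fix sketch: both NULL (argument omitted) and Py_None (an
// explicit `filters=None`) mean "no filters"; only deserialize otherwise.
if (filtersObj != NULL && filtersObj != Py_None) {
  if (!readPythonProtobuf(filtersObj, &filters)) {
    PyErr_Format(PyExc_Exception, "Could not deserialize Python Filters");
    return NULL;
  }
}
{code}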



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5740) Consider adding `relink` functionality to libprocess

2016-07-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356302#comment-15356302
 ] 

Joseph Wu edited comment on MESOS-5740 at 7/11/16 9:52 PM:
---

{code}
commit 482257a9f46ecab4f888af46cd514d29d7cfd980
Author: Joseph Wu 
Date:   Wed Jun 29 18:20:41 2016 -0700

Added test-only function for retrieving link sockets from libprocess.

This can be used in a test to "break" a socket without libprocess's
explicit knowledge.  For example, we can disable transmission on a
persistent link.  The next message sent over that link will be dropped.

Review: https://reviews.apache.org/r/49174/
{code}
{code}
commit 3ae62e27ecb83da36849f6528b997d2087d330f2
Author: Joseph Wu 
Date:   Wed Jun 29 18:20:43 2016 -0700

Added tests for libprocess linking and unlinking behavior.

Adds tests which exercise "link" semantics against remote processes.
This includes detection of `ExitedEvents` when the process exits
as well as mixing "link" semantics.

Includes a test case that emulates the failure observed in MESOS-5576.

Review: https://reviews.apache.org/r/49175/
{code}
{code}
commit 414f937f270f1aadfe239d61a40ba88d5f2f0501
Date:   Wed Jun 29 18:20:44 2016 -0700

Added "relink" semantics to ProcessBase::link.

The `RemoteConnection::RECONNECT` option for `ProcessBase::link` will
force the `SocketManager` to create a new socket if a persistent link
already exists.

Review: https://reviews.apache.org/r/49177/
{code}


was (Author: kaysoky):
{code}
commit 482257a9f46ecab4f888af46cd514d29d7cfd980
Author: Joseph Wu 
Date:   Wed Jun 29 18:20:41 2016 -0700

Added test-only function for retrieving link sockets from libprocess.

This can be used in a test to "break" a socket without libprocess's
explicit knowledge.  For example, we can disable transmission on a
persistent link.  The next message sent over that link will be dropped.

Review: https://reviews.apache.org/r/49174/
{code}
{code}
commit 3ae62e27ecb83da36849f6528b997d2087d330f2
Author: Joseph Wu 
Date:   Wed Jun 29 18:20:43 2016 -0700

Added tests for libprocess linking and unlinking behavior.

Adds tests which exercise "link" semantics against remote processes.
This includes detection of `ExitedEvents` when the process exits
as well as mixing "link" semantics.

Includes a test case that emulates the failure observed in MESOS-5576.

Review: https://reviews.apache.org/r/49175/
{code}
{code}
commit 482257a9f46ecab4f888af46cd514d29d7cfd980
Date:   Wed Jun 29 18:20:44 2016 -0700

Added "relink" semantics to ProcessBase::link.

The `RemoteConnection::RECONNECT` option for `ProcessBase::link` will
force the `SocketManager` to create a new socket if a persistent link
already exists.

Review: https://reviews.apache.org/r/49177/
{code}

> Consider adding `relink` functionality to libprocess
> 
>
> Key: MESOS-5740
> URL: https://issues.apache.org/jira/browse/MESOS-5740
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
> Fix For: 0.28.3, 1.0.0
>
>
> Currently we don't have {{relink}} functionality in libprocess, i.e. a way to
> create a new persistent connection between actors even if a connection
> already exists (see the usage sketch after this description).
> This can benefit us in a couple of ways:
> - The application may have more information on the state of a connection than 
> libprocess does, as libprocess only checks if the connection is alive or not. 
>  For example, a linkee may accept a connection, then fork, pass the 
> connection to a child, and subsequently exit.  As the connection is still 
> active, libprocess may not detect the exit.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to
> the remote instance being unavailable (e.g., partition, network
> intermediaries not sending RSTs, etc.).
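As a rough usage sketch of the {{RemoteConnection::RECONNECT}} option described in the commits above (assumed spelling; check {{ProcessBase::link}} for the exact signature):

{code}
// Sketch only: force a fresh persistent socket to a remote process,
// even if a link to it already exists.
#include <process/process.hpp>

using process::Process;
using process::UPID;

class RelinkerProcess : public Process<RelinkerProcess>
{
public:
  explicit RelinkerProcess(const UPID& _remote) : remote(_remote) {}

  void relink()
  {
    // A plain link(remote) would reuse any existing persistent socket;
    // the RECONNECT option asks the SocketManager for a new one.
    link(remote, RemoteConnection::RECONNECT);
  }

private:
  const UPID remote;
};
{code}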



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5740) Consider adding `relink` functionality to libprocess

2016-07-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5740:
-
Fix Version/s: 0.28.3

> Consider adding `relink` functionality to libprocess
> 
>
> Key: MESOS-5740
> URL: https://issues.apache.org/jira/browse/MESOS-5740
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
> Fix For: 0.28.3, 1.0.0
>
>
> Currently we don't have {{relink}} functionality in libprocess, i.e. a way to
> create a new persistent connection between actors even if a connection
> already exists.
> This can benefit us in a couple of ways:
> - The application may have more information on the state of a connection than 
> libprocess does, as libprocess only checks if the connection is alive or not. 
>  For example, a linkee may accept a connection, then fork, pass the 
> connection to a child, and subsequently exit.  As the connection is still 
> active, libprocess may not detect the exit.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to
> the remote instance being unavailable (e.g., partition, network
> intermediaries not sending RSTs, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5756) Cmake build system needs to regenerate protobufs when they are updated.

2016-07-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5756:
-
Assignee: Srinivas

> Cmake build system needs to regenerate protobufs when they are updated.
> ---
>
> Key: MESOS-5756
> URL: https://issues.apache.org/jira/browse/MESOS-5756
> Project: Mesos
>  Issue Type: Improvement
>  Components: build, cmake
>Reporter: Joseph Wu
>Assignee: Srinivas
>Priority: Minor
>  Labels: cmake, mesosphere, newbie
>
> Generated header files, such as protobufs are currently generated all at once 
> in the CMake build system:
> https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/cmake/MesosConfigure.cmake#L77-L80
> This means that if a protobuf is changed, the CMake build system will not
> regenerate the protobufs unless you delete the generated {{/include}}
> directory.
> 
> This should be a trivial fix, as the CMake protobuf functions merely need to
> depend on the input file:
> * 
> https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/src/cmake/MesosProtobuf.cmake#L67
> * 
> https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/src/cmake/MesosProtobuf.cmake#L100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5828) Modularize Network in replicated_log

2016-07-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5828:
-
Shepherd: Joseph Wu

> Modularize Network in replicated_log
> 
>
> Key: MESOS-5828
> URL: https://issues.apache.org/jira/browse/MESOS-5828
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Jay Guo
>Assignee: Jay Guo
>
> Currently replicated_log relies on Zookeeper for coordinator election. This 
> is done through network abstraction _ZookeeperNetwork_. We need to modularize 
> this part in order to enable replicated_log when using Master 
> contender/detector modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5829) Mesos should be able to consume module for replicated_log

2016-07-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5829:
-
Shepherd: Joseph Wu

> Mesos should be able to consume module for replicated_log
> -
>
> Key: MESOS-5829
> URL: https://issues.apache.org/jira/browse/MESOS-5829
> Project: Mesos
>  Issue Type: Bug
>  Components: modules, replicated log
>Reporter: Jay Guo
>Assignee: Jay Guo
>
> Currently {{--quorum}} is hardcoded to 1 if no *zk* provided, assuming 
> standalone mode, however this is not the true when using master contender and 
> detector modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5832) Mesos replicated log corruption with disconnects from ZK

2016-07-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371515#comment-15371515
 ] 

Joseph Wu commented on MESOS-5832:
--

Are you using a non-default value for {{--registrar_fetch_timeout}}?

If you could set that flag to a smaller value, say {{5secs}}, and retry your 
test, the masters will probably recover after 3 restarts (<1 minute).  We may 
have already fixed this in [MESOS-5576].

> Mesos replicated log corruption with disconnects from ZK
> 
>
> Key: MESOS-5832
> URL: https://issues.apache.org/jira/browse/MESOS-5832
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.1, 0.27.1
>Reporter: Christopher M Luciano
>
> Setup:
> I set up 5 mesos and marathon masters (which I'll refer to as m1, m2, m3, m4,
> m5) running mesos version 0.27.2 (confirmed to affect 0.25.0 also).
> I set up 5 mesos agents (which I'll refer to as a1, a2, a3, a4, a5) running
> the same mesos version as the masters.
> All of these were pointed at a single zookeeper (NOT an ensemble).
> mesos-slave and mesos-master are run by upstart, and both are configured to
> be restarted on halting/crashing.
> Procedure:
> 1) I confirm a mesos master has been elected and all agents have been
> discovered.
> 2) On the zookeeper machine, I add an IPTABLES rule which blocks all incoming
> traffic from m1 and m2.
> 3) The mesos-master processes on m1 and m2 halt and upstart restarts them.
> They are not able to communicate with zookeeper, and therefore are no longer
> considered part of the cluster.
> 4) A leader election happens (m3 is elected leader).
> 4) I shut down the mesos-slave process on a1 (note: I do an initctl stop
> mesos-slave; just killing it will cause it to be restarted).
> 5) I wait to confirm the slave is reported as down by m3.
> 6) I add IPTABLES rules on the zookeeper machine to block all incoming
> traffic from m3, m4, and m5.
> 7) I confirm that the mesos-master processes on m3, m4, and m5 have all
> halted and restarted.
> 8) I confirm that all masters report themselves as not in the cluster.
> 9) I remove the IPTABLES rule from the zookeeper machine that is blocking all
> traffic from m1 and m2.
> 10) m1 and m2 now report they are part of the cluster. There is a leader
> election and either m1 or m2 is now elected leader. NOTE: because the
> cluster does not have quorum, no agents are listed.
> 11) I shut down the mesos-slave process on a2.
> 12) In the logs of the current master, I can see this information being
> processed by the master.
> 13) I add IPTABLES rules on the zookeeper machine to block all masters.
> 14) I wait for all masters to report themselves as not being in the cluster.
> 15) I remove all IPTABLES rules on the zookeeper machine.
> 16) All masters join the cluster, and a leader election happens.
> 17) After ten minutes, the leader's mesos-master process halts, another
> leader election happens... and this repeats every 10 minutes.
> Summary:
> Here is what I think is happening in the above test case: I think that at
> the end of step 16, the masters all try to do replica log reconciliation and
> can't. I think the state of the agents isn't actually relevant - the replica
> log reconciliation causes a hang or a silent failure. After 10 minutes, it
> hits a timeout for communicating with the registry (i.e. zookeeper) - even
> though it can communicate with zookeeper, it never does because of the
> previous hanging/silent failure.
> Attached is a perl script I used on the zookeeper machine to automate the
> steps above. If you want to use it, you'll need to change the IPs set in the
> script, and make sure that one of the first two IPs is the current mesos
> master.

[jira] [Commented] (MESOS-5831) Building mesos from make.bat fails

2016-07-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371369#comment-15371369
 ] 

Joseph Wu commented on MESOS-5831:
--

What are you passing into CMake?

i.e. {{cmake .. -G "Visual Studio 14 2015 Win64" -DENABLE_LIBEVENT=1}}

Also, can you grab some of the build logs from the associated project?  The 
error summary you've posted only lists which projects have failed; there should 
be more information earlier in the logs.

> Building mesos from make.bat fails
> --
>
> Key: MESOS-5831
> URL: https://issues.apache.org/jira/browse/MESOS-5831
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.28.2
> Environment: Windows 10 64bit Visual Studio 2015 CMake 3.5.2
>Reporter: Kaloyan Kotlarski
>  Labels: build, cmake, windows
>
> After generating the build files with CMake and starting make.bat, it fails
> with a 404 error when downloading the following components:
> Build FAILED.
> "E:\Work\C++\mesos\Build\Mesos.sln" (default target) (1) ->
> "E:\Work\C++\mesos\Build\3rdparty\elfio-3.1.vcxproj.metaproj" (default 
> target) (7) ->
> "E:\Work\C++\mesos\Build\3rdparty\elfio-3.1.vcxproj" (default target) (8) ->
> (CustomBuild target) ->
>   CUSTOMBUILD : error : downloading 
> [E:\Work\C++\mesos\Build\3rdparty\elfio-3.1.vcxproj]
>   CUSTOMBUILD : The requested URL returned error : 404 Not Found 
> [E:\Work\C++\mesos\Build\3rdparty\elfio-3.1.vcxproj
> ]
> "E:\Work\C++\mesos\Build\Mesos.sln" (default target) (1) ->
> "E:\Work\C++\mesos\Build\3rdparty\http_parser-2.6.2.vcxproj.metaproj" 
> (default target) (13) ->
> "E:\Work\C++\mesos\Build\3rdparty\http_parser-2.6.2.vcxproj" (default target) 
> (14) ->
>   CUSTOMBUILD : error : downloading 
> [E:\Work\C++\mesos\Build\3rdparty\http_parser-2.6.2.vcxproj]
>   CUSTOMBUILD : The requested URL returned error : 404 Not Found 
> [E:\Work\C++\mesos\Build\3rdparty\http_parser-2.6.2
> .vcxproj]
> "E:\Work\C++\mesos\Build\Mesos.sln" (default target) (1) ->
> "E:\Work\C++\mesos\Build\src\mesos-1.1.0.vcxproj.metaproj" (default target) 
> (23) ->
> "E:\Work\C++\mesos\Build\3rdparty\nvml-352.79.vcxproj.metaproj" (default 
> target) (24) ->
> "E:\Work\C++\mesos\Build\3rdparty\nvml-352.79.vcxproj" (default target) (25) 
> ->
>   CUSTOMBUILD : error : downloading 
> [E:\Work\C++\mesos\Build\3rdparty\nvml-352.79.vcxproj]
>   CUSTOMBUILD : The requested URL returned error : 404 Not Found 
> [E:\Work\C++\mesos\Build\3rdparty\nvml-352.79.vcxpr
> oj]
> 0 Warning(s)
> 6 Error(s)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5827) Add example framework for using inverse offers

2016-07-11 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-5827:


Assignee: Joseph Wu

I wrote this a while ago and should open a review to merge it in.
https://github.com/kaysoky/InverseOfferExampleFramework/blob/master/example.cpp

This framework actually has two components: the framework itself and a script
that posts random maintenance schedules.

> Add example framework for using inverse offers
> --
>
> Key: MESOS-5827
> URL: https://issues.apache.org/jira/browse/MESOS-5827
> Project: Mesos
>  Issue Type: Task
>Reporter: Artem Harutyunyan
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: beginner
>
> We should have an example framework (in src/examples) demonstrating how to 
> handle inverse offers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5824) Include disk source information in stringification

2016-07-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371243#comment-15371243
 ] 

Joseph Wu commented on MESOS-5824:
--

Not sure why this would be backported?

If I recall correctly, the resources string was passed between versions of 
Mesos, but this should no longer be the case.  That's why I was wondering about 
backwards compatibility.

> Include disk source information in stringification
> --
>
> Key: MESOS-5824
> URL: https://issues.apache.org/jira/browse/MESOS-5824
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Affects Versions: 0.28.2
>Reporter: Tim Harper
>Priority: Minor
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: 0001-Output-disk-resource-source-information.patch
>
>
> Some frameworks (like kafka_mesos) ignore the Source field when trying to 
> reserve an offered mount or path persistent volume; the resulting error 
> message is bewildering:
> {code:none}
> Task uses more resources
> cpus(*):4; mem(*):4096; ports(*):[31000-31000]; disk(kafka, 
> kafka)[kafka_0:data]:960679
> than available
> cpus(*):32; mem(*):256819;  ports(*):[31000-32000]; disk(kafka, 
> kafka)[kafka_0:data]:960679;   disk(*):240169;
> {code}
> The stringification of disk resources should include source information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5822) Add a build script for the Windows CI

2016-07-08 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5822:


 Summary: Add a build script for the Windows CI
 Key: MESOS-5822
 URL: https://issues.apache.org/jira/browse/MESOS-5822
 Project: Mesos
  Issue Type: Improvement
  Components: build
Reporter: Joseph Wu
Assignee: Joseph Wu


The ASF CI for Mesos runs a script that lives inside the Mesos codebase:
https://github.com/apache/mesos/blob/1cbfdc3c1e4b8498a67f8531ab264003c8c19fb1/support/docker_build.sh

ASF Infrastructure has set up a machine that we can use for building Mesos on
Windows.  Considering the environment, we will need a separate build script
here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5821) Clean up the billions of compiler warnings on MSVC

2016-07-08 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5821:
-
Description: Clean builds of Mesos on Windows will result in approximately 
{{5800 Warning(s)}} or more.

> Clean up the billions of compiler warnings on MSVC
> --
>
> Key: MESOS-5821
> URL: https://issues.apache.org/jira/browse/MESOS-5821
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, slave
>
> Clean builds of Mesos on Windows will result in approximately {{5800 
> Warning(s)}} or more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5425) Consider using IntervalSet for Port range resource math

2016-07-08 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5425:
-
Shepherd: Joseph Wu
Story Points: 3
  Labels: allocator mesosphere  (was: mesosphere)

> Consider using IntervalSet for Port range resource math
> ---
>
> Key: MESOS-5425
> URL: https://issues.apache.org/jira/browse/MESOS-5425
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Joseph Wu
>Assignee: Yanyan Hu
>  Labels: allocator, mesosphere
> Attachments: graycol.gif
>
>
> Follow-up JIRA for comments raised in MESOS-3051 (see comments there).
> We should consider utilizing 
> [{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp]
>  in [Port range resource 
> math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3968) DiskQuotaTest.SlaveRecovery is flaky

2016-07-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363519#comment-15363519
 ] 

Joseph Wu commented on MESOS-3968:
--

This test has been failing more frequently (on ASF and local builds):
{code}
[ RUN  ] DiskQuotaTest.SlaveRecovery
I0706 00:02:21.991916 19907 cluster.cpp:155] Creating default 'local' authorizer
I0706 00:02:21.998934 19907 leveldb.cpp:174] Opened db in 6.606049ms
I0706 00:02:22.72 19907 leveldb.cpp:181] Compacted db in 1.093827ms
I0706 00:02:22.000119 19907 leveldb.cpp:196] Created db iterator in 19963ns
I0706 00:02:22.000128 19907 leveldb.cpp:202] Seeked to beginning of db in 1271ns
I0706 00:02:22.000131 19907 leveldb.cpp:271] Iterated through 0 keys in the db 
in 120ns
I0706 00:02:22.000169 19907 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0706 00:02:22.000886 19922 recover.cpp:451] Starting replica recovery
I0706 00:02:22.001183 19922 recover.cpp:477] Replica is in EMPTY status
I0706 00:02:22.002557 19927 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (1328)@127.0.0.1:54648
I0706 00:02:22.003260 19928 master.cpp:382] Master 
8a9140ac-c7b3-45dd-961d-aeff38eae88e (centos71) started on 127.0.0.1:54648
I0706 00:02:22.003288 19928 master.cpp:384] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http="true" --authenticate_http_frameworks="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/XoD9Xk/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/XoD9Xk/master" --zk_session_timeout="10secs"
W0706 00:02:22.003545 19928 master.cpp:387] 
**
Master bound to loopback interface! Cannot communicate with remote schedulers 
or agents. You might want to set '--ip' flag to a routable IP address.
**
I0706 00:02:22.003564 19928 master.cpp:434] Master only allowing authenticated 
frameworks to register
I0706 00:02:22.003569 19928 master.cpp:448] Master only allowing authenticated 
agents to register
I0706 00:02:22.003573 19928 master.cpp:461] Master only allowing authenticated 
HTTP frameworks to register
I0706 00:02:22.003577 19928 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/XoD9Xk/credentials'
I0706 00:02:22.003829 19928 master.cpp:506] Using default 'crammd5' 
authenticator
I0706 00:02:22.003933 19928 master.cpp:578] Using default 'basic' HTTP 
authenticator
I0706 00:02:22.004132 19923 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I0706 00:02:22.004261 19928 master.cpp:658] Using default 'basic' HTTP 
framework authenticator
I0706 00:02:22.004370 19928 master.cpp:705] Authorization enabled
I0706 00:02:22.004560 19923 recover.cpp:568] Updating replica status to STARTING
I0706 00:02:22.006342 19922 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.264999ms
I0706 00:02:22.006395 19922 replica.cpp:320] Persisted replica status to 
STARTING
I0706 00:02:22.006669 19924 recover.cpp:477] Replica is in STARTING status
I0706 00:02:22.008113 19922 master.cpp:1972] The newly elected leader is 
master@127.0.0.1:54648 with id 8a9140ac-c7b3-45dd-961d-aeff38eae88e
I0706 00:02:22.008174 19922 master.cpp:1985] Elected as the leading master!
I0706 00:02:22.008215 19922 master.cpp:1672] Recovering from registrar
I0706 00:02:22.008404 19923 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (1331)@127.0.0.1:54648
I0706 00:02:22.008600 19925 registrar.cpp:332] Recovering registrar
I0706 00:02:22.009078 19928 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I0706 00:02:22.009968 19928 recover.cpp:568] Updating replica status to VOTING
I0706 00:02:22.011096 19928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 923116ns
I0706 00:02:22.011121 19928 replica.cpp:320] Persisted replica status to VOTING
I0706 00:02:22.011214 19928 recover.cpp:582] Successfully joined the Paxos group
I0706 00:02:22.011320 19928 recover.cpp:466] Recover process terminated
I

[jira] [Comment Edited] (MESOS-5759) ProcessRemoteLinkTest.RemoteUseStaleLink and RemoteStaleLinkRelink are flaky

2016-07-01 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359849#comment-15359849
 ] 

Joseph Wu edited comment on MESOS-5759 at 7/2/16 12:04 AM:
---

Disabled the tests for now, as this fix is pretty big.

Fix for the problem observed on ASF builds:
https://reviews.apache.org/r/49543/


was (Author: kaysoky):
Fix for the problem observed on ASF builds:
https://reviews.apache.org/r/49543/

> ProcessRemoteLinkTest.RemoteUseStaleLink and RemoteStaleLinkRelink are flaky
> 
>
> Key: MESOS-5759
> URL: https://issues.apache.org/jira/browse/MESOS-5759
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Affects Versions: 1.0.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
>
> {{ProcessRemoteLinkTest.RemoteUseStaleLink}} and 
> {{ProcessRemoteLinkTest.RemoteStaleLinkRelink}} are failing occasionally with 
> the error:
> {code}
> [ RUN  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0630 07:42:34.661110 1 process.cpp:1066] libprocess is initialized on 
> 172.17.0.2:56294 with 16 worker threads
> E0630 07:42:34.666393 18765 process.cpp:2104] Failed to shutdown socket with 
> fd 7: Transport endpoint is not connected
> /mesos/3rdparty/libprocess/src/tests/process_tests.cpp:1059: Failure
> Value of: exitedPid.isPending()
>   Actual: false
> Expected: true
> [  FAILED  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink (56 ms)
> {code}
> There appears to be a race between establishing a socket connection and the 
> test calling {{::shutdown}} on the socket.  Under some circumstances, the 
> {{::shutdown}} may actually result in failing the future in 
> {{SocketManager::link_connect}} error and thereby trigger 
> {{SocketManager::close}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5759) ProcessRemoteLinkTest.RemoteUseStaleLink and RemoteStaleLinkRelink are flaky

2016-07-01 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359482#comment-15359482
 ] 

Joseph Wu commented on MESOS-5759:
--

Any logs?  (Preferably {{GLOG_v=2}})

> ProcessRemoteLinkTest.RemoteUseStaleLink and RemoteStaleLinkRelink are flaky
> 
>
> Key: MESOS-5759
> URL: https://issues.apache.org/jira/browse/MESOS-5759
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Affects Versions: 1.0.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
>
> {{ProcessRemoteLinkTest.RemoteUseStaleLink}} and 
> {{ProcessRemoteLinkTest.RemoteStaleLinkRelink}} are failing occasionally with 
> the error:
> {code}
> [ RUN  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0630 07:42:34.661110 1 process.cpp:1066] libprocess is initialized on 
> 172.17.0.2:56294 with 16 worker threads
> E0630 07:42:34.666393 18765 process.cpp:2104] Failed to shutdown socket with 
> fd 7: Transport endpoint is not connected
> /mesos/3rdparty/libprocess/src/tests/process_tests.cpp:1059: Failure
> Value of: exitedPid.isPending()
>   Actual: false
> Expected: true
> [  FAILED  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink (56 ms)
> {code}
> There appears to be a race between establishing a socket connection and the 
> test calling {{::shutdown}} on the socket.  Under some circumstances, the 
> {{::shutdown}} may actually result in failing the future in 
> {{SocketManager::link_connect}} error and thereby trigger 
> {{SocketManager::close}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5759) ProcessRemoteLinkTest.RemoteUseStaleLink and RemoteStaleLinkRelink are flaky

2016-06-30 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5759:


 Summary: ProcessRemoteLinkTest.RemoteUseStaleLink and 
RemoteStaleLinkRelink are flaky
 Key: MESOS-5759
 URL: https://issues.apache.org/jira/browse/MESOS-5759
 Project: Mesos
  Issue Type: Bug
  Components: libprocess, test
Affects Versions: 1.0.0
Reporter: Joseph Wu
Assignee: Joseph Wu


{{ProcessRemoteLinkTest.RemoteUseStaleLink}} and 
{{ProcessRemoteLinkTest.RemoteStaleLinkRelink}} are failing occasionally with 
the error:
{code}
[ RUN  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0630 07:42:34.661110 1 process.cpp:1066] libprocess is initialized on 
172.17.0.2:56294 with 16 worker threads
E0630 07:42:34.666393 18765 process.cpp:2104] Failed to shutdown socket with fd 
7: Transport endpoint is not connected
/mesos/3rdparty/libprocess/src/tests/process_tests.cpp:1059: Failure
Value of: exitedPid.isPending()
  Actual: false
Expected: true
[  FAILED  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink (56 ms)
{code}

There appears to be a race between establishing a socket connection and the 
test calling {{::shutdown}} on the socket.  Under some circumstances, the 
{{::shutdown}} may actually result in failing the future in 
{{SocketManager::link_connect}} error and thereby trigger 
{{SocketManager::close}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3541) Add CMakeLists that builds the Mesos master

2016-06-30 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3541:
-
Shepherd: Joseph Wu  (was: Joris Van Remoortere)

> Add CMakeLists that builds the Mesos master
> ---
>
> Key: MESOS-3541
> URL: https://issues.apache.org/jira/browse/MESOS-3541
> Project: Mesos
>  Issue Type: Task
>  Components: cmake
>Reporter: Alex Clemmer
>Assignee: Srinivas
>  Labels: build, cmake, mesosphere
>
> Right now CMake builds only the agent. We want it to also build the master as 
> part of the libmesos binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5756) Cmake build system needs to regenerate protobufs when they are updated.

2016-06-30 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5756:


 Summary: Cmake build system needs to regenerate protobufs when 
they are updated.
 Key: MESOS-5756
 URL: https://issues.apache.org/jira/browse/MESOS-5756
 Project: Mesos
  Issue Type: Improvement
  Components: build, cmake
Reporter: Joseph Wu
Priority: Minor


Generated header files, such as protobufs are currently generated all at once 
in the CMake build system:
https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/cmake/MesosConfigure.cmake#L77-L80

This means that if a protobuf is changed, the CMake build system will not
regenerate the protobufs unless you delete the generated {{/include}} directory.

This should be a trivial fix, as the CMake protobuf functions merely need to
depend on the input file:
* 
https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/src/cmake/MesosProtobuf.cmake#L67
* 
https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/src/cmake/MesosProtobuf.cmake#L100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5754) CommandInfo.user not honored in docker containerizer

2016-06-30 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357994#comment-15357994
 ] 

Joseph Wu commented on MESOS-5754:
--

I'd be curious if this has affected any users negatively.  If users have not 
noticed this, then they may be inadvertently relying on the incorrect behavior 
(of always running docker tasks as root).

The workaround is to specify a CLI parameter: 
https://github.com/apache/mesos/blob/db8b0f16c1c8c6e683a4b788262f307a8bc218e0/include/mesos/v1/mesos.proto#L1826-L1830
i.e.
{code}
"container" : {
  ...,
  "docker" : {
...,
"parameters" : [{
  "key": "user",
  "value": "not-root"
}]
  }
}
{code}
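For a C++ framework, the equivalent workaround looks roughly like this (a sketch only; field names per ContainerInfo.DockerInfo in mesos.proto):

{code}
// Sketch: pass `--user=not-root` through to `docker run` via the DockerInfo
// "parameters" field (same effect as the JSON above).
#include <mesos/mesos.hpp>

mesos::ContainerInfo createDockerContainer()
{
  mesos::ContainerInfo container;
  container.set_type(mesos::ContainerInfo::DOCKER);

  mesos::ContainerInfo::DockerInfo* docker = container.mutable_docker();
  docker->set_image("alpine");

  mesos::Parameter* parameter = docker->add_parameters();
  parameter->set_key("user");
  parameter->set_value("not-root");

  return container;
}
{code}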

> CommandInfo.user not honored in docker containerizer
> 
>
> Key: MESOS-5754
> URL: https://issues.apache.org/jira/browse/MESOS-5754
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Michael Gummelt
>
> Repro by creating a framework that starts a task with CommandInfo.user set, 
> and observe that the dockerized executor is still running as the default 
> (e.g. root).
> cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3541) Add CMakeLists that builds the Mesos master

2016-06-30 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3541:
-
Assignee: Srinivas  (was: Alex Clemmer)

> Add CMakeLists that builds the Mesos master
> ---
>
> Key: MESOS-3541
> URL: https://issues.apache.org/jira/browse/MESOS-3541
> Project: Mesos
>  Issue Type: Task
>  Components: cmake
>Reporter: Alex Clemmer
>Assignee: Srinivas
>  Labels: build, cmake, mesosphere
>
> Right now CMake builds only the agent. We want it to also build the master as 
> part of the libmesos binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5433) Add 'distcheck' target to CMake build

2016-06-30 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5433:
-
Component/s: cmake

> Add 'distcheck' target to CMake build
> -
>
> Key: MESOS-5433
> URL: https://issues.apache.org/jira/browse/MESOS-5433
> Project: Mesos
>  Issue Type: Improvement
>  Components: cmake
>Reporter: Juan Larriba
>Assignee: Juan Larriba
>
> We should add a "distcheck" target to the makefiles created by the CMake
> configuration.
> This way, the full test battery can be executed from the CMake-generated build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5723) SSL-enabled libprocess will leak incoming links to forks

2016-06-30 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5723:
-
Fix Version/s: 0.27.4
   0.28.3

> SSL-enabled libprocess will leak incoming links to forks
> 
>
> Key: MESOS-5723
> URL: https://issues.apache.org/jira/browse/MESOS-5723
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: libprocess, mesosphere, ssl
> Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> Encountered two different buggy behaviors that can be tracked down to the 
> same underlying problem.
> Repro #1 (non-crashy):
> (1) Start a master.  Doesn't matter if SSL is enabled or not.
> (2) Start an agent, with SSL enabled.  Downgrade support has the same 
> problem.  The master/agent {{link}} to one another.
> (3) Run a sleep task.  Keep this alive.  If you inspect FDs at this point, 
> you'll notice the task has inherited the {{link}} FD (master -> agent).
> (4) Restart the agent.  Due to (3), the master's {{link}} stays open.
> (5) Check master's logs for the agent's re-registration message.
> (6) Check the agent's logs for re-registration.  The message will not appear. 
>  The master is actually using the old {{link}} which is not connected to the 
> agent.
> 
> Repro #2 (crashy):
> (1) Start a master.  Doesn't matter if SSL is enabled or not.
> (2) Start an agent, with SSL enabled.  Downgrade support has the same problem.
> (3) Run ~100 sleep tasks one after the other and keep them all alive.  Each
> task links back to the agent.  Due to an FD leak, each task will inherit the
> incoming links from all other actors...
> (4) At some point, the agent will run out of FDs and the kernel will panic.
> 
> It appears that the SSL socket {{accept}} call is missing {{os::nonblock}} 
> and {{os::cloexec}} calls:
> https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L794-L806
> For reference, here's {{poll}} socket's {{accept}}:
> https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/poll_socket.cpp#L53-L75
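For illustration, the pattern the accepted SSL socket is missing boils down to something like this (a sketch using the stout helpers named above, not the actual libevent_ssl_socket code):

{code}
// Sketch: after accept(), the new fd should be made non-blocking and
// close-on-exec so that forked tasks cannot inherit it.
#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

Try<Nothing> prepareAccepted(int fd)
{
  Try<Nothing> nonblock = os::nonblock(fd);
  if (nonblock.isError()) {
    return Error("Failed to set O_NONBLOCK: " + nonblock.error());
  }

  Try<Nothing> cloexec = os::cloexec(fd);
  if (cloexec.isError()) {
    return Error("Failed to set FD_CLOEXEC: " + cloexec.error());
  }

  return Nothing();
}
{code}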



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5748) Potential segfault in `link` and `send` when linking to a remote process

2016-06-30 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5748:
-
Fix Version/s: 0.27.4
   0.28.3

> Potential segfault in `link` and `send` when linking to a remote process
> 
>
> Key: MESOS-5748
> URL: https://issues.apache.org/jira/browse/MESOS-5748
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
> Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> There is a race in the SocketManager, between a remote {{link}} and 
> disconnection of the underlying socket.
> We potentially segfault here: 
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
> {{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
> object.  However, the code above this line actually has ownership of the 
> pointer:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
> If the socket dies during the link, the {{ignore_recv_data}} may delete the 
> Socket underneath {{link}}:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
> 
> The same race exists for {{send}}.
> This race was discovered while running a new test in repetition:
> https://reviews.apache.org/r/49175/
> On OSX, I hit the race consistently every 500-800 repetitions:
> {code}
> 3rdparty/libprocess/libprocess-tests 
> --gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
> --gtest_repeat=1000
> {code}
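A simplified illustration of the race in plain C++ (not the actual SocketManager code): keeping a raw pointer into a map whose entry another thread may erase is a use-after-free, while copying the shared ownership out under the lock keeps the object alive:

{code}
// Simplified illustration of the race and the fix pattern; not libprocess code.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Socket { void send(const std::string& data) { /* ... */ } };

std::mutex mutex;
std::map<int, std::shared_ptr<Socket>> sockets;

void send_unsafe(int id, const std::string& data)
{
  Socket* socket = nullptr;
  {
    std::lock_guard<std::mutex> lock(mutex);
    socket = sockets.at(id).get();  // Raw pointer into the map.
  }
  // Another thread may erase sockets[id] here and delete the Socket.
  socket->send(data);               // Potential use-after-free.
}

void send_safe(int id, const std::string& data)
{
  std::shared_ptr<Socket> socket;
  {
    std::lock_guard<std::mutex> lock(mutex);
    socket = sockets.at(id);        // The copy keeps the Socket alive.
  }
  socket->send(data);               // Safe even if the map entry is erased.
}
{code}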



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5748) Potential segfault in `link` and `send` when linking to a remote process

2016-06-29 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356216#comment-15356216
 ] 

Joseph Wu edited comment on MESOS-5748 at 6/30/16 1:11 AM:
---

Backport-able fix: https://reviews.apache.org/r/49416/
Complete fix + cleanup: https://reviews.apache.org/r/49404/


was (Author: kaysoky):
A fix: https://reviews.apache.org/r/49404/

> Potential segfault in `link` and `send` when linking to a remote process
> 
>
> Key: MESOS-5748
> URL: https://issues.apache.org/jira/browse/MESOS-5748
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
> Fix For: 1.0.0
>
>
> There is a race in the SocketManager, between a remote {{link}} and 
> disconnection of the underlying socket.
> We potentially segfault here: 
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
> {{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
> object.  However, the code above this line actually has ownership of the 
> pointer:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
> If the socket dies during the link, the {{ignore_recv_data}} may delete the 
> Socket underneath {{link}}:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
> 
> The same race exists for {{send}}.
> This race was discovered while running a new test in repetition:
> https://reviews.apache.org/r/49175/
> On OSX, I hit the race consistently every 500-800 repetitions:
> {code}
> 3rdparty/libprocess/libprocess-tests 
> --gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
> --gtest_repeat=1000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5748) Potential segfault in `link` and `send` when linking to a remote process

2016-06-29 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-5748:


Assignee: Joseph Wu

> Potential segfault in `link` and `send` when linking to a remote process
> 
>
> Key: MESOS-5748
> URL: https://issues.apache.org/jira/browse/MESOS-5748
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
> Fix For: 1.0.0
>
>
> There is a race in the SocketManager, between a remote {{link}} and 
> disconnection of the underlying socket.
> We potentially segfault here: 
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
> {{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
> object.  However, the code above this line actually has ownership of the 
> pointer:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
> If the socket dies during the link, the {{ignore_recv_data}} may delete the 
> Socket underneath {{link}}:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
> 
> The same race exists for {{send}}.
> This race was discovered while running a new test in repetition:
> https://reviews.apache.org/r/49175/
> On OSX, I hit the race consistently every 500-800 repetitions:
> {code}
> 3rdparty/libprocess/libprocess-tests 
> --gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
> --gtest_repeat=1000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5748) Potential segfault in `link` and `send` when linking to a remote process

2016-06-29 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5748:
-
Description: 
There is a race in the SocketManager, between a remote {{link}} and 
disconnection of the underlying socket.

We potentially segfault here: 
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512

{{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
object.  However, the code above this line actually has ownership of the 
pointer:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499

If the socket dies during the link, the {{ignore_recv_data}} may delete the 
Socket underneath {{link}}:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411


The same race exists for {{send}}.

This race was discovered while running a new test in repetition:
https://reviews.apache.org/r/49175/

On OSX, I hit the race consistently every 500-800 repetitions:
{code}
3rdparty/libprocess/libprocess-tests 
--gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
--gtest_repeat=1000
{code}

  was:
There is a race the SocketManager, between a remote {{link}} and disconnection 
of the underlying socket.

We potentially segfault here: 
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512

{{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
object.  However, the code above this line actually has ownership of the 
pointer:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499

If the socket dies during the link, the {{ignore_recv_data}} may delete the 
Socket underneath {{link}}:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411


The same race exists for {{send}}.

This race was discovered while running a new test in repetition:
https://reviews.apache.org/r/49175/

On OSX, I hit the race consistently every 500-800 repetitions:
{code}
3rdparty/libprocess/libprocess-tests 
--gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
--gtest_repeat=1000
{code}


> Potential segfault in `link` and `send` when linking to a remote process
> 
>
> Key: MESOS-5748
> URL: https://issues.apache.org/jira/browse/MESOS-5748
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>  Labels: libprocess, mesosphere
> Fix For: 1.0.0
>
>
> There is a race in the SocketManager, between a remote {{link}} and 
> disconnection of the underlying socket.
> We potentially segfault here: 
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
> {{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
> object.  However, the code above this line actually has ownership of the 
> pointer:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
> If the socket dies during the link, the {{ignore_recv_data}} may delete the 
> Socket underneath {{link}}:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
> 
> The same race exists for {{send}}.
> This race was discovered while running a new test in repetition:
> https://reviews.apache.org/r/49175/
> On OSX, I hit the race consistently every 500-800 repetitions:
> {code}
> 3rdparty/libprocess/libprocess-tests 
> --gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
> --gtest_repeat=1000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5748) Potential segfault in `link` and `send` when linking to a remote process

2016-06-29 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5748:


 Summary: Potential segfault in `link` and `send` when linking to a 
remote process
 Key: MESOS-5748
 URL: https://issues.apache.org/jira/browse/MESOS-5748
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 0.28.0, 0.27.0, 0.26.0, 0.25.0, 0.24.0, 0.23.0, 0.22.0
Reporter: Joseph Wu
 Fix For: 1.0.0


There is a race the SocketManager, between a remote {{link}} and disconnection 
of the underlying socket.

We potentially segfault here: 
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512

{{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} 
object.  However, the code above this line actually has ownership of the 
pointer:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499

If the socket dies during the link, the {{ignore_recv_data}} may delete the 
Socket underneath {{link}}:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411


The same race exists for {{send}}.

This race was discovered while running a new test in repetition:
https://reviews.apache.org/r/49175/

On OSX, I hit the race consistently every 500-800 repetitions:
{code}
3rdparty/libprocess/libprocess-tests 
--gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure 
--gtest_repeat=1000
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu edited comment on MESOS-5576 at 6/29/16 2:27 AM:
---

After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

Review based on [MESOS-5740]: https://reviews.apache.org/r/49346/


was (Author: kaysoky):
After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

See: [MESOS-5740]

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5740) Consider adding `relink` functionality to libprocess

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354394#comment-15354394
 ] 

Joseph Wu edited comment on MESOS-5740 at 6/29/16 2:27 AM:
---

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |


was (Author: kaysoky):
|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |
| https://reviews.apache.org/r/49346/ | Network.hpp + relink |

> Consider adding `relink` functionality to libprocess
> 
>
> Key: MESOS-5740
> URL: https://issues.apache.org/jira/browse/MESOS-5740
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
>
> Currently we don't have the {{relink}} functionality in libprocess.  i.e. A 
> way to create a new persistent connection between actors, even if a 
> connection already exists. 
> This can benefit us in a couple of ways:
> - The application may have more information on the state of a connection than 
> libprocess does, as libprocess only checks if the connection is alive or not. 
>  For example, a linkee may accept a connection, then fork, pass the 
> connection to a child, and subsequently exit.  As the connection is still 
> active, libprocess may not detect the exit.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
> the remote instance being unavailable (e.g., partition, network 
> intermediaries not sending RST's etc). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu edited comment on MESOS-5576 at 6/29/16 2:26 AM:
---

After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regards to how other processes are interacting inside 
libprocess.

See: [MESOS-5740]


was (Author: kaysoky):
After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regards to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |
| https://reviews.apache.org/r/49346/ | Network.hpp + relink |

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5740) Consider adding `relink` functionality to libprocess

2016-06-28 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5740:


 Summary: Consider adding `relink` functionality to libprocess
 Key: MESOS-5740
 URL: https://issues.apache.org/jira/browse/MESOS-5740
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joseph Wu
Assignee: Joseph Wu


Currently we don't have {{relink}} functionality in libprocess, i.e. a way to 
create a new persistent connection between actors, even if a connection 
already exists (see the sketch below). 

This can benefit us in a couple of ways:
- The application may have more information on the state of a connection than 
libprocess does, as libprocess only checks if the connection is alive or not.  
For example, a linkee may accept a connection, then fork, pass the connection 
to a child, and subsequently exit.  As the connection is still active, 
libprocess may not detect the exit.
- Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
the remote instance being unavailable (e.g., partition, network intermediaries 
not sending RST's etc). 
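
To make the proposal concrete, here is a rough sketch of how a caller might use 
such a primitive.  The {{relink}} call below is hypothetical (it does not exist 
in libprocess today); {{link}} and {{initialize}} are the existing APIs.

{code}
// Hypothetical sketch only: `relink` is the *proposed* primitive and does not
// exist in libprocess at the time of writing.
#include <process/pid.hpp>
#include <process/process.hpp>

class MonitorProcess : public process::Process<MonitorProcess>
{
public:
  explicit MonitorProcess(const process::UPID& peer) : peer_(peer) {}

  void initialize() override
  {
    // Existing behavior: link() is effectively a no-op if a (possibly stale)
    // link to `peer_` already exists.
    link(peer_);
  }

  // Called when the application layer learns the socket is broken even though
  // libprocess still considers the link alive (e.g. the peer forked and the
  // child kept the old connection open).
  void peerRestarted()
  {
    // Proposed: force a brand-new persistent socket to `peer_`, regardless of
    // any existing link held by this or other actors.
    // relink(peer_);
  }

private:
  process::UPID peer_;
};
{code}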




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5364) Consider adding `unlink` functionality to libprocess

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353931#comment-15353931
 ] 

Joseph Wu commented on MESOS-5364:
--

Note: Due to a similar issue with stale links, we will be introducing "relink" 
semantics to libprocess.  Relinking provides better guarantees than "unlinking" 
because the application is guaranteed to have a new socket connection, 
regardless of other linkers.

Here's an example of how relink is used: https://reviews.apache.org/r/49346/

> Consider adding `unlink` functionality to libprocess
> 
>
> Key: MESOS-5364
> URL: https://issues.apache.org/jira/browse/MESOS-5364
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>  Labels: libprocess, mesosphere
>
> Currently we don't have the {{unlink}} functionality in libprocess i.e. 
> Erlang's equivalent of http://erlang.org/doc/man/erlang.html#unlink-1. We 
> have a lot of places in our current code with {{TODO's}} for implementing it.
> It can benefit us in a couple of ways:
> - Based on the business logic of the actor, it would want to authoritatively 
> communicate that it is no longer interested in {{ExitedEvent}} for the 
> external remote link.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
> the remote instance being unavailable (e.g., partition, network 
> intermediaries not sending RST's etc). 
> I did not find any old JIRA's pertaining to this but I did come across an 
> initial attempt to add this though albeit for injecting {{exited}} events as 
> part of the initial review for MESOS-1059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu edited comment on MESOS-5576 at 6/28/16 11:33 PM:


After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regards to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |
| https://reviews.apache.org/r/49346/ | Network.hpp + relink |


was (Author: kaysoky):
After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regards to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49176/ | Network::remove unused |
| https://reviews.apache.org/r/49177/ | Implement "relink" |

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-06-28 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4609:
-
Fix Version/s: (was: 1.0.0)

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> 
> exit | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | 
> Nothing happens (?) |
> (?) = means this is usually the case, but not 100%.
> A complete fix would look something like the following (a sketch of the 
> first item appears after this list):
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.
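> 
> A minimal sketch of the first item above, assuming the environment is held in 
> a plain {{std::map}} before being handed to {{subprocess}} (illustrative only, 
> not the actual {{subprocess}} signature):
> {code}
> // Illustrative only: strip the libprocess port variable that would make a
> // child accidentally bind to the parent's port.
> #include <map>
> #include <string>
>
> std::map<std::string, std::string> sanitizeEnvironment(
>     std::map<std::string, std::string> env)
> {
>   // A libprocess child inheriting the parent's port would fail to bind;
>   // a non-libprocess child simply does not need this variable.
>   env.erase("LIBPROCESS_PORT");
>   return env;
> }
> {code}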



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5729) Consider allowing the libprocess caller an option to not set CLOEXEC on libprocess sockets

2016-06-27 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5729:


 Summary: Consider allowing the libprocess caller an option to not 
set CLOEXEC on libprocess sockets
 Key: MESOS-5729
 URL: https://issues.apache.org/jira/browse/MESOS-5729
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joseph Wu


Both implementations of libprocess's {{Socket}} interface will set the 
{{CLOEXEC}} option on all new sockets (incoming or outgoing).  This assumption 
is pervasive across Mesos, but since libprocess aims to be a general-purpose 
library, the caller should be able to *not* {{CLOEXEC}} sockets when desired.

See TODOs added here: https://reviews.apache.org/r/49281/
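
As a purely hypothetical sketch of the caller-facing knob (the {{Socket}} 
interface described above has no such option today), the underlying mechanics 
amount to making the {{FD_CLOEXEC}} step conditional:

{code}
// Hypothetical sketch: a socket-creation helper where close-on-exec is
// opt-out. This is not an existing libprocess API.
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

int createSocket(bool cloexec = true)
{
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }

  if (cloexec) {
    // Mark the descriptor close-on-exec only when the caller asks for it.
    int flags = ::fcntl(fd, F_GETFD);
    if (flags < 0 || ::fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
      ::close(fd);
      return -1;
    }
  }

  return fd;
}
{code}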



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5723) SSL-enabled libprocess will leak incoming links to forks

2016-06-27 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5723:
-
Shepherd: Joris Van Remoortere

> SSL-enabled libprocess will leak incoming links to forks
> 
>
> Key: MESOS-5723
> URL: https://issues.apache.org/jira/browse/MESOS-5723
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: libprocess, mesosphere, ssl
> Fix For: 1.0.0
>
>
> Encountered two different buggy behaviors that can be tracked down to the 
> same underlying problem.
> Repro #1 (non-crashy):
> (1) Start a master.  Doesn't matter if SSL is enabled or not.
> (2) Start an agent, with SSL enabled.  Downgrade support has the same 
> problem.  The master/agent {{link}} to one another.
> (3) Run a sleep task.  Keep this alive.  If you inspect FDs at this point, 
> you'll notice the task has inherited the {{link}} FD (master -> agent).
> (4) Restart the agent.  Due to (3), the master's {{link}} stays open.
> (5) Check master's logs for the agent's re-registration message.
> (6) Check the agent's logs for re-registration.  The message will not appear. 
>  The master is actually using the old {{link}} which is not connected to the 
> agent.
> 
> Repro #2 (crashy):
> (1) Start a master.  Doesn't matter if SSL is enabled or not.
> (2) Start an agent, with SSL enabled.  Downgrade support has the same problem.
> (3) Run ~100 sleep tasks one after the other, keeping them all alive.  Each 
> task links back to the agent.  Due to an FD leak, each task will inherit the 
> incoming links from all other actors...
> (4) At some point, the agent will run out of FDs and kernel panic.
> 
> It appears that the SSL socket {{accept}} call is missing {{os::nonblock}} 
> and {{os::cloexec}} calls:
> https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L794-L806
> For reference, here's {{poll}} socket's {{accept}}:
> https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/poll_socket.cpp#L53-L75



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5723) SSL-enabled libprocess will leak incoming links to forks

2016-06-27 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5723:


 Summary: SSL-enabled libprocess will leak incoming links to forks
 Key: MESOS-5723
 URL: https://issues.apache.org/jira/browse/MESOS-5723
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 0.28.0, 0.27.0, 0.26.0, 0.25.0, 0.24.0
Reporter: Joseph Wu
Assignee: Joseph Wu
Priority: Blocker
 Fix For: 1.0.0


Encountered two different buggy behaviors that can be tracked down to the same 
underlying problem.

Repro #1 (non-crashy):
(1) Start a master.  Doesn't matter if SSL is enabled or not.
(2) Start an agent, with SSL enabled.  Downgrade support has the same problem.  
The master/agent {{link}} to one another.
(3) Run a sleep task.  Keep this alive.  If you inspect FDs at this point, 
you'll notice the task has inherited the {{link}} FD (master -> agent).
(4) Restart the agent.  Due to (3), the master's {{link}} stays open.
(5) Check master's logs for the agent's re-registration message.
(6) Check the agent's logs for re-registration.  The message will not appear.  
The master is actually using the old {{link}} which is not connected to the 
agent.



Repro #2 (crashy):
(1) Start a master.  Doesn't matter if SSL is enabled or not.
(2) Start an agent, with SSL enabled.  Downgrade support has the same problem.
(3) Run ~100 sleep tasks one after the other, keeping them all alive.  Each task 
links back to the agent.  Due to an FD leak, each task will inherit the 
incoming links from all other actors...
(4) At some point, the agent will run out of FDs and kernel panic.



It appears that the SSL socket {{accept}} call is missing {{os::nonblock}} and 
{{os::cloexec}} calls:
https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L794-L806

For reference, here's {{poll}} socket's {{accept}}:
https://github.com/apache/mesos/blob/4b91d936f50885b6a66277e26ea3c32fe942cf1a/3rdparty/libprocess/src/poll_socket.cpp#L53-L75
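
A sketch of the likely fix, mirroring what the poll-based accept path above 
does (error handling abbreviated; {{os::nonblock}} and {{os::cloexec}} are the 
existing stout helpers):

{code}
// Sketch: after accept() hands back a new fd, mark it non-blocking and
// close-on-exec so forked children (executors) do not inherit it.
#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

Try<Nothing> prepareAccepted(int fd)
{
  Try<Nothing> nonblock = os::nonblock(fd);
  if (nonblock.isError()) {
    os::close(fd);
    return Error("Failed to set non-blocking: " + nonblock.error());
  }

  Try<Nothing> cloexec = os::cloexec(fd);
  if (cloexec.isError()) {
    os::close(fd);
    return Error("Failed to set close-on-exec: " + cloexec.error());
  }

  return Nothing();
}
{code}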




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5713) Add a __sockets__ diagnostic endpoint to libprocess.

2016-06-24 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5713:


 Summary: Add a __sockets__ diagnostic endpoint to libprocess.
 Key: MESOS-5713
 URL: https://issues.apache.org/jira/browse/MESOS-5713
 Project: Mesos
  Issue Type: Wish
  Components: libprocess
Reporter: Joseph Wu


Libprocess exposes an endpoint, {{/__processes__}}, which displays some info on 
the existing actors and the messages queued up on each.

It would be nice to inspect the state of libprocess's {{SocketManager}} too.  
This could be an endpoint like {{/__sockets__}} that exposes information such 
as the following (see the sketch after this list):
* Inbound FDs: type and source
* Outbound FDs: type and source
* Temporary and persistent sockets
* Linkers and linkees.
* Outgoing messages and their associated socket
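
As a rough sketch of what a response might contain, built with stout's JSON 
helpers (all field names here are made up for illustration; no such endpoint 
exists yet):

{code}
// Rough sketch of the kind of JSON a /__sockets__ endpoint could return.
#include <iostream>

#include <stout/json.hpp>

int main()
{
  JSON::Object socket;
  socket.values["fd"] = 27;
  socket.values["kind"] = "persistent";        // vs. "temporary"
  socket.values["direction"] = "outbound";     // vs. "inbound"
  socket.values["peer"] = "log-replica@10.0.0.5:5050";

  JSON::Array sockets;
  sockets.values.push_back(socket);

  JSON::Object response;
  response.values["sockets"] = sockets;

  std::cout << response << std::endl;
  return 0;
}
{code}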



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-23 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu commented on MESOS-5576:
--

After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regards to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49176/ | Network::remove unused |
| https://reviews.apache.org/r/49177/ | Implement "relink" |

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5691) SSL downgrade support will leak sockets in CLOSE_WAIT status

2016-06-22 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5691:
-
Affects Version/s: 0.25.0
   0.26.0
   0.27.0
   0.28.0

> SSL downgrade support will leak sockets in CLOSE_WAIT status
> 
>
> Key: MESOS-5691
> URL: https://issues.apache.org/jira/browse/MESOS-5691
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: libprocess, mesosphere
> Fix For: 1.0.0
>
>
> Repro steps:
> 1) Start a master:
> {code}
> bin/mesos-master.sh --work_dir=/tmp/master
> {code}
> 2) Start an agent with SSL and downgrade enabled:
> {code}
> # Taken from http://mesos.apache.org/documentation/latest/ssl/
> openssl genrsa -des3 -f4 -passout pass:some_password -out key.pem 4096
> openssl req -new -x509 -passin pass:some_password -days 365 -key key.pem -out 
> cert.pem
> SSL_KEY_FILE=key.pem SSL_CERT_FILE=cert.pem SSL_ENABLED=true 
> SSL_SUPPORT_DOWNGRADE=true sudo -E bin/mesos-agent.sh --master=localhost:5050 
> --work_dir=/tmp/agent
> {code}
> 3) Start a framework that launches lots of executors, one after another:
> {code}
> sudo src/balloon-framework --master=localhost:5050 --task_memory=64mb 
> --task_memory_usage_limit=256mb --long_running
> {code}
> 4) Check FDs, repeatedly
> {code}
> sudo lsof -i | grep mesos | grep CLOSE_WAIT | wc -l
> {code}
> The number of sockets in {{CLOSE_WAIT}} will increase linearly with the 
> number of launched executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5691) SSL downgrade support will leak sockets in CLOSE_WAIT status

2016-06-22 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5691:


 Summary: SSL downgrade support will leak sockets in CLOSE_WAIT 
status
 Key: MESOS-5691
 URL: https://issues.apache.org/jira/browse/MESOS-5691
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 0.24.0
Reporter: Joseph Wu
Assignee: Joseph Wu
Priority: Blocker
 Fix For: 1.0.0


Repro steps:
1) Start a master:
{code}
bin/mesos-master.sh --work_dir=/tmp/master
{code}

2) Start an agent with SSL and downgrade enabled:
{code}
# Taken from http://mesos.apache.org/documentation/latest/ssl/
openssl genrsa -des3 -f4 -passout pass:some_password -out key.pem 4096
openssl req -new -x509 -passin pass:some_password -days 365 -key key.pem -out 
cert.pem

SSL_KEY_FILE=key.pem SSL_CERT_FILE=cert.pem SSL_ENABLED=true 
SSL_SUPPORT_DOWNGRADE=true sudo -E bin/mesos-agent.sh --master=localhost:5050 
--work_dir=/tmp/agent
{code}

3) Start a framework that launches lots of executors, one after another:
{code}
sudo src/balloon-framework --master=localhost:5050 --task_memory=64mb 
--task_memory_usage_limit=256mb --long_running
{code}

4) Check FDs, repeatedly
{code}
sudo lsof -i | grep mesos | grep CLOSE_WAIT | wc -l
{code}

The number of sockets in {{CLOSE_WAIT}} will increase linearly with the number 
of launched executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5683) Can't see the finished tasks when run the Java example framework

2016-06-22 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344540#comment-15344540
 ] 

Joseph Wu commented on MESOS-5683:
--

[~ZLuo], most of the example frameworks will run and exit relatively quickly.  
When a framework "completes", the associated tasks are moved to a separate 
section of the web UI: {{http://localhost:5050/#/frameworks}}

> Can't see the finished tasks when run the Java example framework
> 
>
> Key: MESOS-5683
> URL: https://issues.apache.org/jira/browse/MESOS-5683
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Zhigang Luo
>
> Following the steps in "Getting Started" and running the example framework 
> (Java), I can't see the finished tasks on the Mesos web page 
> (http://127.0.0.1:5050).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5660) ContainerizerTest.ROOT_CGROUPS_BalloonFramework fails because executor environment isn't inherited

2016-06-21 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-5660:


Assignee: Joseph Wu  (was: Jan Schlicht)

> ContainerizerTest.ROOT_CGROUPS_BalloonFramework fails because executor 
> environment isn't inherited
> --
>
> Key: MESOS-5660
> URL: https://issues.apache.org/jira/browse/MESOS-5660
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Jan Schlicht
>Assignee: Joseph Wu
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> A recent change forbids the executor from inheriting environment variables 
> from the agent's environment. As a regression, this breaks 
> {{ContainerizerTest.ROOT_CGROUPS_BalloonFramework}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5660) ContainerizerTest.ROOT_CGROUPS_BalloonFramework fails because executor environment isn't inherited

2016-06-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343094#comment-15343094
 ] 

Joseph Wu commented on MESOS-5660:
--

Another fix: https://reviews.apache.org/r/49054/

> ContainerizerTest.ROOT_CGROUPS_BalloonFramework fails because executor 
> environment isn't inherited
> --
>
> Key: MESOS-5660
> URL: https://issues.apache.org/jira/browse/MESOS-5660
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> A recent change forbids the executor from inheriting environment variables 
> from the agent's environment. As a regression, this breaks 
> {{ContainerizerTest.ROOT_CGROUPS_BalloonFramework}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5576:
-
Issue Type: Improvement  (was: Bug)

Changing type from {{Bug}} to {{Improvement}} because the masters will still 
recover *eventually* in this case.  Bad sockets are cleaned out when the 
masters abort due to {{--registry_fetch_timeout}}.

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336709#comment-15336709
 ] 

Joseph Wu commented on MESOS-4087:
--

Sounds like you're trying to build a custom solution for your specific 
framework.  You might want to ask in the Spark community how they've done 
logging.

The {{ContainerLogger}} (this JIRA) is meant to encompass the stdout/stderr of 
*any* executor, and involves loading a module into your agents.  If you are 
willing to dip into C++, you can write your own appender/forwarder.  Examples:
https://github.com/apache/mesos/tree/master/src/slave/container_loggers
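
For flavor, here is a deliberately generic sketch of the "appender" half (this 
is not the actual {{ContainerLogger}} module interface; see the linked 
directory for real examples): read a child's output from a pipe and append it 
to a file.

{code}
// Generic appender sketch: read log lines from stdin (e.g. a pipe attached to
// an executor's stdout) and append them to a file, flushing as we go so the
// file stays current even if the writer dies.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
  const std::string path = (argc > 1) ? argv[1] : "stdout.log";

  std::ofstream out(path, std::ios::app);
  if (!out) {
    std::cerr << "Failed to open " << path << std::endl;
    return 1;
  }

  std::string line;
  while (std::getline(std::cin, line)) {
    out << line << '\n';
    out.flush();
  }

  return 0;
}
{code}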

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container

2016-06-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336464#comment-15336464
 ] 

Joseph Wu commented on MESOS-4248:
--

This might be related to what you want: [MESOS-5544].

> mesos slave can't start in CentOS-7 docker container
> 
>
> Key: MESOS-4248
> URL: https://issues.apache.org/jira/browse/MESOS-4248
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.26.0
> Environment: My host OS is Debian Jessie,  the container OS is CentOS 
> 7.2.
> {code}
> # cat /etc/system-release
> CentOS Linux release 7.2.1511 (Core) 
> # rpm -qa |grep mesos
> mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64
> mesosphere-el-repo-7-1.noarch
> mesos-0.26.0-0.2.145.centos701406.x86_64
> $ docker version
> Client:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> Server:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> {code}
>Reporter: Yubao Liu
>
> // Check the "Environment" label above for kinds of software versions.
> "systemctl start mesos-slave" can't start mesos-slave:
> {code}
> # journalctl -u mesos-slave
> 
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave...
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 
> logging.cpp:172] INFO level logging started!
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 
> main.cpp:190] Build: 2015-12-16 23:06:16 by root
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 
> main.cpp:192] Version: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 
> main.cpp:195] Git tag: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 
> main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 
> containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 
> linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.220279 12838 
> systemd.cpp:128] systemd version `219` detected
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.227017 12838 
> systemd.cpp:210] Started systemd slice `mesos_executors.slice`
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: Failed to create a 
> containerizer: Could not create MesosContainerizer: Failed to create 
> launcher: Failed to locate systemd cgroups hierarchy: does not exist
> Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Unit mesos-slave.service entered 
> failed state.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service failed.
> {code}
> I used strace to debug it: mesos-slave tried to access 
> "/sys/fs/cgroup/systemd/mesos_executors.slice", but it's actually at 
> "/sys/fs/cgroup/systemd/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope/mesos_executors.slice/".
> mesos-slave should check "/proc/self/cgroup" to find those intermediate 
> directories:
> {code}
> # cat /proc/self/cgroup 
> 8:perf_event:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 7:blkio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 6:net_cls,net_prio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 5:freezer:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 4:devices:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 3:cpu,cpuacct:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 2:cpuset:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 1:name=systemd:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336435#comment-15336435
 ] 

Joseph Wu commented on MESOS-4087:
--

Just to clarify, are you looking at the stdout/stderr of your {{spark-submit}} 
command?  Or are you looking at the [agent 
sandboxes|http://mesos.apache.org/documentation/latest/sandbox/#where-is-it] 
for your spark executors?

Under the default settings, the spark executors' sandboxes will have a 
{{stdout}} and {{stderr}} file for their stdout/stderr logging.  If {{log4j}} 
places logs in a different location, you'll have to check that location.

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5619) Add task_num to mesos-executor

2016-06-16 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334584#comment-15334584
 ] 

Joseph Wu commented on MESOS-5619:
--

[~klausma] This ticket has the {{cli}} component, are you proposing a change to 
{{mesos-execute}} (the command scheduler) or {{mesos-executor}} (the command 
executor)?

> Add task_num to mesos-executor
> --
>
> Key: MESOS-5619
> URL: https://issues.apache.org/jira/browse/MESOS-5619
> Project: Mesos
>  Issue Type: Bug
>  Components: cli
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> According to the current code, {{mesos-executor}} will only launch one task. 
> It's better to add a parameter to specify how many tasks to launch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332562#comment-15332562
 ] 

Joseph Wu commented on MESOS-5576:
--

Some cleanup while reviewing the associated log replica tests:
|| Review || Summary ||
| https://reviews.apache.org/r/48571/ | Initialization asserts |
| https://reviews.apache.org/r/48572/ | Whitespace |
| https://reviews.apache.org/r/48573/ | Added check for non-deprecated field |
| https://reviews.apache.org/r/48574/ | Tweak CoordinatorTest.Elect |
| https://reviews.apache.org/r/48753/ | Tweak RecoverTest.CatchupRetry |
| https://reviews.apache.org/r/48752/ | Test summary comments |

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-10 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325281#comment-15325281
 ] 

Joseph Wu commented on MESOS-5576:
--

Some offline discussion notes with [~bmahler] & [~benjaminhindman]:

* We can fix this immediate case by implementing this TODO: 
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L168
** Fixing the above ^ only alleviates the stale sockets we observed in this 
case.  Other network partitions, like an unresponsive but not-broken socket, 
are unchanged.
** Unlinking may suffer from regressions due to 1) lack of an 
integration/regression test, 2) if we add a separate actor that links between 
masters in future, unlinking may race with this actor.
** Investigate {{link(...)}}-level heartbeats to detect stale sockets.
* We should look into retrying broadcasts, i.e. the 
{{ImplicitPromiseRequest}} sends one broadcast and then waits *indefinitely* 
for a quorum of responses.
** See: 
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/consensus.cpp#L332-L373
** This indefinite wait is cut short by the {{--registry_fetch_timeout}}, when 
the master exits.
** We can effectively add an application-level heartbeat by automatically 
retrying the broadcast after some timeout (see the sketch below).  Same for 
each phase of the consensus algorithm.
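
A minimal sketch of the retry idea: {{Future::after}} is the existing 
libprocess timeout primitive, while {{broadcast()}} below is a hypothetical 
stand-in for one phase of the consensus protocol.

{code}
// Sketch: instead of waiting indefinitely for a quorum, discard the attempt
// after a timeout and broadcast again, which also forces fresh sockets to any
// restarted replicas.
#include <process/future.hpp>

#include <stout/duration.hpp>

using process::Future;

// Hypothetical: sends the request to all replicas, completes on quorum.
Future<bool> broadcast();

Future<bool> broadcastWithRetry(const Duration& timeout)
{
  return broadcast()
    .after(timeout, [=](Future<bool> attempt) -> Future<bool> {
      attempt.discard();                    // Give up on the stale attempt.
      return broadcastWithRetry(timeout);   // Retry (acts as a heartbeat).
    });
}
{code}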

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-10 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-5576:


Assignee: Joseph Wu

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-10 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5576:
-
  Sprint: Mesosphere Sprint 37
Story Points: 5

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5174) Update the balloon-framework to run on test clusters

2016-06-09 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5174:
-
Sprint: Mesosphere Sprint 33, Mesosphere Sprint 37  (was: Mesosphere Sprint 
33)

> Update the balloon-framework to run on test clusters
> 
>
> Key: MESOS-5174
> URL: https://issues.apache.org/jira/browse/MESOS-5174
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, technical debt
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere, tech-debt
>
> There are a couple of problems with the balloon framework that prevent it 
> from being deployed (easily) on an actual cluster:
> * The framework accepts 100% of memory in an offer.  This means the expected 
> behavior (finish or OOM) is dependent on the offer size.
> * The framework assumes the {{balloon-executor}} binary is available on each 
> agent.  This is generally only true in the build environment or in 
> single-agent test environments.
> * The framework does not specify CPUs with the executor.  This is required by 
> many isolators.
> * The executor's {{TASK_FINISHED}} logic path was untested and is flaky.
> * The framework has no metrics.
> * The framework only launches a single task and then exits.  With this 
> behavior, we can't have useful metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-08 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5576:
-
Description: 
We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by 
downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower 
| Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader 
|| Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks 
to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |

Master 2 sends a series of messages to the recently-restarted Master 5.  The 
first message is dropped, but subsequent messages are not dropped.

This appears to be due to a stale link between the masters.  Before leader 
election, the replicated log actors create a network watcher, which adds links 
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159

This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
perhaps due to how the network partition was induced (in the hypervisor layer, 
rather than in the VM itself).

When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
observe the [expected log 
message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]

Instead, we see a log line in Master 2:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
not connected
{code}

The broken link is removed by the libprocess {{socket_manager}} and the 
following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new socket.

  was:
We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by 
downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower 
| Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader 
|| Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks 
to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |

Master 1 sends a series of messages to the recently-restarted Master 5.  The 
first message is dropped, but subsequent messages are not dropped.

This appears to be due to a stale link between the masters.  Before leader 
election, the replicated log actors create a network watcher, which adds links 
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159

This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
perhaps due to how the network partition was induced (in the hypervisor layer, 
rather than in the VM itself).

When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
observe the [expected log 
message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]

Instead, we see a log line in Master 2:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
not connected
{code}

The broken link is removed by the libprocess {{socket_manager}} and the 
following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new socket.


> Masters m

[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-08 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5576:
-
Description: 
We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by 
downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower 
| Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader 
|| Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks 
to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |

Master 1 sends a series of messages to the recently-restarted Master 5.  The 
first message is dropped, but subsequent messages are not dropped.

This appears to be due to a stale link between the masters.  Before leader 
election, the replicated log actors create a network watcher, which adds links 
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159

This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
perhaps due to how the network partition was induced (in the hypervisor layer, 
rather than in the VM itself).

When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
observe the [expected log 
message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]

Instead, we see a log line in Master 2:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
not connected
{code}

The broken link is removed by the libprocess {{socket_manager}} and the 
following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new socket.

  was:
We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by 
downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower 
| Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader 
|| Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks 
to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |

Master 1 sends a series of messages to the recently-restarted Master 5.  The 
first message is dropped, but subsequent messages are not dropped.

This appears to be due to a stale link between the masters.  Before leader 
election, the replicated log actors create a network watcher, which adds links 
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159

This link does not appear to break (Master 1 -> 5) when Master 5 goes down, 
perhaps due to how the network partition was induced (in the hypervisor layer, 
rather than in the VM itself).

When Master 1 tries to send a {{PromiseRequest}} to Master 5, we do not 
observe the [expected log 
message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]

Instead, we see a log line in Master 1:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
not connected
{code}

The broken link is removed by the libprocess {{socket_manager}} and the 
following {{WriteRequest}} from Master 1 to Master 5 succeeds via a new socket.


> Masters m

[jira] [Created] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-08 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5576:


 Summary: Masters may drop the first message they send between 
masters after a network partition
 Key: MESOS-5576
 URL: https://issues.apache.org/jira/browse/MESOS-5576
 Project: Mesos
  Issue Type: Bug
  Components: leader election, master, replicated log
Affects Versions: 0.28.2
 Environment: Observed in an OpenStack environment where each master 
lives on a separate VM.
Reporter: Joseph Wu


We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by 
downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower 
| Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader 
|| Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks 
to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |

Master 1 sends a series of messages to the recently-restarted Master 5.  The 
first message is dropped, but subsequent messages are not dropped.

This appears to be due to a stale link between the masters.  Before leader 
election, the replicated log actors create a network watcher, which adds links 
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159

This link does not appear to break (Master 1 -> 5) when Master 5 goes down, 
perhaps due to how the network partition was induced (in the hypervisor layer, 
rather than in the VM itself).

When Master 1 tries to send a {{PromiseRequest}} to Master 5, we do not 
observe the [expected log 
message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]

Instead, we see a log line in Master 1:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
not connected
{code}

The broken link is removed by the libprocess {{socket_manager}} and the 
following {{WriteRequest}} from Master 1 to Master 5 succeeds via a new socket.
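
To illustrate why exactly one message is lost, here is a self-contained sketch 
of the cached-link behavior (this is not libprocess; {{FakeSocket}}, 
{{SocketManager}}, and the host names are invented for the example):
{code}
#include <iostream>
#include <map>
#include <string>

// Illustrative only, not libprocess.  Sketches why exactly one message is
// lost: sends reuse a cached ("linked") socket; a send on a stale socket
// fails, the cache entry is dropped, and the *next* send opens a fresh
// connection and succeeds.
struct FakeSocket {
  bool stale = false;
  bool send(const std::string&) { return !stale; }
};

class SocketManager {
public:
  // Returns true if `msg` was delivered.
  bool send(const std::string& host, const std::string& msg) {
    auto it = sockets_.find(host);
    if (it == sockets_.end()) {
      it = sockets_.emplace(host, FakeSocket{}).first;  // New connection.
    }
    if (!it->second.send(msg)) {
      std::cout << "Failed to send \"" << msg << "\"; dropping socket\n";
      sockets_.erase(it);  // Analogous to removing the broken link.
      return false;        // This message itself is lost.
    }
    std::cout << "Delivered \"" << msg << "\"\n";
    return true;
  }

  // Simulates the peer going away without our end noticing (e.g. a
  // partition induced below the VM).
  void markStale(const std::string& host) {
    auto it = sockets_.find(host);
    if (it != sockets_.end()) {
      it->second.stale = true;
    }
  }

private:
  std::map<std::string, FakeSocket> sockets_;
};

int main() {
  SocketManager manager;
  manager.send("master5", "link");            // Establishes the cached link.
  manager.markStale("master5");               // Master 5 is partitioned.
  manager.send("master5", "PromiseRequest");  // First message: dropped.
  manager.send("master5", "WriteRequest");    // New socket: delivered.
  return 0;
}
{code}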



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5174) Update the balloon-framework to run on test clusters

2016-06-06 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235663#comment-15235663
 ] 

Joseph Wu edited comment on MESOS-5174 at 6/7/16 12:31 AM:
---

|| Review || Summary ||
| https://reviews.apache.org/r/46407/ | Balloon executor changes | 
| https://reviews.apache.org/r/48299/ | Spacing/logging |
| https://reviews.apache.org/r/45604/ | Flags and resource math |
| https://reviews.apache.org/r/46411/ | Terminal status updates | 
| https://reviews.apache.org/r/48303/ | Split scheduler into process |
| https://reviews.apache.org/r/45905/ | Metrics | 


was (Author: kaysoky):
|| Review || Summary ||
| https://reviews.apache.org/r/46407/ | Balloon executor changes | 
| https://reviews.apache.org/r/45604/ | First 3 bullet points in the 
description |
| https://reviews.apache.org/r/46411/ | Terminal status updates | 
| https://reviews.apache.org/r/45905/ | Metrics | 

> Update the balloon-framework to run on test clusters
> 
>
> Key: MESOS-5174
> URL: https://issues.apache.org/jira/browse/MESOS-5174
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, technical debt
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere, tech-debt
>
> There are a couple of problems with the balloon framework that prevent it 
> from being deployed (easily) on an actual cluster:
> * The framework accepts 100% of memory in an offer.  This means the expected 
> behavior (finish or OOM) is dependent on the offer size.
> * The framework assumes the {{balloon-executor}} binary is available on each 
> agent.  This is generally only true in the build environment or in 
> single-agent test environments.
> * The framework does not specify CPUs with the executor.  This is required by 
> many isolators.
> * The executor's {{TASK_FINISHED}} logic path was untested and is flaky.
> * The framework has no metrics.
> * The framework only launches a single task and then exits.  With this 
> behavior, we can't have useful metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON

2016-05-30 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307136#comment-15307136
 ] 

Joseph Wu commented on MESOS-4642:
--

Only the new V1 Operator API replacement for the {{/files/read}} endpoint will 
be affected.  We're not making any changes to how we render JSON globally.  But 
the new API won't have the problem described in this JIRA.

> Mesos Agent Json API can dump binary data from log files out as invalid JSON
> 
>
> Key: MESOS-4642
> URL: https://issues.apache.org/jira/browse/MESOS-4642
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Affects Versions: 0.27.0
>Reporter: Steven Schlansker
>Priority: Critical
> Fix For: 1.0.0
>
>
> One of our tasks accidentally started logging binary data to stderr.  This 
> was not intentional and generally should not happen -- however, it causes 
> severe problems with the Mesos Agent "files/read.json" API, since it gladly 
> dumps this binary data out as invalid JSON.
> {code}
> # hexdump -C /path/to/task/stderr | tail
> 0003d1f0  6f 6e 6e 65 63 74 69 6f  6e 0a 4e 45 54 3a 20 31  |onnection.NET: 1|
> 0003d200  20 6f 6e 72 65 61 64 20  45 4e 4f 45 4e 54 20 32  | onread ENOENT 2|
> 0003d210  39 35 34 35 36 20 32 35  31 20 32 39 35 37 30 37  |95456 251 295707|
> 0003d220  0a 01 00 00 00 00 00 00  ac 57 65 64 2c 20 31 30  |.Wed, 10|
> 0003d230  20 55 6e 72 65 63 6f 67  6e 69 7a 65 64 20 69 6e  | Unrecognized in|
> 0003d240  70 75 74 20 68 65 61 64  65 72 0a |put header.|
> {code}
> {code}
> # curl 
> 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep='
>  | hexdump -C
> 7970  6e 65 63 74 69 6f 6e 5c  6e 4e 45 54 3a 20 31 20  |nection\nNET: 1 |
> 7980  6f 6e 72 65 61 64 20 45  4e 4f 45 4e 54 20 32 39  |onread ENOENT 29|
> 7990  35 34 35 36 20 32 35 31  20 32 39 35 37 30 37 5c  |5456 251 295707\|
> 79a0  6e 5c 75 30 30 30 31 5c  75 30 30 30 30 5c 75 30  |n\u0001\u\u0|
> 79b0  30 30 30 5c 75 30 30 30  30 5c 75 30 30 30 30 5c  |000\u\u\|
> 79c0  75 30 30 30 30 5c 75 30  30 30 30 ac 57 65 64 2c  |u\u.Wed,|
> 79d0  20 31 30 20 55 6e 72 65  63 6f 67 6e 69 7a 65 64  | 10 Unrecognized|
> 79e0  20 69 6e 70 75 74 20 68  65 61 64 65 72 5c 6e 22  | input header\n"|
> 79f0  2c 22 6f 66 66 73 65 74  22 3a 32 32 30 34 34 33  |,"offset":220443|
> 7a00  7d|}|
> {code}
> This causes downstream sadness:
> {code}
> ERROR [2016-02-10 18:55:12,303] 
> io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 
> 0ee749630f8b26f1
> ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac
> !  at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 
> 1, column: 31181]
> ! at 
> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) 
> ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserializeFromObject(SuperSonicBeanDeserializer.java:196)
>  ~[singularity-0.4.9.jar:0.4.9

[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON

2016-05-27 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305063#comment-15305063
 ] 

Joseph Wu commented on MESOS-4642:
--

Looks like the protobuf response in the V1 operator API will neatly side-step 
this issue.  (By effectively creating a new endpoint.)

The response protobuf in the document is:
{code}
message FileContents {
  repeated byte bytes = 1;
}
{code}

The {{byte}} type becomes a base64 encoded string, which will always be valid 
JSON.
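
To illustrate the point, a minimal base64 encoder in C++ (purely an example, 
not the proposed API; the {{data}} field name in the output is hypothetical): 
the base64 alphabet is plain ASCII, so arbitrary binary bytes can always be 
embedded in a JSON string.
{code}
#include <iostream>
#include <string>
#include <vector>

// Minimal base64 encoder (RFC 4648 alphabet), for illustration only.  The
// output alphabet is plain ASCII, so no invalid UTF-8 can leak into the
// JSON response the way raw bytes did with /files/read.json.
std::string base64(const std::vector<unsigned char>& in) {
  static const char* alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  size_t i = 0;
  while (i + 3 <= in.size()) {
    unsigned v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
    out += alphabet[(v >> 18) & 0x3F];
    out += alphabet[(v >> 12) & 0x3F];
    out += alphabet[(v >> 6) & 0x3F];
    out += alphabet[v & 0x3F];
    i += 3;
  }
  if (in.size() - i == 1) {
    unsigned v = in[i] << 16;
    out += alphabet[(v >> 18) & 0x3F];
    out += alphabet[(v >> 12) & 0x3F];
    out += "==";
  } else if (in.size() - i == 2) {
    unsigned v = (in[i] << 16) | (in[i + 1] << 8);
    out += alphabet[(v >> 18) & 0x3F];
    out += alphabet[(v >> 12) & 0x3F];
    out += alphabet[(v >> 6) & 0x3F];
    out += '=';
  }
  return out;
}

int main() {
  // Includes the 0xac byte from the reporter's hexdump, which is what broke
  // the downstream JSON parser.
  std::vector<unsigned char> raw = {0x0a, 0x01, 0x00, 0xac, 0x57, 0x65, 0x64};
  // The "data" field name here is hypothetical.
  std::cout << "{\"data\":\"" << base64(raw) << "\"}" << std::endl;
  return 0;
}
{code}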

> Mesos Agent Json API can dump binary data from log files out as invalid JSON
> 
>
> Key: MESOS-4642
> URL: https://issues.apache.org/jira/browse/MESOS-4642
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Affects Versions: 0.27.0
>Reporter: Steven Schlansker
>Priority: Critical
> Fix For: 1.0.0
>
>
> One of our tasks accidentally started logging binary data to stderr.  This 
> was not intentional and generally should not happen -- however, it causes 
> severe problems with the Mesos Agent "files/read.json" API, since it gladly 
> dumps this binary data out as invalid JSON.
> {code}
> # hexdump -C /path/to/task/stderr | tail
> 0003d1f0  6f 6e 6e 65 63 74 69 6f  6e 0a 4e 45 54 3a 20 31  |onnection.NET: 1|
> 0003d200  20 6f 6e 72 65 61 64 20  45 4e 4f 45 4e 54 20 32  | onread ENOENT 2|
> 0003d210  39 35 34 35 36 20 32 35  31 20 32 39 35 37 30 37  |95456 251 295707|
> 0003d220  0a 01 00 00 00 00 00 00  ac 57 65 64 2c 20 31 30  |.Wed, 10|
> 0003d230  20 55 6e 72 65 63 6f 67  6e 69 7a 65 64 20 69 6e  | Unrecognized in|
> 0003d240  70 75 74 20 68 65 61 64  65 72 0a |put header.|
> {code}
> {code}
> # curl 
> 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep='
>  | hexdump -C
> 7970  6e 65 63 74 69 6f 6e 5c  6e 4e 45 54 3a 20 31 20  |nection\nNET: 1 |
> 7980  6f 6e 72 65 61 64 20 45  4e 4f 45 4e 54 20 32 39  |onread ENOENT 29|
> 7990  35 34 35 36 20 32 35 31  20 32 39 35 37 30 37 5c  |5456 251 295707\|
> 79a0  6e 5c 75 30 30 30 31 5c  75 30 30 30 30 5c 75 30  |n\u0001\u\u0|
> 79b0  30 30 30 5c 75 30 30 30  30 5c 75 30 30 30 30 5c  |000\u\u\|
> 79c0  75 30 30 30 30 5c 75 30  30 30 30 ac 57 65 64 2c  |u\u.Wed,|
> 79d0  20 31 30 20 55 6e 72 65  63 6f 67 6e 69 7a 65 64  | 10 Unrecognized|
> 79e0  20 69 6e 70 75 74 20 68  65 61 64 65 72 5c 6e 22  | input header\n"|
> 79f0  2c 22 6f 66 66 73 65 74  22 3a 32 32 30 34 34 33  |,"offset":220443|
> 7a00  7d|}|
> {code}
> This causes downstream sadness:
> {code}
> ERROR [2016-02-10 18:55:12,303] 
> io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 
> 0ee749630f8b26f1
> ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac
> !  at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 
> 1, column: 31181]
> ! at 
> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) 
> ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.module.afterburner.deser.SuperSoni

[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON

2016-05-27 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304965#comment-15304965
 ] 

Joseph Wu commented on MESOS-4642:
--

There isn't a straightforward solution for this one.  Our options are to make 
a small breaking change, omit data from the file, or create a new analogous 
endpoint (and have frameworks use that one instead).

> Mesos Agent Json API can dump binary data from log files out as invalid JSON
> 
>
> Key: MESOS-4642
> URL: https://issues.apache.org/jira/browse/MESOS-4642
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Affects Versions: 0.27.0
>Reporter: Steven Schlansker
>Priority: Critical
> Fix For: 1.0.0
>
>
> One of our tasks accidentally started logging binary data to stderr.  This 
> was not intentional and generally should not happen -- however, it causes 
> severe problems with the Mesos Agent "files/read.json" API, since it gladly 
> dumps this binary data out as invalid JSON.
> {code}
> # hexdump -C /path/to/task/stderr | tail
> 0003d1f0  6f 6e 6e 65 63 74 69 6f  6e 0a 4e 45 54 3a 20 31  |onnection.NET: 1|
> 0003d200  20 6f 6e 72 65 61 64 20  45 4e 4f 45 4e 54 20 32  | onread ENOENT 2|
> 0003d210  39 35 34 35 36 20 32 35  31 20 32 39 35 37 30 37  |95456 251 295707|
> 0003d220  0a 01 00 00 00 00 00 00  ac 57 65 64 2c 20 31 30  |.Wed, 10|
> 0003d230  20 55 6e 72 65 63 6f 67  6e 69 7a 65 64 20 69 6e  | Unrecognized in|
> 0003d240  70 75 74 20 68 65 61 64  65 72 0a |put header.|
> {code}
> {code}
> # curl 
> 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep='
>  | hexdump -C
> 7970  6e 65 63 74 69 6f 6e 5c  6e 4e 45 54 3a 20 31 20  |nection\nNET: 1 |
> 7980  6f 6e 72 65 61 64 20 45  4e 4f 45 4e 54 20 32 39  |onread ENOENT 29|
> 7990  35 34 35 36 20 32 35 31  20 32 39 35 37 30 37 5c  |5456 251 295707\|
> 79a0  6e 5c 75 30 30 30 31 5c  75 30 30 30 30 5c 75 30  |n\u0001\u\u0|
> 79b0  30 30 30 5c 75 30 30 30  30 5c 75 30 30 30 30 5c  |000\u\u\|
> 79c0  75 30 30 30 30 5c 75 30  30 30 30 ac 57 65 64 2c  |u\u.Wed,|
> 79d0  20 31 30 20 55 6e 72 65  63 6f 67 6e 69 7a 65 64  | 10 Unrecognized|
> 79e0  20 69 6e 70 75 74 20 68  65 61 64 65 72 5c 6e 22  | input header\n"|
> 79f0  2c 22 6f 66 66 73 65 74  22 3a 32 32 30 34 34 33  |,"offset":220443|
> 7a00  7d|}|
> {code}
> This causes downstream sadness:
> {code}
> ERROR [2016-02-10 18:55:12,303] 
> io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 
> 0ee749630f8b26f1
> ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac
> !  at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 
> 1, column: 31181]
> ! at 
> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) 
> ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserializeFromObject(SuperSonicBeanDeserializer.java:196)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 

[jira] [Updated] (MESOS-5472) Hadoop-free S3 fetcher

2016-05-27 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5472:
-

We will consider adding an {{S3}} plugin once we finish moving the 
{{mesos-fetcher}} to the URI fetcher (MESOS-3918).

> Hadoop-free S3 fetcher
> --
>
> Key: MESOS-5472
> URL: https://issues.apache.org/jira/browse/MESOS-5472
> Project: Mesos
>  Issue Type: Wish
>  Components: fetcher
>Reporter: Marc Villacorta
>Priority: Minor
>
> My mesos agents are running on systems without Hadoop.
> I would like to fetch _S3_ uris into my sandboxes.
> How about using the _'awscli'_?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5425) Consider using IntervalSet for Port range resource math

2016-05-27 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304353#comment-15304353
 ] 

Joseph Wu commented on MESOS-5425:
--

[~yanyanhu], can you post your existing work on Reviewboard?  The performance 
improvements look promising and I'd be happy to help review.  

> Consider using IntervalSet for Port range resource math
> ---
>
> Key: MESOS-5425
> URL: https://issues.apache.org/jira/browse/MESOS-5425
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Joseph Wu
>  Labels: mesosphere
>
> Follow-up JIRA for comments raised in MESOS-3051 (see comments there).
> We should consider utilizing 
> [{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp]
>  in [Port range resource 
> math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2082) Update the webui to include maintenance information.

2016-05-25 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300525#comment-15300525
 ] 

Joseph Wu commented on MESOS-2082:
--

I can help you do preliminary reviews and/or answer any questions you have.  
But I suspect many people (especially shepherds) will not have cycles to spare 
until some time after MesosCon.

> Update the webui to include maintenance information.
> 
>
> Key: MESOS-2082
> URL: https://issues.apache.org/jira/browse/MESOS-2082
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: mesosphere, twitter
>
> The simplest thing here would probably be to include another tab in the 
> header for maintenance information.
> We could also consider adding maintenance information inline to the slaves 
> table. Depending on how this is done, the maintenance tab could actually be a 
> subset of the slaves table; only those slaves for which there is maintenance 
> information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5449) Memory leak in SchedulerProcess.declineOffer

2016-05-24 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5449:
-
Shepherd: Vinod Kone

> Memory leak in SchedulerProcess.declineOffer
> 
>
> Key: MESOS-5449
> URL: https://issues.apache.org/jira/browse/MESOS-5449
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.26.1, 0.28.1
>Reporter: Dario Rexin
>Assignee: Dario Rexin
>Priority: Blocker
> Fix For: 0.29.0, 0.27.3, 0.28.2, 0.26.2
>
>
> MesosScheduler.declineOffers was changed ~6 months ago to send a Decline 
> message instead of calling acceptOffers with an empty list of task infos. The 
> changed version of declineOffer, however, did not remove the offerId from the 
> savedOffers map, causing a memory leak.
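
A minimal sketch of the leak pattern and the fix (a toy illustration, not the 
actual scheduler driver code; {{ToyDriver}} and its methods are invented):
{code}
#include <iostream>
#include <map>
#include <string>

// Toy illustration of the leak described above -- not the actual driver
// code.  Offers are cached when received, but a decline path that never
// erases its entry makes the map grow without bound.
struct Offer { std::string id; };

class ToyDriver {
public:
  void offerReceived(const Offer& offer) {
    savedOffers_[offer.id] = offer;
  }

  void acceptOffer(const std::string& offerId) {
    // ... launch tasks ...
    savedOffers_.erase(offerId);  // The accept path cleans up.
  }

  void declineOffer(const std::string& offerId) {
    // ... send the Decline message ...
    savedOffers_.erase(offerId);  // The fix: the decline path must clean up too.
  }

  size_t cached() const { return savedOffers_.size(); }

private:
  std::map<std::string, Offer> savedOffers_;
};

int main() {
  ToyDriver driver;
  for (int i = 0; i < 1000; ++i) {
    driver.offerReceived(Offer{"offer-" + std::to_string(i)});
    driver.declineOffer("offer-" + std::to_string(i));
  }
  // Without the erase() in declineOffer() this would print 1000.
  std::cout << "cached offers: " << driver.cached() << std::endl;
  return 0;
}
{code}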



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5439) registerExecutor problem

2016-05-23 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296759#comment-15296759
 ] 

Joseph Wu commented on MESOS-5439:
--

A couple questions:
* How many tasks are you launching at once?  (i.e. from a single offer)  And 
how many over a given time?
* Are you using the default command executor?  Or are you launching a custom 
executor?
* What flags are you using to launch the agent?
* What do the executor's stdout/stderr files (in the sandbox) say?  There 
should be glog logs in there too.

> registerExecutor problem
> 
>
> Key: MESOS-5439
> URL: https://issues.apache.org/jira/browse/MESOS-5439
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, slave
>Affects Versions: 0.27.0
>Reporter: kimjoohwan
>
> Currently, we are using Mesos 0.27.0. The master is built with an Intel(R) 
> Core(TM) i5-3470 CPU @ 3.20GHz and 4GB of RAM. The slave (Banana Pi) is built 
> with a dual-core Cortex-A7 CPU and 1GB of RAM.
> Using the Mesos API, we have developed a Python-based framework and run it to 
> completion.
> However, we found that too much time (about 5 seconds) passes between the 
> 'Forked child with pid' and 'Got registration for executor' messages in the 
> slave log.
> If you know how to deal with this problem, please let us know.
> I0523 17:38:16.264289  1787 slave.cpp:5208] Launching executor default of 
> framework 3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010 with resources  in work 
> directory 
> '/tmp/mesos/slaves/3fb86eea-96c4-4b07-aaa2-caf071275bdf-S2/frameworks/3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010/executors/default/runs/1c830c9a-4120-4ef0-af80-49a52d307539'
> I0523 17:38:16.290601  1789 containerizer.cpp:616] Starting container 
> '1c830c9a-4120-4ef0-af80-49a52d307539' for executor 'default' of framework 
> '3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010'
> I0523 17:38:16.293285  1787 slave.cpp:1626] Queuing task '0' for executor 
> 'default' of framework 3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010
> I0523 17:38:16.297369  1787 slave.cpp:4233] Current disk usage 2.14%. Max 
> allowed age: 6.150293798159722days
> I0523 17:38:16.504043  1789 launcher.cpp:132] Forked child with pid '1837' 
> for container '1c830c9a-4120-4ef0-af80-49a52d307539'
> I0523 17:38:21.510535  1785 slave.cpp:2573] Got registration for executor 
> 'default' of framework 3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010 from 
> executor(1)@192.168.0.8:56508
> I0523 17:38:21.554608  1785 slave.cpp:1791] Sending queued task '0' to 
> executor 'default' of framework 3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010 at 
> executor(1)@192.168.0.8:56508
> I0523 17:38:21.594511  1789 slave.cpp:2932] Handling status update 
> TASK_RUNNING (UUID: cd04ec2a-0e68-460a-ad2e-e4f504f3b032) for task 0 of 
> framework 3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010 from 
> executor(1)@192.168.0.8:56508
> I0523 17:38:21.600050  1789 slave.cpp:2932] Handling status update 
> TASK_FINISHED (UUID: 46e110c8-4078-4f98-ae30-30b3a1376034) for task 0 of 
> framework 3fb86eea-96c4-4b07-aaa2-caf071275bdf-0010 from 
> executor(1)@192.168.0.8:56508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5427) Mesos master locks up after slave fails to authenticate

2016-05-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294461#comment-15294461
 ] 

Joseph Wu commented on MESOS-5427:
--

Are you running on Ubuntu 10?  (Typo?)  I'm not sure if Mesos builds on that.

Could you try the same setup/configuration with a more recent version of Mesos? 
 The SASL-based authentication code has not changed much.  (It was moved 
around, and is now called the CRAM-MD5 authenticator/authenticatee.)

> Mesos master locks up after slave fails to authenticate
> ---
>
> Key: MESOS-5427
> URL: https://issues.apache.org/jira/browse/MESOS-5427
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.20.1
> Environment: Linux XX-X 3.13.0-49-generic #81-Ubuntu SMP 
> Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
> Ubuntu 10.04.1 LTS
> AWS/8cores/16GB
>Reporter: analogue
>Priority: Minor
>
> In a mesos master cluster with one leader and two backups, a single slave 
> attempting to authenticate with the leader locked up the master and resulted 
> in 2 CPU cores pegged at 100% CPU usage until restarted.
> master
> {noformat}
> I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating 
> slave(1)@10.85.20.76:5051
> I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL 
> connection
> I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> ...
> W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
> I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> (last long line repeats until mesos-master restarted)
> {noformat}
> slave
> {noformat}
> Log file created at: 2016/05/16 02:37:52
> Running on machine: 10-85-20-76-uswest2btestopia
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
> I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
> I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
> I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
> I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: 
> fe0a39112f3304283f970f1b08b322b1e970829d
> I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: 
> cgroups/cpu,cgroups/mem
> I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
> I0516 02:37:52.128687 10246 slave.cpp:167] Sla

[jira] [Commented] (MESOS-5421) Mesos Docker executor taskHealthUpdated removes information about job ipAddresses

2016-05-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293988#comment-15293988
 ] 

Joseph Wu commented on MESOS-5421:
--

[~dfedorov], can you check if MESOS-5294 is the same issue?  (There isn't 
enough information in the bug description.)

> Mesos Docker executor taskHealthUpdated removes information about job 
> ipAddresses
> -
>
> Key: MESOS-5421
> URL: https://issues.apache.org/jira/browse/MESOS-5421
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.1
>Reporter: Dmitry Fedorov
>Priority: Minor
> Fix For: 0.28.2
>
>
> When you create a job with a command health check, the status is correct and 
> the ipAddresses field is present in it right after the job is launched. 
> But after the health status is updated, the ipAddresses field is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3051) performance issues with port ranges comparison

2016-05-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293757#comment-15293757
 ] 

Joseph Wu commented on MESOS-3051:
--

Filed [MESOS-5425] to follow up on further performance improvements.

> performance issues with port ranges comparison
> --
>
> Key: MESOS-3051
> URL: https://issues.apache.org/jira/browse/MESOS-3051
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 0.22.1
>Reporter: James Peach
>Assignee: Joerg Schad
>  Labels: mesosphere
> Fix For: 0.25.0, 0.24.2
>
>
> Testing in an environment with lots of frameworks (>200), where the 
> frameworks permanently decline resources they don't need. The allocator ends 
> up spending a lot of time figuring out whether offers are refused (the code 
> path through {{HierarchicalAllocatorProcess::isFiltered()}}).
> In profiling a synthetic benchmark, it turns out that comparing port ranges 
> is very expensive, involving many temporary allocations. 61% of 
> Resources::contains() run time is in operator -= (Resource). 35% of 
> Resources::contains() run time is in Resources::_contains().
> The heaviest call chain through {{Resources::_contains}} is:
> {code}
> Running Time  Self (ms) Symbol Name
> 7237.0ms   35.5%  4.0
> mesos::Resources::_contains(mesos::Resource const&) const
> 7200.0ms   35.3%  1.0 mesos::contains(mesos::Resource 
> const&, mesos::Resource const&)
> 7133.0ms   35.0%121.0  
> mesos::operator<=(mesos::Value_Ranges const&, mesos::Value_Ranges const&)
> 6319.0ms   31.0%  7.0   
> mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Ranges const&)
> 6240.0ms   30.6%161.0
> mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Range const&)
> 1867.0ms9.1% 25.0 mesos::Value_Ranges::add_range()
> 1694.0ms8.3%  4.0 
> mesos::Value_Ranges::~Value_Ranges()
> 1495.0ms7.3% 16.0 
> mesos::Value_Ranges::operator=(mesos::Value_Ranges const&)
>  445.0ms2.1% 94.0 
> mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
>  154.0ms0.7% 24.0 mesos::Value_Ranges::range(int) 
> const
>  103.0ms0.5% 24.0 
> mesos::Value_Ranges::range_size() const
>   95.0ms0.4%  2.0 
> mesos::Value_Range::Value_Range(mesos::Value_Range const&)
>   59.0ms0.2%  4.0 
> mesos::Value_Ranges::Value_Ranges()
>   50.0ms0.2% 50.0 mesos::Value_Range::begin() 
> const
>   28.0ms0.1% 28.0 mesos::Value_Range::end() const
>   26.0ms0.1%  0.0 
> mesos::Value_Range::~Value_Range()
> {code}
> mesos::coalesce(Value_Ranges) gets done a lot and ends up being really 
> expensive. The heaviest parts of the inverted call chain are:
> {code}
> Running Time  Self (ms)   Symbol Name
> 3209.0ms   15.7%  3209.0  mesos::Value_Range::~Value_Range()
> 3209.0ms   15.7%  0.0  
> google::protobuf::internal::GenericTypeHandler::Delete(mesos::Value_Range*)
> 3209.0ms   15.7%  0.0   void 
> google::protobuf::internal::RepeatedPtrFieldBase::Destroy::TypeHandler>()
> 3209.0ms   15.7%  0.0
> google::protobuf::RepeatedPtrField::~RepeatedPtrField()
> 3209.0ms   15.7%  0.0 
> google::protobuf::RepeatedPtrField::~RepeatedPtrField()
> 3209.0ms   15.7%  0.0  
> mesos::Value_Ranges::~Value_Ranges()
> 3209.0ms   15.7%  0.0   
> mesos::Value_Ranges::~Value_Ranges()
> 2441.0ms   11.9%  0.0
> mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Range const&)
>  452.0ms2.2%  0.0
> mesos::remove(mesos::Value_Ranges*, mesos::Value_Range const&)
>  169.0ms0.8%  0.0
> mesos::operator<=(mesos::Value_Ranges const&, mesos::Value_Ranges const&)
>   82.0ms0.4%  0.0
> mesos::operator-=(mesos::Value_Ranges&, mesos::Value_Ranges const&)
>   65.0ms0.3%  0.0
> mesos::Value_Ranges::~Value_Ranges()
> 2541.0ms   12.4%  2541.0  
> google::protobuf::internal::GenericTypeHandler::New()
> 2541.0ms   12.4%  0.0  
> google::protobuf::RepeatedPtrField::TypeHandler::Type* 
> google::protobuf::internal::RepeatedPtrFieldBase::Add::TypeHandler>()
> 2305.0ms   11.3%  0.0   
> google::protobuf::RepeatedPtrField::Add()
> 2305.0ms   11.3%  0.0mesos::Value_Ranges::add_range()
> 1962.0ms9.6%  0.0   

[jira] [Created] (MESOS-5425) Consider using IntervalSet for Port range resource math

2016-05-20 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5425:


 Summary: Consider using IntervalSet for Port range resource math
 Key: MESOS-5425
 URL: https://issues.apache.org/jira/browse/MESOS-5425
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Joseph Wu


Follow-up JIRA for comments raised in MESOS-3051 (see comments there).

We should consider utilizing 
[{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp]
 in [Port range resource 
math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143].
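
For illustration, a minimal interval-set sketch for port math (this is not 
stout's {{IntervalSet}}; {{PortSet}} is a toy class backed by a sorted map, 
shown only to convey the coalesced-representation idea):
{code}
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

// Illustrative only: a tiny interval set for port ranges backed by a
// sorted map of [begin, end] pairs, kept disjoint and coalesced on insert.
// This is NOT stout's IntervalSet; it just sketches doing range math on a
// canonical representation instead of repeatedly merging Value_Ranges.
class PortSet {
public:
  void add(uint16_t begin, uint16_t end) {
    // Find the first range that could overlap or touch [begin, end].
    auto it = ranges_.lower_bound(begin);
    if (it != ranges_.begin()) {
      auto prev = std::prev(it);
      if (prev->second + 1 >= begin) {
        it = prev;
      }
    }
    // Merge every overlapping or adjacent range into [begin, end].
    while (it != ranges_.end() && it->first <= end + 1) {
      begin = std::min(begin, it->first);
      end = std::max(end, it->second);
      it = ranges_.erase(it);
    }
    ranges_[begin] = end;
  }

  bool contains(uint16_t port) const {
    auto it = ranges_.upper_bound(port);
    if (it == ranges_.begin()) return false;
    --it;
    return port >= it->first && port <= it->second;
  }

  void print() const {
    for (const auto& r : ranges_) {
      std::cout << "[" << r.first << "-" << r.second << "] ";
    }
    std::cout << std::endl;
  }

private:
  std::map<uint16_t, uint16_t> ranges_;  // begin -> end, kept disjoint.
};

int main() {
  PortSet ports;
  ports.add(31000, 31099);
  ports.add(31100, 31199);  // Adjacent: coalesces into [31000-31199].
  ports.add(32000, 32099);
  ports.print();
  std::cout << std::boolalpha << ports.contains(31150) << std::endl;  // true
  return 0;
}
{code}

Keeping the ranges disjoint and coalesced up front makes containment checks 
cheap ordered-map lookups, instead of re-coalescing protobuf {{Value_Ranges}} 
on every comparison.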



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5395) Task getting stuck in staging state if launch it on a rebooted slave.

2016-05-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291648#comment-15291648
 ] 

Joseph Wu commented on MESOS-5395:
--

Nothing in the mesos logs indicates that your task is *not* starting:

From the stdout file, the task you're looking at is
{code}
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e
{code}

The agent logs say that the task started successfully.  These timestamps line 
up very closely with the task's stderr.
{code}
I0518 14:55:19.393923   947 slave.cpp:1361] Got assigned task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-
I0518 14:55:19.394619   947 gc.cpp:83] Unscheduling 
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-'
 from gc
I0518 14:55:19.394680   947 gc.cpp:83] Unscheduling 
'/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-'
 from gc
I0518 14:55:19.394760   947 slave.cpp:1480] Launching task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-
I0518 14:55:19.395539   947 paths.cpp:528] Trying to chown 
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c'
 to user 'root'
I0518 14:55:19.399237   947 slave.cpp:5367] Launching executor 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- with resources cpus(*):0.1; 
mem(*):32 in work directory 
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c'
I0518 14:55:19.399588   947 slave.cpp:1698] Queuing task 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' for 
executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-
I0518 14:55:19.402344   948 docker.cpp:1036] Starting container 
'd3996d05-26f6-4e6c-a89f-8ee9c617182c' for task 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' (and 
executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e') of 
framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-'
...
I0518 14:55:26.880151   952 docker.cpp:623] Checkpointing pid 6331 to 
'/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c/pids/forked.pid'
I0518 14:55:26.907119   952 slave.cpp:2643] Got registration for executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- from 
executor(1)@10.254.234.236:42289
I0518 14:55:26.907639   952 docker.cpp:1316] Ignoring updating container 
'd3996d05-26f6-4e6c-a89f-8ee9c617182c' with resources passed to update is 
identical to existing resources
I0518 14:55:26.907726   952 slave.cpp:1863] Sending queued task 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' to 
executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- at 
executor(1)@10.254.234.236:42289
I0518 14:55:27.622561   952 slave.cpp:3002] Handling status update TASK_RUNNING 
(UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- from 
executor(1)@10.254.234.236:42289
I0518 14:55:27.622762   953 status_update_manager.cpp:320] Received status 
update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-
I0518 14:55:27.622974   953 status_update_manager.cpp:824] Checkpointing UPDATE 
for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for 
task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-
I0518 14:55:27.679003   953 slave.cpp:3400] Forwarding the update TASK_RUNNING 
(UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- to 
master@10.254.226.211:5050
I0518 14:55:27.679095   953 slave.cpp:3310] Sending 

[jira] [Commented] (MESOS-5395) Task getting stuck in staging state if launch it on a rebooted slave.

2016-05-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287096#comment-15287096
 ] 

Joseph Wu commented on MESOS-5395:
--

The log messages you're seeing come from the framework telling Mesos to kill 
said tasks.  There might be something else going on that's preventing your task 
from launching after an agent failover.

Can you also share:
* The resources of your agents
* Full master/agent/Marathon logs before/during/after the event
* Full stdout/stderr files for the task in question
* Your Marathon app definition

> Task getting stuck in staging state if launch it on a rebooted slave.
> -
>
> Key: MESOS-5395
> URL: https://issues.apache.org/jira/browse/MESOS-5395
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: mesos/marathon cluster,  3 maters/4 slaves
> Mesos: 0.28.0 ,  Marathon 0.15.2
>Reporter: Mengkui gong
>
> After rebooting a slave, a task launched via Marathon can start on the other 
> slaves without problems.  But if it is launched on the rebooted slave, the 
> task gets stuck: the Mesos UI shows it in the staging state in the active 
> tasks list, and the Marathon UI shows it as deploying.  It can stay stuck for 
> more than 2 hours.  After that time, Marathon will automatically launch the 
> task on the rebooted slave or another slave as normal, so the rebooted slave 
> is effectively recovered as well after that time.
> In the Mesos log, I can see "telling slave to kill task" all the time.
> I0517 15:25:27.207237 20568 master.cpp:3826] Telling slave 
> 282745ab-423a-4350-a449-3e8cdfccfb93-S1 at slave(1)@10.254.234.236:5050 
> (mesos-slave-3) to kill task 
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of 
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- (marathon) at 
> scheduler-fe615b72-ab92-49ca-89e6-e74e600c7e15@10.254.228.3:56757.
> From rebooted slave log, I can see:
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: I0517 15:28:37.206831   
> 916 slave.cpp:1891] Asked to kill task 
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of 
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: W0517 15:28:37.206866   
> 916 slave.cpp:2018] Ignoring kill task 
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e because 
> the executor 
> 'project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e' of 
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730- is terminating/terminated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5378) Terminating a framework during master failover leads to orphaned tasks

2016-05-12 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5378:


 Summary: Terminating a framework during master failover leads to 
orphaned tasks
 Key: MESOS-5378
 URL: https://issues.apache.org/jira/browse/MESOS-5378
 Project: Mesos
  Issue Type: Bug
  Components: framework, master
Affects Versions: 0.28.1, 0.27.2
Reporter: Joseph Wu


Repro steps:

1) Setup:
{code}
bin/mesos-master.sh --work_dir=/tmp/master
bin/mesos-slave.sh --work_dir=/tmp/slave --master=localhost:5050
src/mesos-execute --checkpoint --command="sleep 1000" --master=localhost:5050 
--name="test"
{code}

2) Kill all three from (1), in the order they were started.

3) Restart the master and agent.  Do not restart the framework.

Result)
* The agent will reconnect to an orphaned task.
* The Web UI will report no memory usage
* {{curl localhost:5050/metrics/snapshot}} will say:  {{"master/mem_used": 
128,}}

Cause) 
When a framework registers with the master, it provides a {{failover_timeout}}, 
in case the framework disconnects.  If the framework disconnects and does not 
reconnect within this {{failover_timeout}}, the master will kill all tasks 
belonging to the framework.

However, the master does not persist this {{failover_timeout}} across master 
failover.  The master will "forget" about a framework if:
1) The master dies before {{failover_timeout}} passes.
2) The framework dies while the master is dead.

When the master comes back up, the agent will re-register.  The agent will 
report the orphaned task(s).  Because the master failed over, it does not know 
these tasks are orphans (i.e. it thinks the frameworks might re-register).

Proposed solution)
The master should save the {{FrameworkID}} and {{failover_timeout}} in the 
registry.  Upon recovery, the master should resume the {{failover_timeout}} 
timers.
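
To make the proposed solution concrete, here is a hypothetical sketch of the timer-resume step. The type and field names ({{RecoveredFramework}}, {{disconnectTime}}) are placeholders for illustration, not the actual master or registry code.

{code}
// Hypothetical sketch: after master failover, re-arm each recovered
// framework's failover timer with only the remaining time.
#include <chrono>
#include <iostream>
#include <string>

struct RecoveredFramework
{
  std::string id;                                        // FrameworkID
  std::chrono::seconds failoverTimeout;                  // persisted in the registry
  std::chrono::system_clock::time_point disconnectTime;  // persisted in the registry
};

int main()
{
  using namespace std::chrono;

  RecoveredFramework framework{
      "example-framework-id", seconds(600), system_clock::now() - seconds(450)};

  seconds elapsed =
    duration_cast<seconds>(system_clock::now() - framework.disconnectTime);
  seconds remaining = framework.failoverTimeout - elapsed;

  if (remaining <= seconds(0)) {
    std::cout << "Framework timed out while the master was down; "
              << "remove its tasks now." << std::endl;
  } else {
    std::cout << "Re-arm removal timer for " << remaining.count()
              << " seconds." << std::endl;
  }

  return 0;
}
{code}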



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5350) Add asynchronous hook for validating docker containerizer tasks

2016-05-10 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279403#comment-15279403
 ] 

Joseph Wu commented on MESOS-5350:
--

Looks like editing comments was disabled :(

For the TODO above: 
| https://reviews.apache.org/r/47216/ | Put hook into the DockerContainerizer |

> Add asynchronous hook for validating docker containerizer tasks
> ---
>
> Key: MESOS-5350
> URL: https://issues.apache.org/jira/browse/MESOS-5350
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: containerizer, hooks, mesosphere
>
> It is possible to plug in custom validation logic for the MesosContainerizer 
> via an {{Isolator}} module, but the same is not true of the 
> DockerContainerizer.
> Basic logic can be plugged into the DockerContainerizer via {{Hooks}}, but 
> this has some notable differences compared to isolators:
> * Hooks are synchronous.
> * Modifications to tasks via Hooks have lower priority compared to the task 
> itself.  i.e. If both the {{TaskInfo}} and 
> {{slaveExecutorEnvironmentDecorator}} define the same environment variable, 
> the {{TaskInfo}} wins.
> * Hooks have no effect if they fail (short of segfaulting)
> i.e. The {{slavePreLaunchDockerHook}} has a return type of {{Try}}:
> https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/include/mesos/hook.hpp#L90
> But the effect of returning an {{Error}} is a log message:
> https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/hook/manager.cpp#L227-L230
> We should add a hook to the DockerContainerizer to narrow this gap.  This new 
> hook would:
> * Be called at roughly the same place as {{slavePreLaunchDockerHook}}
> https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/slave/containerizer/docker.cpp#L1022
> * Return a {{Future}} and require splitting up 
> {{DockerContainerizer::launch}}.
> * Prevent a task from launching if it returns a {{Failure}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5350) Add asynchronous hook for validating docker containerizer tasks

2016-05-10 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279388#comment-15279388
 ] 

Joseph Wu commented on MESOS-5350:
--

Work in progress:

|| Review || Summary ||
| https://reviews.apache.org/r/47149/ | Split 
{{DockerContainerizerProcess::launch}} |
| https://reviews.apache.org/r/47205/ | {{mesos-docker-executer}} 
{{--task_environment}} flag |
| https://reviews.apache.org/r/47212/ | Duplicate {{executorEnvironment}} call |
| https://reviews.apache.org/r/47213/ | {{FlagsBase::toVector}} |
| https://reviews.apache.org/r/47214/ | Subprocess cleanup due to above |
| https://reviews.apache.org/r/47215/ | Dockerized {{mesos-docker-executer}} 
tweak |
| https://reviews.apache.org/r/47150/ | Introduce new hook (partial) |
| TODO | Put hook into the DockerContainerizer |

> Add asynchronous hook for validating docker containerizer tasks
> ---
>
> Key: MESOS-5350
> URL: https://issues.apache.org/jira/browse/MESOS-5350
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: containerizer, hooks, mesosphere
>
> It is possible to plug in custom validation logic for the MesosContainerizer 
> via an {{Isolator}} module, but the same is not true of the 
> DockerContainerizer.
> Basic logic can be plugged into the DockerContainerizer via {{Hooks}}, but 
> this has some notable differences compared to isolators:
> * Hooks are synchronous.
> * Modifications to tasks via Hooks have lower priority compared to the task 
> itself.  i.e. If both the {{TaskInfo}} and 
> {{slaveExecutorEnvironmentDecorator}} define the same environment variable, 
> the {{TaskInfo}} wins.
> * Hooks have no effect if they fail (short of segfaulting)
> i.e. The {{slavePreLaunchDockerHook}} has a return type of {{Try}}:
> https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/include/mesos/hook.hpp#L90
> But the effect of returning an {{Error}} is a log message:
> https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/hook/manager.cpp#L227-L230
> We should add a hook to the DockerContainerizer to narrow this gap.  This new 
> hook would:
> * Be called at roughly the same place as {{slavePreLaunchDockerHook}}
> https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/slave/containerizer/docker.cpp#L1022
> * Return a {{Future}} and require splitting up 
> {{DockerContainerizer::launch}}.
> * Prevent a task from launching if it returns a {{Failure}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277516#comment-15277516
 ] 

Joseph Wu commented on MESOS-5342:
--

We only use GitHub PRs for website/UI-related changes.  Everything else needs 
to go through ReviewBoard.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.
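
As background on what "pinning" means at the cgroup level (this is not the proposed isolator design), a rough sketch assuming cgroup v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset and an already-created cgroup directory:

{code}
// Rough illustration: restrict a process to cores 2-3 by writing the
// cpuset files of an existing cgroup and then adding the PID to it.
#include <sys/types.h>
#include <unistd.h>

#include <fstream>
#include <iostream>
#include <string>

bool pinToCores(const std::string& cgroup, const std::string& cores, pid_t pid)
{
  const std::string base = "/sys/fs/cgroup/cpuset/" + cgroup;

  std::ofstream cpus(base + "/cpuset.cpus");
  std::ofstream mems(base + "/cpuset.mems");
  std::ofstream tasks(base + "/tasks");

  if (!cpus || !mems || !tasks) {
    return false;
  }

  // cpuset.cpus and cpuset.mems must be populated before tasks are added.
  cpus << cores << std::flush;  // e.g. "2-3"
  mems << "0" << std::flush;    // cpuset also requires a memory node
  tasks << pid << std::flush;

  return cpus.good() && mems.good() && tasks.good();
}

int main()
{
  if (!pinToCores("mesos/example-container", "2-3", getpid())) {
    std::cerr << "Failed to pin (does the cgroup exist and are you root?)"
              << std::endl;
  }
  return 0;
}
{code}

The improvement described in the ticket would go further by choosing *which* cores to write based on topology and utilization, rather than taking them as an argument.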



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277245#comment-15277245
 ] 

Joseph Wu commented on MESOS-5342:
--

You can post a link to the document as a JIRA link (we usually use Google Docs, 
but anything will work).

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5351) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes is flaky

2016-05-09 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5351:


 Summary: 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes is 
flaky
 Key: MESOS-5351
 URL: https://issues.apache.org/jira/browse/MESOS-5351
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: GCC 4.9
CentOS 7 and Fedora 23 (Both SSL or no-SSL)
Reporter: Joseph Wu


Consistently fails on Mesosphere internal CI:
{code}
[14:38:12] :   [Step 10/10] [ RUN  ] 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes
[14:38:12]W:   [Step 10/10] I0509 14:38:12.782032  2386 cluster.cpp:149] 
Creating default 'local' authorizer
[14:38:12]W:   [Step 10/10] I0509 14:38:12.786592  2386 leveldb.cpp:174] Opened 
db in 4.462265ms
[14:38:12]W:   [Step 10/10] I0509 14:38:12.787979  2386 leveldb.cpp:181] 
Compacted db in 1.368995ms
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788007  2386 leveldb.cpp:196] 
Created db iterator in 4994ns
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788014  2386 leveldb.cpp:202] Seeked 
to beginning of db in 724ns
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788019  2386 leveldb.cpp:271] 
Iterated through 0 keys in the db in 388ns
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788031  2386 replica.cpp:779] 
Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788249  2402 recover.cpp:447] 
Starting replica recovery
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788316  2402 recover.cpp:473] 
Replica is in EMPTY status
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788684  2406 replica.cpp:673] 
Replica in EMPTY status received a broadcasted recover request from 
(18057)@172.30.2.145:48816
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788744  2405 recover.cpp:193] 
Received a recover response from a replica in EMPTY status
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788869  2400 recover.cpp:564] 
Updating replica status to STARTING
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789206  2406 master.cpp:383] Master 
6c04237d-91d6-4a05-849a-8b46fdeafe76 (ip-172-30-2-145.mesosphere.io) started on 
172.30.2.145:48816
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789216  2406 master.cpp:385] Flags 
at startup: --acls="" --allocation_interval="1secs" 
--allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" 
--authenticate_http_frameworks="true" --authenticate_slaves="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/vepf2X/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_slave_ping_timeouts="5" --quiet="false" 
--recovery_slave_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" 
--slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
--user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/vepf2X/master" 
--zk_session_timeout="10secs"
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789342  2406 master.cpp:434] Master 
only allowing authenticated frameworks to register
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789348  2406 master.cpp:440] Master 
only allowing authenticated agents to register
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789351  2406 master.cpp:446] Master 
only allowing authenticated HTTP frameworks to register
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789355  2406 credentials.hpp:37] 
Loading credentials for authentication from '/tmp/vepf2X/credentials'
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789466  2406 master.cpp:490] Using 
default 'crammd5' authenticator
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789504  2406 master.cpp:561] Using 
default 'basic' HTTP authenticator
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789540  2406 master.cpp:641] Using 
default 'basic' HTTP framework authenticator
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789599  2406 master.cpp:688] 
Authorization enabled
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789669  2402 hierarchical.cpp:142] 
Initialized hierarchical allocator process
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789691  2407 
whitelist_watcher.cpp:77] No whitelist given
[14:38:12]W:   [Step 10/10] I0509 14:38:12.790190  2403 leveldb.cpp:304] 
Persisting metadata (8 bytes) to leveldb took 1.259226ms
[14:38:12]W:   [Step 10/10] I0509 14:38:12.790207  2403 replica.cpp:320] 
Persisted replica status to STARTING
[14:38:12]W:   [Step 10/10] I0509 14:38:12.790297  2406 master.cpp:1939] The 
newly elected leader is master@172.30.2.145:48816 with id 
6c04237d-91d6-4a05-849a-8b46fdeafe76
[14:38:12]W:   [Step 10/10]

[jira] [Created] (MESOS-5350) Add asynchronous hook for validating docker containerizer tasks

2016-05-09 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5350:


 Summary: Add asynchronous hook for validating docker containerizer 
tasks
 Key: MESOS-5350
 URL: https://issues.apache.org/jira/browse/MESOS-5350
 Project: Mesos
  Issue Type: Improvement
  Components: docker, modules
Reporter: Joseph Wu
Assignee: Joseph Wu
Priority: Minor


It is possible to plug in custom validation logic for the MesosContainerizer 
via an {{Isolator}} module, but the same is not true of the DockerContainerizer.

Basic logic can be plugged into the DockerContainerizer via {{Hooks}}, but this 
has some notable differences compared to isolators:
* Hooks are synchronous.
* Modifications to tasks via Hooks have lower priority compared to the task 
itself.  i.e. If both the {{TaskInfo}} and 
{{slaveExecutorEnvironmentDecorator}} define the same environment variable, the 
{{TaskInfo}} wins.
* Hooks have no effect if they fail (short of segfaulting)
i.e. The {{slavePreLaunchDockerHook}} has a return type of {{Try}}:
https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/include/mesos/hook.hpp#L90
But the effect of returning an {{Error}} is a log message:
https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/hook/manager.cpp#L227-L230

We should add a hook to the DockerContainerizer to narrow this gap.  This new 
hook would:
* Be called at roughly the same place as {{slavePreLaunchDockerHook}}
https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/slave/containerizer/docker.cpp#L1022
* Return a {{Future}} and require splitting up {{DockerContainerizer::launch}}.
* Prevent a task from launching if it returns a {{Failure}}.
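
A hypothetical sketch of the shape such a hook could take; the hook name and the way the containerizer would consume the returned future are illustrative assumptions, not the final interface:

{code}
// Hypothetical: a validation hook that completes asynchronously. Unlike
// slavePreLaunchDockerHook, a returned Failure would abort the launch
// rather than only being logged.
#include <mesos/mesos.hpp>

#include <process/future.hpp>

#include <stout/nothing.hpp>

process::Future<Nothing> slavePreLaunchDockerValidatorHook(
    const mesos::TaskInfo& task)
{
  if (!task.has_container()) {
    return process::Failure("Task has no ContainerInfo");
  }

  return Nothing();
}

// The (split up) DockerContainerizerProcess::launch would then chain on the
// returned future, e.g.:
//
//   slavePreLaunchDockerValidatorHook(task)
//     .then(defer(self(), &Self::_launch, containerId, taskInfo, ...));
//
// so a Failure propagates to the caller and the container is never started.
{code}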



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276992#comment-15276992
 ] 

Joseph Wu commented on MESOS-5342:
--

Ideally (and especially for new contributors), you should find a shepherd 
_before_ starting work on an issue, which will save you time in the long run.

I would recommend taking some time and reading some of our contribution guides:
* http://mesos.apache.org/documentation/latest/c++-style-guide/
* http://mesos.apache.org/documentation/latest/submitting-a-patch/
* http://mesos.apache.org/documentation/latest/testing-patterns/

It would also help to have a design document that describes the goal and some 
implementation decisions you've made.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.

2016-05-09 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3926:
-
Sprint:   (was: Mesosphere Sprint 35)

> Modularize URI fetcher plugin interface.  
> --
>
> Key: MESOS-3926
> URL: https://issues.apache.org/jira/browse/MESOS-3926
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Jie Yu
>Assignee: Shuai Lin
>  Labels: fetcher, mesosphere, module
>
> So that we can add custom URI fetcher plugins using modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5260) Extend the uri::Fetcher::Plugin interface to include a "fetchSize"

2016-05-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5260:
-
Sprint:   (was: Mesosphere Sprint 34)

> Extend the uri::Fetcher::Plugin interface to include a "fetchSize"
> --
>
> Key: MESOS-5260
> URL: https://issues.apache.org/jira/browse/MESOS-5260
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: fetcher, mesosphere
>
> In order to replace the {{mesos-fetcher}} binary with the {{uri::Fetcher}}, 
> each plugin must be able to determine/estimate the size of a download.  This 
> is used by the Fetcher cache when it creates cache entries and such.
> The logic for each of the four {{Fetcher::Plugin}}s can be taken and 
> refactored from the existing fetcher.
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/src/slave/containerizer/fetcher.cpp#L267
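
For the HTTP(S) plugin, one plausible way to estimate a download's size is a HEAD request that reads Content-Length, similar to what the existing fetcher linked above does. This is an illustrative sketch with raw libcurl, not the proposed plugin interface:

{code}
// Sketch: estimate the size of an HTTP(S) artifact via a HEAD request.
#include <curl/curl.h>

#include <iostream>

// Returns the reported size in bytes, or -1 if unknown or on failure.
long long fetchSize(const char* url)
{
  CURL* curl = curl_easy_init();
  if (curl == nullptr) {
    return -1;
  }

  curl_easy_setopt(curl, CURLOPT_URL, url);
  curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);          // HEAD: headers only
  curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow redirects

  double length = -1;
  if (curl_easy_perform(curl) == CURLE_OK) {
    curl_easy_getinfo(curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD, &length);
  }

  curl_easy_cleanup(curl);
  return static_cast<long long>(length);
}

int main()
{
  std::cout << fetchSize("https://example.com/artifact.tar.gz") << std::endl;
  return 0;
}
{code}

Servers that omit Content-Length would still need a fallback, which is presumably why the ticket says "determine/estimate".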



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5261) Combine the internal::slave::Fetcher class and mesos-fetcher binary

2016-05-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5261:
-
Sprint:   (was: Mesosphere Sprint 34)

> Combine the internal::slave::Fetcher class and mesos-fetcher binary
> ---
>
> Key: MESOS-5261
> URL: https://issues.apache.org/jira/browse/MESOS-5261
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: fetcher, mesosphere
>
> After [MESOS-5259], the {{mesos-fetcher}} will no longer need to be a 
> separate binary and can be safely folded back into the agent process.  (It 
> was a separate binary because libcurl has synchronous/blocking calls.)  
> This will likely mean:
> * A change to the {{fetch}} continuation chain:
>   
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/src/slave/containerizer/fetcher.cpp#L315
> * This protobuf can be deprecated (or just removed):
>   
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/include/mesos/fetcher/fetcher.proto
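
The parenthetical above is the key constraint: libcurl calls block. As a generic illustration (not the actual agent code, which would use libprocess primitives), a blocking download can be pushed onto another thread so the calling process stays responsive:

{code}
// Illustration only: run a blocking download off the main thread and get
// a future back, so folding the fetcher into the agent need not block it.
#include <future>
#include <iostream>
#include <string>

// Stand-in for a blocking, libcurl-based download.
bool blockingDownload(const std::string& uri, const std::string& path)
{
  // ... curl_easy_perform() and writing to 'path' would go here ...
  return true;
}

int main()
{
  std::future<bool> result = std::async(
      std::launch::async,
      blockingDownload,
      std::string("https://example.com/artifact"),
      std::string("/tmp/artifact"));

  // The caller can keep doing other work and collect the result later.
  std::cout << "download ok: " << result.get() << std::endl;

  return 0;
}
{code}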



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3918) Unified and pluggable URI fetching support.

2016-05-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3918:
-
Labels: mesosphere twitter  (was: twitter)

> Unified and pluggable URI fetching support.
> ---
>
> Key: MESOS-3918
> URL: https://issues.apache.org/jira/browse/MESOS-3918
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere, twitter
>
> Fetcher was originally designed to fetch CommandInfo::URIs (e.g., executor 
> binary) for executors/tasks. A recent refactor (MESOS-336) added caching 
> support to the fetcher. The recent work on filesystem isolation/unified 
> containerizer (MESOS-2840) requires Mesos to fetch filesystem images (e.g., 
> APPC/DOCKER images) as well. The natural question is: can we leverage the 
> fetcher to fetch those filesystem images (and cache them accordingly)? 
> Unfortunately, the existing fetcher interface is tightly coupled with 
> CommandInfo::URIs for executors/tasks, making it very hard to reuse to 
> fetch/cache filesystem images.
> Another motivation for the refactor is that we want to extend the fetcher to 
> support more types of schemes. For instance, we want to support magnet URI to 
> enable p2p fetching. This is in fact quite important for operating a large 
> cluster (MESOS-3596). The goal here is to allow fetcher to be extended (e.g., 
> using modules) so that operators can add custom fetching support.
> The main design goal here is to decouple artifacts fetching from artifacts 
> cache management. We can make artifacts fetching extensible (e.g. to support 
> p2p fetching), and solve the cache management part later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5259) Refactor the mesos-fetcher binary to use the uri::Fetcher as a backend

2016-05-02 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5259:
-
Sprint:   (was: Mesosphere Sprint 34)

> Refactor the mesos-fetcher binary to use the uri::Fetcher as a backend
> --
>
> Key: MESOS-5259
> URL: https://issues.apache.org/jira/browse/MESOS-5259
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: fetcher, mesosphere
>
> This is an intermediate step for combining the {{mesos-fetcher}} binary and 
> {{uri::Fetcher}}.  
> The {{download}} method should be replaced with {{uri::Fetcher::fetch}}.
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/src/launcher/fetcher.cpp#L179
> Combining the two will:
> * Attach the {{uri::Fetcher}} to the existing Fetcher caching logic.
> * Remove some code duplication for downloading URIs.
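
A sketch of what the replacement might look like; the {{fetch(uri, directory)}} signature and header path here are my assumptions about the {{uri::Fetcher}} interface this ticket refers to, so treat them as unverified:

{code}
// Assumed interface (unverified): uri::Fetcher::fetch() downloads the URI
// into a directory and completes the returned future when done.
#include <string>

#include <mesos/uri/fetcher.hpp>

#include <process/future.hpp>
#include <process/owned.hpp>

#include <stout/nothing.hpp>

process::Future<Nothing> download(
    const process::Owned<mesos::uri::Fetcher>& fetcher,
    const mesos::URI& uri,
    const std::string& sandboxDirectory)
{
  // Replaces the blocking download() helper in launcher/fetcher.cpp
  // referenced above.
  return fetcher->fetch(uri, sandboxDirectory);
}
{code}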



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5304) /metrics/snapshot endpoint help disappeared on agent.

2016-04-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263286#comment-15263286
 ] 

Joseph Wu commented on MESOS-5304:
--

Review: https://reviews.apache.org/r/46806/

> /metrics/snapshot endpoint help disappeared on agent.
> -
>
> Key: MESOS-5304
> URL: https://issues.apache.org/jira/browse/MESOS-5304
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> After 
> https://github.com/apache/mesos/commit/066fc4bd0df6690a5e1a929d3836e307c1e22586
> the help for the /metrics/snapshot endpoint on the agent doesn't appear 
> anymore (Master endpoint help is unchanged).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5304) /metrics/snapshot endpoint help disappeared on agent.

2016-04-28 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5304:
-
Shepherd: Joris Van Remoortere

> /metrics/snapshot endpoint help disappeared on agent.
> -
>
> Key: MESOS-5304
> URL: https://issues.apache.org/jira/browse/MESOS-5304
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> After 
> https://github.com/apache/mesos/commit/066fc4bd0df6690a5e1a929d3836e307c1e22586
> the help for the /metrics/snapshot endpoint on the agent doesn't appear 
> anymore (Master endpoint help is unchanged).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5304) /metrics/snapshot endpoint help disappeared on agent.

2016-04-28 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5304:
-
  Sprint: Mesosphere Sprint 34
Story Points: 1
  Labels: mesosphere  (was: )

The authentication realm change moved the {{metrics::initialize}} method too 
far up in {{process::initialize}}.

I will fix this and add some comments to illustrate the order of process 
initialization to help prevent similar errors in future.

> /metrics/snapshot endpoint help disappeared on agent.
> -
>
> Key: MESOS-5304
> URL: https://issues.apache.org/jira/browse/MESOS-5304
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> After 
> https://github.com/apache/mesos/commit/066fc4bd0df6690a5e1a929d3836e307c1e22586
> the help for the /metrics/snapshot endpoint on the agent doesn't appear 
> anymore (Master endpoint help is unchanged).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5304) /metrics/snapshot endpoint help disappeared on agent.

2016-04-28 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-5304:


Assignee: Joseph Wu  (was: Greg Mann)

> /metrics/snapshot endpoint help disappeared on agent.
> -
>
> Key: MESOS-5304
> URL: https://issues.apache.org/jira/browse/MESOS-5304
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Joseph Wu
>
> After 
> https://github.com/apache/mesos/commit/066fc4bd0df6690a5e1a929d3836e307c1e22586
> the help for the /metrics/snapshot endpoint on the agent doesn't appear 
> anymore (Master endpoint help is unchanged).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-27 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261107#comment-15261107
 ] 

Joseph Wu commented on MESOS-5294:
--

AFAIK, the function of Mesos-DNS is not dependent on the health of the task.  
It simply looks at the master's {{/state}} endpoint and creates DNS records for 
each master/agent/framework/task.  

Can you include your Marathon app definition?  (And, perhaps, the result of 
{{master/state}}, filtered down to the task in question.)

> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> Digging through src/docker/executor.cpp, I found that in the 
> "taskHealthUpdated()" function is attempting to copy the taskID to the new 
> status instance with "status.mutable_task_id()->CopyFrom(taskID);", but other 
> instances of status updates have a similar line 
> "status.mutable_task_id()->CopyFrom(taskID.get());".
> My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.
> I'll try to get a patch together soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate

2016-04-25 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257308#comment-15257308
 ] 

Joseph Wu commented on MESOS-4760:
--

Most of the caching logic will remain the same.  I'll be changing parts of 
{{FetcherProcess::run}} and a tiny bit of logic in {{FetcherProcess::fetch}}.

It should be safe to add metrics in parallel, up until you need the injected 
fetcher object.

> Expose metrics and gauges for fetcher cache usage and hit rate
> --
>
> Key: MESOS-4760
> URL: https://issues.apache.org/jira/browse/MESOS-4760
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher, statistics
>Reporter: Michael Browning
>Assignee: Michael Browning
>Priority: Minor
>  Labels: features, fetcher, statistics, uber
>
> To evaluate the fetcher cache and calibrate the value of the 
> fetcher_cache_size flag, it would be useful to have metrics and gauges on 
> agents that expose operational statistics like cache hit rate, occupied cache 
> size, and time spent downloading resources that were not present.
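
For reference, a rough sketch of what such agent metrics could look like with libprocess metrics; the metric names are made up for illustration and are not the eventual ones:

{code}
// Sketch: counters for fetcher cache hits/misses, exposed via the agent's
// /metrics/snapshot endpoint once added to the metrics registry.
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

struct FetcherCacheMetrics
{
  FetcherCacheMetrics()
    : hits("slave/fetcher/cache_hits"),       // illustrative names
      misses("slave/fetcher/cache_misses")
  {
    process::metrics::add(hits);
    process::metrics::add(misses);
  }

  ~FetcherCacheMetrics()
  {
    process::metrics::remove(hits);
    process::metrics::remove(misses);
  }

  process::metrics::Counter hits;
  process::metrics::Counter misses;
};

// In FetcherProcess::fetch(): ++metrics.hits on a cache hit, ++metrics.misses
// otherwise; hit rate is then hits / (hits + misses) on the endpoint.
{code}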



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate

2016-04-25 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256821#comment-15256821
 ] 

Joseph Wu commented on MESOS-4760:
--

It's somewhat likely that the fetcher unification (with the {{URI::Fetcher}}) 
will make the fetcher into an injectable object.  Added this issue to the Epic 
[MESOS-3918].

> Expose metrics and gauges for fetcher cache usage and hit rate
> --
>
> Key: MESOS-4760
> URL: https://issues.apache.org/jira/browse/MESOS-4760
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher, statistics
>Reporter: Michael Browning
>Assignee: Michael Browning
>Priority: Minor
>  Labels: features, fetcher, statistics, uber
>
> To evaluate the fetcher cache and calibrate the value of the 
> fetcher_cache_size flag, it would be useful to have metrics and gauges on 
> agents that expose operational statistics like cache hit rate, occupied cache 
> size, and time spent downloading resources that were not present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate

2016-04-25 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4760:
-
Story Points: 2

> Expose metrics and gauges for fetcher cache usage and hit rate
> --
>
> Key: MESOS-4760
> URL: https://issues.apache.org/jira/browse/MESOS-4760
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher, statistics
>Reporter: Michael Browning
>Assignee: Michael Browning
>Priority: Minor
>  Labels: features, fetcher, statistics, uber
>
> To evaluate the fetcher cache and calibrate the value of the 
> fetcher_cache_size flag, it would be useful to have metrics and gauges on 
> agents that expose operational statistics like cache hit rate, occupied cache 
> size, and time spent downloading resources that were not present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5259) Refactor the mesos-fetcher binary to use the uri::Fetcher as a backend

2016-04-25 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5259:
-
Sprint: Mesosphere Sprint 34

> Refactor the mesos-fetcher binary to use the uri::Fetcher as a backend
> --
>
> Key: MESOS-5259
> URL: https://issues.apache.org/jira/browse/MESOS-5259
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: fetcher, mesosphere
>
> This is an intermediate step for combining the {{mesos-fetcher}} binary and 
> {{uri::Fetcher}}.  
> The {{download}} method should be replaced with {{uri::Fetcher::fetch}}.
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/src/launcher/fetcher.cpp#L179
> Combining the two will:
> * Attach the {{uri::Fetcher}} to the existing Fetcher caching logic.
> * Remove some code duplication for downloading URIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5260) Extend the uri::Fetcher::Plugin interface to include a "fetchSize"

2016-04-25 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5260:
-
Sprint: Mesosphere Sprint 34

> Extend the uri::Fetcher::Plugin interface to include a "fetchSize"
> --
>
> Key: MESOS-5260
> URL: https://issues.apache.org/jira/browse/MESOS-5260
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: fetcher, mesosphere
>
> In order to replace the {{mesos-fetcher}} binary with the {{uri::Fetcher}}, 
> each plugin must be able to determine/estimate the size of a download.  This 
> is used by the Fetcher cache when it creates cache entries and such.
> The logic for each of the four {{Fetcher::Plugin}}s can be taken and 
> refactored from the existing fetcher.
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/src/slave/containerizer/fetcher.cpp#L267



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5261) Combine the internal::slave::Fetcher class and mesos-fetcher binary

2016-04-25 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5261:
-
Sprint: Mesosphere Sprint 34

> Combine the internal::slave::Fetcher class and mesos-fetcher binary
> ---
>
> Key: MESOS-5261
> URL: https://issues.apache.org/jira/browse/MESOS-5261
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: fetcher, mesosphere
>
> After [MESOS-5259], the {{mesos-fetcher}} will no longer need to be a 
> separate binary and can be safely folded back into the agent process.  (It 
> was a separate binary because libcurl has synchronous/blocking calls.)  
> This will likely mean:
> * A change to the {{fetch}} continuation chain:
>   
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/src/slave/containerizer/fetcher.cpp#L315
> * This protobuf can be deprecated (or just removed):
>   
> https://github.com/apache/mesos/blob/653eca74f1080f5f55cd5092423506163e65d402/include/mesos/fetcher/fetcher.proto



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.

2016-04-25 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3926:
-
Sprint: Mesosphere Sprint 34

> Modularize URI fetcher plugin interface.  
> --
>
> Key: MESOS-3926
> URL: https://issues.apache.org/jira/browse/MESOS-3926
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Jie Yu
>Assignee: Shuai Lin
>  Labels: fetcher, mesosphere, module
>
> So that we can add custom URI fetcher plugins using modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

