[jira] [Created] (MESOS-1711) Create method for users to identify HDFS compatible protocols in fetcher.cpp

2014-08-18 Thread John Omernik (JIRA)
John Omernik created MESOS-1711:
---

 Summary: Create method for users to identify HDFS compatible 
protocols in fetcher.cpp
 Key: MESOS-1711
 URL: https://issues.apache.org/jira/browse/MESOS-1711
 Project: Mesos
  Issue Type: Improvement
  Components: general
Affects Versions: 0.19.1
 Environment: All
Reporter: John Omernik
Priority: Minor
 Fix For: 0.21.0


In fetcher.cpp, the code that fetches Mesos packages uses a hard-coded list of 
protocols to determine whether the Hadoop copyToLocal method is used or another 
method (such as a standard file copy). This means new HDFS-compatible protocols 
cannot be added until the next Mesos release. Tachyon Filesystem 
(tachyonfs://), MapR FS (maprfs://), and glusterfs:// are three examples that 
could make use of this. 

Instead of just adding those file systems to the hard-coded list, I recommend 
following the lead of the Tachyon project. In tachyon-0.6.0-SNAPSHOT, they added 
an environment variable of allowed HDFS-compatible protocols. This 
comma-separated list lets the user/admin specify which protocols are HDFS 
compatible without hard-coding them in fetcher.cpp.

I don't have access to the Tachyon issues list for linking, but the code is on 
line 75 of 

https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/UnderFileSystem.java
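
A minimal sketch of what that could look like in fetcher.cpp. The environment 
variable name (MESOS_HDFS_COMPATIBLE_PROTOCOLS) and the helper functions below 
are hypothetical, not anything that exists today; the default scheme set is 
roughly what the fetcher currently hard-codes:

{code}
// Sketch only: MESOS_HDFS_COMPATIBLE_PROTOCOLS and these helpers are
// hypothetical, not part of the current fetcher.cpp.
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>

// Returns the set of URI schemes that should be fetched via the Hadoop
// client, seeded with the built-in defaults and extended from the environment.
std::set<std::string> hdfsCompatibleSchemes()
{
  std::set<std::string> schemes = {"hdfs", "hftp", "s3", "s3n"};

  const char* extra = std::getenv("MESOS_HDFS_COMPATIBLE_PROTOCOLS");
  if (extra != nullptr) {
    std::stringstream ss(extra);
    std::string scheme;
    while (std::getline(ss, scheme, ',')) {
      if (!scheme.empty()) {
        schemes.insert(scheme);
      }
    }
  }
  return schemes;
}

// True if the URI should be copied to the sandbox via 'hadoop fs -copyToLocal'.
bool useHadoopFetch(const std::string& uri)
{
  const std::string::size_type pos = uri.find("://");
  if (pos == std::string::npos) {
    return false; // No scheme; treat as a local path.
  }
  return hdfsCompatibleSchemes().count(uri.substr(0, pos)) > 0;
}
{code}

With this, an admin could set MESOS_HDFS_COMPATIBLE_PROTOCOLS=maprfs,tachyonfs,glusterfs 
without waiting for a Mesos release.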







[jira] [Commented] (MESOS-1711) Create method for users to identify HDFS compatible protocols in fetcher.cpp

2014-08-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100728#comment-14100728
 ] 

Timothy St. Clair commented on MESOS-1711:
--

I'm game for this one. 

 Create method for users to identify HDFS compatible protocols in fetcher.cpp
 

 Key: MESOS-1711
 URL: https://issues.apache.org/jira/browse/MESOS-1711
 Project: Mesos
  Issue Type: Improvement
  Components: general
Affects Versions: 0.19.1
 Environment: All
Reporter: John Omernik
Priority: Minor
  Labels: fetecher, hadoop, hdfs
 Fix For: 0.21.0

   Original Estimate: 6h
  Remaining Estimate: 6h

 In fetcher.cpp, the code that fetches Mesos packages uses a hard-coded list of 
 protocols to determine whether the Hadoop copyToLocal method is used or another 
 method (such as a standard file copy). This means new HDFS-compatible 
 protocols cannot be added until the next Mesos release. Tachyon 
 Filesystem (tachyonfs://), MapR FS (maprfs://), and glusterfs:// are three 
 examples that could make use of this. 
 Instead of just adding those file systems to the hard-coded list, I recommend 
 following the lead of the Tachyon project. In tachyon-0.6.0-SNAPSHOT, they 
 added an environment variable of allowed HDFS-compatible protocols. This 
 comma-separated list lets the user/admin specify which protocols are 
 HDFS compatible without hard-coding them in fetcher.cpp.
 I don't have access to the Tachyon issues list for linking, but the code is 
 on line 75 of 
 https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/UnderFileSystem.java





[jira] [Updated] (MESOS-1711) Create method for users to identify HDFS compatible protocols in fetcher.cpp

2014-08-18 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1711:
--

 Shepherd: Adam B
Fix Version/s: (was: 0.21.0)
   Labels: fetcher hadoop hdfs  (was: fetecher hadoop hdfs)

 Create method for users to identify HDFS compatible protocols in fetcher.cpp
 

 Key: MESOS-1711
 URL: https://issues.apache.org/jira/browse/MESOS-1711
 Project: Mesos
  Issue Type: Improvement
  Components: general
Affects Versions: 0.19.1
 Environment: All
Reporter: John Omernik
Assignee: Timothy St. Clair
Priority: Minor
  Labels: fetcher, hadoop, hdfs
   Original Estimate: 6h
  Remaining Estimate: 6h

 In fetcher.cpp, the code that fetches Mesos packages uses a hard-coded list of 
 protocols to determine whether the Hadoop copyToLocal method is used or another 
 method (such as a standard file copy). This means new HDFS-compatible 
 protocols cannot be added until the next Mesos release. Tachyon 
 Filesystem (tachyonfs://), MapR FS (maprfs://), and glusterfs:// are three 
 examples that could make use of this. 
 Instead of just adding those file systems to the hard-coded list, I recommend 
 following the lead of the Tachyon project. In tachyon-0.6.0-SNAPSHOT, they 
 added an environment variable of allowed HDFS-compatible protocols. This 
 comma-separated list lets the user/admin specify which protocols are 
 HDFS compatible without hard-coding them in fetcher.cpp.
 I don't have access to the Tachyon issues list for linking, but the code is 
 on line 75 of 
 https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/UnderFileSystem.java





[jira] [Updated] (MESOS-871) GroupTest.RetryableErrors is flaky

2014-08-18 Thread Brenden Matthews (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brenden Matthews updated MESOS-871:
---

Description: 
 [ RUN  ] GroupTest.RetryableErrors
I1205 18:30:58.603236 20390 zookeeper_test_server.cpp:113] Started 
ZooKeeperTestServer on port 40811
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@658: Client 
environment:zookeeper.version=zookeeper C client 3.3.4
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@662: Client 
environment:host.name=localhost.localdomain
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@669: Client 
environment:os.name=Linux
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@670: Client 
environment:os.arch=3.11.9-200.fc19.x86_64
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@671: Client 
environment:os.version=#1 SMP Wed Nov 20 21:22:24 UTC 2013
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@log_env@679: Client 
environment:user.name=jenkins
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@log_env@687: Client 
environment:user.home=/home/jenkins
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@log_env@699: Client 
environment:user.dir=/var/jenkins/workspace/mesos-fedora-19-gcc/src
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@zookeeper_init@727: 
Initiating client connection, host=127.0.0.1:40811 sessionTimeout=5000 
watcher=0x7f45138fa6a0 sessionId=0 sessionPasswd=null context=0x7f44fc19ae70 
flags=0
2013-12-05 18:30:58,608:20390(0x7f44df2d5700):ZOO_INFO@check_events@1585: 
initiated connection to server [127.0.0.1:40811]
2013-12-05 18:30:58,614:20390(0x7f44df2d5700):ZOO_INFO@check_events@1632: 
session establishment complete on server [127.0.0.1:40811], 
sessionId=0x142c5be6528, negotiated timeout=6000
I1205 18:30:58.616745 20411 group.cpp:280] Group process 
((1488)@127.0.0.1:59677) connected to ZooKeeper
I1205 18:30:58.616773 20411 group.cpp:675] Syncing group operations: queue size 
(joins, cancels, datas) = (0, 0, 0)
I1205 18:30:58.616780 20411 group.cpp:313] Authenticating with ZooKeeper using 
digest
2013-12-05 
18:30:59,587:20390(0x7f44e1ada700):ZOO_ERROR@handle_socket_error_msg@1579: 
Socket [127.0.0.1:45593] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2013-12-05 
18:31:00,610:20390(0x7f44df2d5700):ZOO_INFO@auth_completion_func@1198: 
Authentication scheme digest succeeded
I1205 18:31:00.616611 20411 group.cpp:337] Trying to create path '/test' in 
ZooKeeper
2013-12-05 
18:31:00,628:20390(0x7f44df2d5700):ZOO_ERROR@handle_socket_error_msg@1592: 
Socket [127.0.0.1:40811] zk retcode=-4, errno=32(Broken pipe): failed while 
flushing send queue
I1205 18:31:00.633744 20416 group.cpp:366] Lost connection to ZooKeeper, 
attempting to reconnect ...
2013-12-05 18:31:02,635:20390(0x7f44df2d5700):ZOO_INFO@check_events@1585: 
initiated connection to server [127.0.0.1:40811]
2013-12-05 
18:31:02,637:20390(0x7f44df2d5700):ZOO_ERROR@handle_socket_error_msg@1621: 
Socket [127.0.0.1:40811] zk retcode=-112, errno=116(Stale file handle): 
sessionId=0x142c5be6528 has expired.
2013-12-05 18:31:02,638:20390(0x7f4510398700):ZOO_INFO@zookeeper_close@2321: 
Freeing zookeeper resources for sessionId=0x142c5be6528

2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@658: Client 
environment:zookeeper.version=zookeeper C client 3.3.4
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@662: Client 
environment:host.name=localhost.localdomain
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@669: Client 
environment:os.name=Linux
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@670: Client 
environment:os.arch=3.11.9-200.fc19.x86_64
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@671: Client 
environment:os.version=#1 SMP Wed Nov 20 21:22:24 UTC 2013
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@679: Client 
environment:user.name=jenkins
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@687: Client 
environment:user.home=/home/jenkins
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@699: Client 
environment:user.dir=/var/jenkins/workspace/mesos-fedora-19-gcc/src
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@zookeeper_init@727: 
Initiating client connection, host=127.0.0.1:40811 sessionTimeout=5000 
watcher=0x7f45138fa6a0 sessionId=0 sessionPasswd=null context=0x7f4508138080 
flags=0
2013-12-05 18:31:02,642:20390(0x7f44e05d3700):ZOO_INFO@check_events@1585: 
initiated connection to server [127.0.0.1:40811]
2013-12-05 18:31:02,648:20390(0x7f44e05d3700):ZOO_INFO@check_events@1632: 
session establishment complete on server [127.0.0.1:40811], 
sessionId=0x142c5be65280001, negotiated timeout=6000
I1205 18:31:02.648991 20411 group.cpp:280] Group process 
((1488)@127.0.0.1:59677) connected to ZooKeeper
I1205 18:31:02.649021 20411 

[jira] [Created] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1714:
--

 Summary: The C++ 'Resources' abstraction should keep the 
underlying resources flattened.
 Key: MESOS-1714
 URL: https://issues.apache.org/jira/browse/MESOS-1714
 Project: Mesos
  Issue Type: Bug
  Components: c++ api
Reporter: Benjamin Mahler


Currently, the C++ Resources class does not ensure that the underlying 
Resources protobufs are kept flat.

This is an issue because some of the methods, e.g. 
[Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269],
 assume the resources are flat.

There is code that constructs unflattened resources, e.g. 
[Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353].
 We could prevent this type of construction; however, it is perfectly fine if we 
ensure the C++ 'Resources' class performs flattening.
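
For illustration only, a small self-contained sketch of what "flattened" means 
here, using a simplified stand-in for the Resource protobuf rather than the real 
mesos::Resource/Resources API: at most one entry per (name, role) pair, with 
scalar values summed, which is what lookups like Resources::get implicitly 
assume.

{code}
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a scalar mesos::Resource; real Resources also
// carry ranges, sets, and a type tag.
struct ScalarResource
{
  std::string name;  // e.g. "cpus", "mem"
  std::string role;  // e.g. "*", "prod"
  double value;
};

// Flatten: at most one entry per (name, role), with values summed, so that
// lookups can assume a single matching entry.
std::vector<ScalarResource> flatten(const std::vector<ScalarResource>& input)
{
  std::map<std::pair<std::string, std::string>, double> totals;
  for (const ScalarResource& r : input) {
    totals[{r.name, r.role}] += r.value;
  }

  std::vector<ScalarResource> output;
  for (const auto& entry : totals) {
    output.push_back({entry.first.first, entry.first.second, entry.second});
  }
  return output;
}

// Example: {"cpus", "*", 1.0} and {"cpus", "*", 0.5} collapse into a single
// {"cpus", "*", 1.5} entry after flattening.
{code}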





[jira] [Commented] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?

2014-08-18 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101448#comment-14101448
 ] 

Jie Yu commented on MESOS-1574:
---

If you turn on the network isolator in 0.20.0, we will have isolation for the 
'ports' resource as well. So if a process uses a port that is not assigned to 
it, it can still bind that port, but it won't be able to use that port to 
communicate with others. That's because we install tc filters for each 
container and drop packets whose source port does not belong to the 
container.

 what to do when a rogue process binds to a port mesos didn't allocate to it?
 

 Key: MESOS-1574
 URL: https://issues.apache.org/jira/browse/MESOS-1574
 Project: Mesos
  Issue Type: Improvement
  Components: allocation, isolation
Reporter: Jay Buffington
Priority: Minor

 I recently had an issue where a slave had a process whose parent was init 
 that was bound to a port in the range that Mesos thought was a free resource. 
 I'm not sure if this is due to a bug in Mesos (it lost track of this process 
 during an upgrade?) or if a bad user started a process on the host manually, 
 outside of Mesos. The process is over a month old and I have no history in 
 Mesos to tell me if/when it launched the task :(
 If a rogue process binds to a port that mesos-slave has offered to the master 
 as an available resource, there should be some sort of reckoning. Mesos could:
* kill the rogue process
* rescind the offer for that port
* have an api that can be plugged into a monitoring system to alert humans 
 of this inconsistency





[jira] [Assigned] (MESOS-1713) Python framework test dies on OSX because of missing symbol

2014-08-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-1713:
-

Assignee: Jie Yu  (was: Timothy Chen)

 Python framework test dies on OSX because of missing symbol
 ---

 Key: MESOS-1713
 URL: https://issues.apache.org/jira/browse/MESOS-1713
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.20.0
Reporter: Thomas Rampelberg
Assignee: Jie Yu

 When building 0.20 on OSX, the build completes but during the test there is a 
 problem with the cgroup symbols:
 ImportError: 
 dlopen(/Users/jyu/.python-eggs/mesos.native-0.20.0-py2.6-macosx-10.4-x86_64.egg-tmp/mesos/native/_mesos.so,
  2): Symbol not found: __ZN7cgroups9hierarchyERKSs
   Referenced from: 
 /Users/jyu/.python-eggs/mesos.native-0.20.0-py2.6-macosx-10.4-x86_64.egg-tmp/mesos/native/_mesos.so





[jira] [Created] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1715:
--

 Summary: The slave does not send pending tasks during 
re-registration.
 Key: MESOS-1715
 URL: https://issues.apache.org/jira/browse/MESOS-1715
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


In what looks like an oversight, the pending tasks in the slave 
(Framework::pending) are not sent in the re-registration message.

This can lead to spurious TASK_LOST notifications being generated by the master 
when it falsely thinks the tasks are not present on the slave.





[jira] [Created] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1717:
--

 Summary: The slave does not show pending tasks in the JSON 
endpoints.
 Key: MESOS-1717
 URL: https://issues.apache.org/jira/browse/MESOS-1717
 Project: Mesos
  Issue Type: Bug
  Components: json api, slave
Reporter: Benjamin Mahler


The slave does not show pending tasks in the /state.json endpoint.

This is a bit tricky to add since we rely on knowing the executor directory.





[jira] [Resolved] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?

2014-08-18 Thread Jay Buffington (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Buffington resolved MESOS-1574.
---

Resolution: Won't Fix

This is largely addressed by the network isolator.  Thanks [~jieyu]

 what to do when a rogue process binds to a port mesos didn't allocate to it?
 

 Key: MESOS-1574
 URL: https://issues.apache.org/jira/browse/MESOS-1574
 Project: Mesos
  Issue Type: Improvement
  Components: allocation, isolation
Reporter: Jay Buffington
Priority: Minor

 I recently had an issue where a slave had a process whose parent was init 
 that was bound to a port in the range that Mesos thought was a free resource. 
 I'm not sure if this is due to a bug in Mesos (it lost track of this process 
 during an upgrade?) or if a bad user started a process on the host manually, 
 outside of Mesos. The process is over a month old and I have no history in 
 Mesos to tell me if/when it launched the task :(
 If a rogue process binds to a port that mesos-slave has offered to the master 
 as an available resource, there should be some sort of reckoning. Mesos could:
* kill the rogue process
* rescind the offer for that port
* have an api that can be plugged into a monitoring system to alert humans 
 of this inconsistency





[jira] [Created] (MESOS-1718) Command executor can overcommit the slave.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1718:
--

 Summary: Command executor can overcommit the slave.
 Key: MESOS-1718
 URL: https://issues.apache.org/jira/browse/MESOS-1718
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


Currently we give a small amount of resources to the command executor, in 
addition to resources used by the command task:

https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
{code}
ExecutorInfo Slave::getExecutorInfo(
    const FrameworkID& frameworkId,
    const TaskInfo& task)
{
  ...
    // Add an allowance for the command executor. This does lead to a
    // small overcommit of resources.
    executor.mutable_resources()->MergeFrom(
        Resources::parse(
          "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
          "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
  ...
}
{code}

This leads to an overcommit of the slave. Ideally, for command tasks we can 
transfer all of the task resources to the executor at the slave / isolation 
level.
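
To make the overcommit concrete, a back-of-the-envelope sketch, assuming the 
usual command executor allowance of 0.1 cpus and 32MB (the exact constants live 
in the slave's defaults and may differ):

{code}
#include <iostream>

int main()
{
  // Assumed defaults for the command executor allowance (see lead-in above).
  const double DEFAULT_EXECUTOR_CPUS = 0.1;
  const double DEFAULT_EXECUTOR_MEM_MB = 32.0;

  // A command task that consumes its entire offer.
  const double taskCpus = 1.0;
  const double taskMemMb = 512.0;

  // The slave containerizes the executor with task + allowance resources,
  // so the container can exceed what was offered for the task.
  std::cout << "container cpus: " << taskCpus + DEFAULT_EXECUTOR_CPUS
            << " (offered " << taskCpus << ")\n"
            << "container mem:  " << taskMemMb + DEFAULT_EXECUTOR_MEM_MB
            << "MB (offered " << taskMemMb << "MB)\n";
  return 0;
}
{code}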





[jira] [Created] (MESOS-1719) Master should persist active frameworks information

2014-08-18 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-1719:
-

 Summary: Master should persist active frameworks information
 Key: MESOS-1719
 URL: https://issues.apache.org/jira/browse/MESOS-1719
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


https://issues.apache.org/jira/browse/MESOS-1219 disallows completed frameworks 
from re-registering with the same framework id, as long as the master doesn't 
fail over.

This ticket tracks the work to make this hold across master failover, using the 
registrar.

There are some open questions that need to be addressed:

-- Should the registry contain framework ids only, or framework infos as well?

For disallowing completed frameworks from re-registering, persisting 
framework ids is enough. But if, in the future, we want to disallow
frameworks from re-registering when some parts of the framework info
have changed, then we need to persist the info too.

-- How to update the framework info.
  Currently frameworks are allowed to update the framework info while
  re-registering, but it only takes effect on the master when the master fails 
  over and on the slave when the slave fails over. How should this 
  change when we persist the framework info?







[jira] [Created] (MESOS-1720) Slave should send exited executor message when the executor is never launched.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1720:
--

 Summary: Slave should send exited executor message when the 
executor is never launched.
 Key: MESOS-1720
 URL: https://issues.apache.org/jira/browse/MESOS-1720
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler


When the slave sends TASK_LOST before launching an executor for a task, the 
slave does not send an exited executor message to the master.

Since the master receives no exited executor message, it still thinks the 
executor's resources are consumed on the slave.

One possible fix for this would be to send the exited executor message to the 
master in these cases.
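
A self-contained sketch of that fix, using stand-in types and helper names 
rather than the slave's actual message-sending code paths:

{code}
#include <iostream>
#include <string>

// Stand-ins for the real Mesos protobuf types; illustrative only.
struct FrameworkID { std::string value; };
struct ExecutorID  { std::string value; };
struct TaskID      { std::string value; };

struct ExitedExecutorMessage
{
  FrameworkID frameworkId;
  ExecutorID executorId;
  int status;
};

// Stand-ins for the slave's messaging helpers.
void sendTaskLost(const FrameworkID& frameworkId, const TaskID& taskId)
{
  std::cout << "TASK_LOST for task " << taskId.value << "\n";
}

void sendToMaster(const ExitedExecutorMessage& message)
{
  std::cout << "ExitedExecutorMessage for executor "
            << message.executorId.value << "\n";
}

// The suggested fix: when a task is dropped before its executor was ever
// launched, also tell the master the executor is gone so the master releases
// the resources it thinks the executor is holding.
void dropTaskBeforeExecutorLaunch(
    const FrameworkID& frameworkId,
    const ExecutorID& executorId,
    const TaskID& taskId)
{
  // Existing behavior: the master learns the task is lost...
  sendTaskLost(frameworkId, taskId);

  // ...proposed addition: and also learns the executor will never run.
  ExitedExecutorMessage message{frameworkId, executorId, /*status=*/-1};
  sendToMaster(message);
}

int main()
{
  dropTaskBeforeExecutorLaunch({"fw-1"}, {"exec-1"}, {"task-1"});
  return 0;
}
{code}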





[jira] [Resolved] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.

2014-08-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone resolved MESOS-1219.
---

   Resolution: Fixed
Fix Version/s: 0.21.0

commit e9643f47720d8c5c9e6278ec068e771b94e94fbc
Author: Vinod Kone vi...@twitter.com
Date:   Fri Apr 18 15:31:59 2014 -0700

Fixed master to reject completed frameworks from re-registering.

Review: https://reviews.apache.org/r/20507


 Master should disallow frameworks that reconnect after failover timeout.
 

 Key: MESOS-1219
 URL: https://issues.apache.org/jira/browse/MESOS-1219
 Project: Mesos
  Issue Type: Bug
  Components: master, webui
Reporter: Robert Lacroix
Assignee: Vinod Kone
 Fix For: 0.21.0


 When a scheduler reconnects after the failover timeout has been exceeded, the 
 framework id is usually reused because the scheduler doesn't know that the 
 timeout was exceeded, and it is actually handled as a new framework.
 The /framework/:framework_id route of the Web UI doesn't handle those cases 
 very well because its key is reused. It only shows the terminated one.
 Would it make sense to ignore the provided framework id when a scheduler 
 reconnects to a terminated framework and generate a new id to make sure it's 
 unique?





[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-08-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101757#comment-14101757
 ] 

Benjamin Mahler commented on MESOS-1466:


We're going to proceed with a mitigation of this by rejecting tasks once the 
slave is overcommitted:
https://issues.apache.org/jira/browse/MESOS-1721

However, we would also like to ensure that this kind of race is not possible. 
One solution is to use master acknowledgments for executor exits:

(1) When an executor terminates (or the executor could not be launched: 
MESOS-1720), we send an exited executor message.
(2) The master acknowledges these messages.
(3) The slave will not accept tasks for unacknowledged terminal executors (this 
must include those executors that could not be launched, per MESOS-1720).

The result of this is that a new executor cannot be launched until the master 
is aware of the old executor exiting.
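
A condensed, self-contained sketch of that handshake on the slave side; the 
class and method names are illustrative, not the actual implementation:

{code}
#include <set>
#include <string>
#include <utility>

// Illustrative only: tracks executors whose exit has not yet been
// acknowledged by the master, keyed by (framework id, executor id).
class TerminatedExecutors
{
public:
  // Step (1): executor terminated (or could never be launched, MESOS-1720);
  // record it and send the exited executor message to the master.
  void onExecutorExited(const std::string& frameworkId,
                        const std::string& executorId)
  {
    unacknowledged.insert({frameworkId, executorId});
    // ... send exited executor message to the master ...
  }

  // Step (2): master acknowledged the exited executor message.
  void onAcknowledged(const std::string& frameworkId,
                      const std::string& executorId)
  {
    unacknowledged.erase({frameworkId, executorId});
  }

  // Step (3): refuse new tasks for an executor whose previous incarnation's
  // exit the master has not yet acknowledged.
  bool canLaunchTaskFor(const std::string& frameworkId,
                        const std::string& executorId) const
  {
    return unacknowledged.count({frameworkId, executorId}) == 0;
  }

private:
  std::set<std::pair<std::string, std::string>> unacknowledged;
};

int main()
{
  TerminatedExecutors tracker;
  tracker.onExecutorExited("fw-1", "exec-1");
  // Launching a new task for exec-1 is refused until the master acks.
  bool allowed = tracker.canLaunchTaskFor("fw-1", "exec-1");  // false
  tracker.onAcknowledged("fw-1", "exec-1");
  allowed = tracker.canLaunchTaskFor("fw-1", "exec-1");       // true
  return allowed ? 0 : 1;
}
{code}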

 Race between executor exited event and launch task can cause overcommit of 
 resources
 

 Key: MESOS-1466
 URL: https://issues.apache.org/jira/browse/MESOS-1466
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Vinod Kone
Assignee: Benjamin Mahler
  Labels: reliability

 The following sequence of events can cause an overcommit
 -- Launch task is called for a task whose executor is already running
 -- Executor's resources are not accounted for on the master
 -- Executor exits and the event is enqueued behind launch tasks on the master
 -- Master sends the task to the slave, which needs to commit resources for 
 the task and the (new) executor.
 -- Master processes the executor exited event and re-offers the executor's 
 resources, causing an overcommit of resources.





[jira] [Resolved] (MESOS-1713) Python framework test dies on OSX because of missing symbol

2014-08-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu resolved MESOS-1713.
---

   Resolution: Fixed
Fix Version/s: 0.20.0

 Python framework test dies on OSX because of missing symbol
 ---

 Key: MESOS-1713
 URL: https://issues.apache.org/jira/browse/MESOS-1713
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.20.0
Reporter: Thomas Rampelberg
Assignee: Jie Yu
 Fix For: 0.20.0


 When building 0.20 on OSX, the build completes but during the test there is a 
 problem with the cgroup symbols:
 ImportError: 
 dlopen(/Users/jyu/.python-eggs/mesos.native-0.20.0-py2.6-macosx-10.4-x86_64.egg-tmp/mesos/native/_mesos.so,
  2): Symbol not found: __ZN7cgroups9hierarchyERKSs
   Referenced from: 
 /Users/jyu/.python-eggs/mesos.native-0.20.0-py2.6-macosx-10.4-x86_64.egg-tmp/mesos/native/_mesos.so


