[jira] [Created] (MESOS-1711) Create method for users to identify HDFS compatible protocols in fetcher.cpp
John Omernik created MESOS-1711:
---

Summary: Create method for users to identify HDFS compatible protocols in fetcher.cpp
Key: MESOS-1711
URL: https://issues.apache.org/jira/browse/MESOS-1711
Project: Mesos
Issue Type: Improvement
Components: general
Affects Versions: 0.19.1
Environment: All
Reporter: John Omernik
Priority: Minor
Fix For: 0.21.0

In fetcher.cpp, the code that fetches the Mesos packages uses a hard-coded list of protocols to decide whether the Hadoop copyToLocal method or another method (such as a standard file copy) is used. As a result, new HDFS-compatible protocols cannot be added until the next release of Mesos. Tachyon Filesystem (tachyonfs://), MapR FS (maprfs://) and glusterfs:// are three examples that could make use of this.

Instead of just adding those file systems to the hard-coded list, I recommend following the lead of the Tachyon Project. In tachyon-0.6.0-SNAPSHOT, they have added an environment variable that lists the allowed HDFS-compatible protocols. This comma-separated list lets the user/admin specify which protocols are HDFS compatible, without hard-coding them in fetcher.cpp.

I don't have access to the Tachyon issues list for linking, but the code is on line 75 of https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/UnderFileSystem.java

--
This message was sent by Atlassian JIRA (v6.2#6252)
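To make the proposal concrete, here is a minimal C++ sketch of reading a comma-separated protocol list and checking a URI against it. It is not the fetcher.cpp code. The environment variable name MESOS_HDFS_COMPATIBLE_PROTOCOLS, the function names, and the fallback to "hdfs" alone are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>

// Splits a comma-separated list such as "hdfs,tachyonfs,maprfs" into a set
// of scheme names, trimming spaces and skipping empty entries.
std::set<std::string> parseProtocols(const std::string& list)
{
  std::set<std::string> protocols;
  std::istringstream in(list);
  std::string item;
  while (std::getline(in, item, ',')) {
    const std::size_t first = item.find_first_not_of(" \t");
    if (first == std::string::npos) {
      continue; // Entry was empty or all whitespace.
    }
    const std::size_t last = item.find_last_not_of(" \t");
    protocols.insert(item.substr(first, last - first + 1));
  }
  return protocols;
}

// Returns true if the URI's scheme (the text before "://") is in the set.
bool isHdfsCompatible(
    const std::string& uri,
    const std::set<std::string>& protocols)
{
  const std::size_t pos = uri.find("://");
  if (pos == std::string::npos) {
    return false; // No scheme, e.g. a plain local path.
  }
  return protocols.count(uri.substr(0, pos)) > 0;
}

// Reads the (hypothetical) environment variable. The "hdfs" fallback is an
// assumption, not the list that fetcher.cpp actually hard-codes.
std::set<std::string> protocolsFromEnvironment()
{
  const char* value = std::getenv("MESOS_HDFS_COMPATIBLE_PROTOCOLS");
  return parseProtocols(value != nullptr ? value : "hdfs");
}
```

With something like this, an admin could export MESOS_HDFS_COMPATIBLE_PROTOCOLS=hdfs,tachyonfs,maprfs and the fetcher would route tachyonfs:// and maprfs:// URIs through the Hadoop client without a new Mesos release.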
[jira] [Commented] (MESOS-1711) Create method for users to identify HDFS compatible protocols in fetcher.cpp
[ https://issues.apache.org/jira/browse/MESOS-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100728#comment-14100728 ]

Timothy St. Clair commented on MESOS-1711:
--

I'm game for this one.

Create method for users to identify HDFS compatible protocols in fetcher.cpp

Key: MESOS-1711
URL: https://issues.apache.org/jira/browse/MESOS-1711
Project: Mesos
Issue Type: Improvement
Components: general
Affects Versions: 0.19.1
Environment: All
Reporter: John Omernik
Priority: Minor
Labels: fetecher, hadoop, hdfs
Fix For: 0.21.0
Original Estimate: 6h
Remaining Estimate: 6h
[jira] [Updated] (MESOS-1711) Create method for users to identify HDFS compatible protocols in fetcher.cpp
[ https://issues.apache.org/jira/browse/MESOS-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-1711:
--

Shepherd: Adam B
Fix Version/s: (was: 0.21.0)
Labels: fetcher hadoop hdfs (was: fetecher hadoop hdfs)

Create method for users to identify HDFS compatible protocols in fetcher.cpp

Key: MESOS-1711
URL: https://issues.apache.org/jira/browse/MESOS-1711
Project: Mesos
Issue Type: Improvement
Components: general
Affects Versions: 0.19.1
Environment: All
Reporter: John Omernik
Assignee: Timothy St. Clair
Priority: Minor
Labels: fetcher, hadoop, hdfs
Original Estimate: 6h
Remaining Estimate: 6h
[jira] [Updated] (MESOS-871) GroupTest.RetryableErrors is flaky
[ https://issues.apache.org/jira/browse/MESOS-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brenden Matthews updated MESOS-871:
---

Description:
[ RUN ] GroupTest.RetryableErrors
I1205 18:30:58.603236 20390 zookeeper_test_server.cpp:113] Started ZooKeeperTestServer on port 40811
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@662: Client environment:host.name=localhost.localdomain
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@670: Client environment:os.arch=3.11.9-200.fc19.x86_64
2013-12-05 18:30:58,604:20390(0x7f4512b9d700):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Wed Nov 20 21:22:24 UTC 2013
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@log_env@679: Client environment:user.name=jenkins
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@log_env@687: Client environment:user.home=/home/jenkins
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@log_env@699: Client environment:user.dir=/var/jenkins/workspace/mesos-fedora-19-gcc/src
2013-12-05 18:30:58,605:20390(0x7f4512b9d700):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=127.0.0.1:40811 sessionTimeout=5000 watcher=0x7f45138fa6a0 sessionId=0 sessionPasswd=null context=0x7f44fc19ae70 flags=0
2013-12-05 18:30:58,608:20390(0x7f44df2d5700):ZOO_INFO@check_events@1585: initiated connection to server [127.0.0.1:40811]
2013-12-05 18:30:58,614:20390(0x7f44df2d5700):ZOO_INFO@check_events@1632: session establishment complete on server [127.0.0.1:40811], sessionId=0x142c5be6528, negotiated timeout=6000
I1205 18:30:58.616745 20411 group.cpp:280] Group process ((1488)@127.0.0.1:59677) connected to ZooKeeper
I1205 18:30:58.616773 20411 group.cpp:675] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1205 18:30:58.616780 20411 group.cpp:313] Authenticating with ZooKeeper using digest
2013-12-05 18:30:59,587:20390(0x7f44e1ada700):ZOO_ERROR@handle_socket_error_msg@1579: Socket [127.0.0.1:45593] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2013-12-05 18:31:00,610:20390(0x7f44df2d5700):ZOO_INFO@auth_completion_func@1198: Authentication scheme digest succeeded
I1205 18:31:00.616611 20411 group.cpp:337] Trying to create path '/test' in ZooKeeper
2013-12-05 18:31:00,628:20390(0x7f44df2d5700):ZOO_ERROR@handle_socket_error_msg@1592: Socket [127.0.0.1:40811] zk retcode=-4, errno=32(Broken pipe): failed while flushing send queue
I1205 18:31:00.633744 20416 group.cpp:366] Lost connection to ZooKeeper, attempting to reconnect ...
2013-12-05 18:31:02,635:20390(0x7f44df2d5700):ZOO_INFO@check_events@1585: initiated connection to server [127.0.0.1:40811]
2013-12-05 18:31:02,637:20390(0x7f44df2d5700):ZOO_ERROR@handle_socket_error_msg@1621: Socket [127.0.0.1:40811] zk retcode=-112, errno=116(Stale file handle): sessionId=0x142c5be6528 has expired.
2013-12-05 18:31:02,638:20390(0x7f4510398700):ZOO_INFO@zookeeper_close@2321: Freeing zookeeper resources for sessionId=0x142c5be6528
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@662: Client environment:host.name=localhost.localdomain
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@670: Client environment:os.arch=3.11.9-200.fc19.x86_64
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Wed Nov 20 21:22:24 UTC 2013
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@679: Client environment:user.name=jenkins
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@687: Client environment:user.home=/home/jenkins
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@log_env@699: Client environment:user.dir=/var/jenkins/workspace/mesos-fedora-19-gcc/src
2013-12-05 18:31:02,639:20390(0x7f4510398700):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=127.0.0.1:40811 sessionTimeout=5000 watcher=0x7f45138fa6a0 sessionId=0 sessionPasswd=null context=0x7f4508138080 flags=0
2013-12-05 18:31:02,642:20390(0x7f44e05d3700):ZOO_INFO@check_events@1585: initiated connection to server [127.0.0.1:40811]
2013-12-05 18:31:02,648:20390(0x7f44e05d3700):ZOO_INFO@check_events@1632: session establishment complete on server [127.0.0.1:40811], sessionId=0x142c5be65280001, negotiated timeout=6000
I1205 18:31:02.648991 20411 group.cpp:280] Group process ((1488)@127.0.0.1:59677) connected to ZooKeeper
I1205 18:31:02.649021 20411
[jira] [Created] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.
Benjamin Mahler created MESOS-1714:
--

Summary: The C++ 'Resources' abstraction should keep the underlying resources flattened.
Key: MESOS-1714
URL: https://issues.apache.org/jira/browse/MESOS-1714
Project: Mesos
Issue Type: Bug
Components: c++ api
Reporter: Benjamin Mahler

Currently, the C++ Resources class does not ensure that the underlying Resources protobufs are kept flat. This is an issue because some of the methods, e.g. [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269], assume the resources are flat.

There is code that constructs unflattened resources, e.g. [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353].

We could prevent this type of construction; however, it is perfectly fine if we ensure the C++ 'Resources' class performs flattening.
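To illustrate why unflattened resources are a problem, here is a minimal C++ sketch. It does not use the Mesos Resources class: ScalarResource, flatten and getFirst are hypothetical stand-ins, and the sketch ignores roles and non-scalar resource types. It only shows how a lookup that assumes flat input can return a partial value.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for a scalar Resource protobuf (hypothetical type).
struct ScalarResource
{
  std::string name; // e.g. "cpus" or "mem"
  double value;
};

// Merges entries that share a name so each name appears at most once.
std::vector<ScalarResource> flatten(const std::vector<ScalarResource>& in)
{
  std::map<std::string, double> totals;
  for (const ScalarResource& r : in) {
    totals[r.name] += r.value;
  }

  std::vector<ScalarResource> out;
  for (const auto& entry : totals) {
    out.push_back(ScalarResource{entry.first, entry.second});
  }
  return out;
}

// A lookup that assumes flat input: it returns the first match only, so
// it under-reports when the same name appears more than once.
double getFirst(const std::vector<ScalarResource>& in, const std::string& name)
{
  for (const ScalarResource& r : in) {
    if (r.name == name) {
      return r.value;
    }
  }
  return 0.0;
}
```

If a container held {cpus: 1.0, mem: 512, cpus: 0.5}, the lookup would see only 1.0 cpus until the entries are flattened, which is the kind of inconsistency the ticket wants the class itself to rule out.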
[jira] [Commented] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101448#comment-14101448 ]

Jie Yu commented on MESOS-1574:
---

If you turn on the network isolator in 0.20.0, we will have isolation for the 'ports' resource as well. So if a process is using a port that is not assigned to it, it can still bind that port, but it won't be able to use that port to communicate with others. That's because we install tc filters for each container and will drop packets whose src port does not belong to the container.

what to do when a rogue process binds to a port mesos didn't allocate to it?

Key: MESOS-1574
URL: https://issues.apache.org/jira/browse/MESOS-1574
Project: Mesos
Issue Type: Improvement
Components: allocation, isolation
Reporter: Jay Buffington
Priority: Minor

I recently had an issue where a slave had a process whose parent was init and which was bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (it lost track of this process during an upgrade?) or if there was a bad user who started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :(

If a rogue process binds to a port that mesos-slave has offered to the master as an available resource, there should be some sort of reckoning. Mesos could:
* kill the rogue process
* rescind the offer for that port
* have an api that can be plugged into a monitoring system to alert humans of this inconsistency
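The drop rule described in the comment above can be sketched as a plain decision function. This is not the tc filter configuration the network isolator installs; PortRange, Verdict and both function names are hypothetical, and the sketch only models "drop traffic whose source port is not assigned to the container".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Inclusive range of ports assigned to one container (hypothetical type).
struct PortRange
{
  std::uint16_t begin;
  std::uint16_t end;
};

enum class Verdict { FORWARD, DROP };

// Returns true if the port falls inside any of the assigned ranges.
bool ownsPort(const std::vector<PortRange>& assigned, std::uint16_t port)
{
  for (const PortRange& range : assigned) {
    if (port >= range.begin && port <= range.end) {
      return true;
    }
  }
  return false;
}

// Decides the fate of an outgoing packet based on its source port only.
Verdict filterBySourcePort(
    const std::vector<PortRange>& assigned,
    std::uint16_t sourcePort)
{
  return ownsPort(assigned, sourcePort) ? Verdict::FORWARD : Verdict::DROP;
}
```

Note that this models only the filtering step: as the comment says, the bind itself still succeeds, so the rogue process holds the port but cannot communicate through it.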
[jira] [Assigned] (MESOS-1713) Python framework test dies on OSX because of missing symbol
[ https://issues.apache.org/jira/browse/MESOS-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu reassigned MESOS-1713:
-

Assignee: Jie Yu (was: Timothy Chen)

Python framework test dies on OSX because of missing symbol
---

Key: MESOS-1713
URL: https://issues.apache.org/jira/browse/MESOS-1713
Project: Mesos
Issue Type: Bug
Affects Versions: 0.20.0
Reporter: Thomas Rampelberg
Assignee: Jie Yu

When building 0.20 on OSX, the build completes but during the test there is a problem with the cgroup symbols:

ImportError: dlopen(/Users/jyu/.python-eggs/mesos.native-0.20.0-py2.6-macosx-10.4-x86_64.egg-tmp/mesos/native/_mesos.so, 2): Symbol not found: __ZN7cgroups9hierarchyERKSs
Referenced from: /Users/jyu/.python-eggs/mesos.native-0.20.0-py2.6-macosx-10.4-x86_64.egg-tmp/mesos/native/_mesos.so
[jira] [Created] (MESOS-1715) The slave does not send pending tasks during re-registration.
Benjamin Mahler created MESOS-1715:
--

Summary: The slave does not send pending tasks during re-registration.
Key: MESOS-1715
URL: https://issues.apache.org/jira/browse/MESOS-1715
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Benjamin Mahler

In what looks like an oversight, the pending tasks in the slave (Framework::pending) are not sent in the re-registration message.

This can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave.
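A minimal C++ sketch of how the spurious TASK_LOST can arise, under simplifying assumptions: tasks are plain string ids, and reportedTasks and lostTasks are hypothetical helpers, not the real master or slave code or the actual re-registration message.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// What the slave reports when re-registering. Leaving out the pending
// tasks corresponds to the behavior described in this ticket.
std::set<std::string> reportedTasks(
    const std::set<std::string>& launched,
    const std::set<std::string>& pending,
    bool includePending)
{
  std::set<std::string> report = launched;
  if (includePending) {
    report.insert(pending.begin(), pending.end());
  }
  return report;
}

// Tasks the master knows about that are absent from the slave's report;
// in this sketch the master would mark these as TASK_LOST.
std::set<std::string> lostTasks(
    const std::set<std::string>& knownToMaster,
    const std::set<std::string>& reported)
{
  std::set<std::string> lost;
  std::set_difference(
      knownToMaster.begin(), knownToMaster.end(),
      reported.begin(), reported.end(),
      std::inserter(lost, lost.begin()));
  return lost;
}
```

With a master that knows tasks t1 and t2, and a slave that has launched t1 while t2 is still pending, the report without pending tasks makes t2 look lost even though the slave still holds it.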
[jira] [Created] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.
Benjamin Mahler created MESOS-1717:
--

Summary: The slave does not show pending tasks in the JSON endpoints.
Key: MESOS-1717
URL: https://issues.apache.org/jira/browse/MESOS-1717
Project: Mesos
Issue Type: Bug
Components: json api, slave
Reporter: Benjamin Mahler

The slave does not show pending tasks in the /state.json endpoint.

This is a bit tricky to add since we rely on knowing the executor directory.
[jira] [Resolved] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Buffington resolved MESOS-1574.
---

Resolution: Won't Fix

This is largely addressed by the network isolator. Thanks [~jieyu]

what to do when a rogue process binds to a port mesos didn't allocate to it?

Key: MESOS-1574
URL: https://issues.apache.org/jira/browse/MESOS-1574
Project: Mesos
Issue Type: Improvement
Components: allocation, isolation
Reporter: Jay Buffington
Priority: Minor
[jira] [Created] (MESOS-1718) Command executor can overcommit the slave.
Benjamin Mahler created MESOS-1718:
--

Summary: Command executor can overcommit the slave.
Key: MESOS-1718
URL: https://issues.apache.org/jira/browse/MESOS-1718
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Benjamin Mahler

Currently we give a small amount of resources to the command executor, in addition to resources used by the command task:

https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448

{code: title=}
ExecutorInfo Slave::getExecutorInfo(
    const FrameworkID& frameworkId,
    const TaskInfo& task)
{
  ...
    // Add an allowance for the command executor. This does lead to a
    // small overcommit of resources.
    executor.mutable_resources()->MergeFrom(
        Resources::parse(
          "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
          "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
  ...
}
{code}

This leads to an overcommit of the slave. Ideally, for command tasks we can "transfer" all of the task resources to the executor at the slave / isolation level.
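A small worked example of the overcommit, as a hedged C++ sketch. The Amount type and both helpers are hypothetical, and the allowance used in the note below (0.1 cpus and 32 MB per command executor) is an assumed value for DEFAULT_EXECUTOR_CPUS and DEFAULT_EXECUTOR_MEM, not one stated in this ticket. Integer milli-cpus are used to avoid floating-point comparisons.

```cpp
#include <cassert>
#include <vector>

// Resource amounts in integer units (hypothetical type).
struct Amount
{
  long milliCpus;
  long memMb;
};

// Sums what the slave actually commits: every command task plus one
// executor allowance per task.
Amount committed(const std::vector<Amount>& tasks, const Amount& allowance)
{
  Amount total{0, 0};
  for (const Amount& task : tasks) {
    total.milliCpus += task.milliCpus + allowance.milliCpus;
    total.memMb += task.memMb + allowance.memMb;
  }
  return total;
}

// True if the committed amount exceeds the slave's capacity in any dimension.
bool overcommitted(const Amount& capacity, const Amount& used)
{
  return used.milliCpus > capacity.milliCpus || used.memMb > capacity.memMb;
}
```

For a slave with 4 cpus and 4096 MB running four command tasks of 1 cpu and 1024 MB each, the tasks alone fit exactly, but with the assumed allowance the slave commits 4.4 cpus and 4224 MB.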
[jira] [Created] (MESOS-1719) Master should persist active frameworks information
Vinod Kone created MESOS-1719:
-

Summary: Master should persist active frameworks information
Key: MESOS-1719
URL: https://issues.apache.org/jira/browse/MESOS-1719
Project: Mesos
Issue Type: Task
Reporter: Vinod Kone

https://issues.apache.org/jira/browse/MESOS-1219 disallows completed frameworks from re-registering with the same framework id, as long as the master doesn't fail over.

This ticket tracks the work to make it work across master failover using the registrar.

There are some open questions that need to be addressed:

-- Should the registry contain framework ids only, or framework infos? For disallowing completed frameworks from re-registering, persisting framework ids is enough. But if, in the future, we want to disallow frameworks from re-registering when some parts of the framework info have changed, then we need to persist the info too.

-- How to update the framework info. Currently frameworks are allowed to update framework info while re-registering, but it only takes effect on the master when the master fails over and on the slave when the slave fails over. How should things change when we persist framework info?
[jira] [Created] (MESOS-1720) Slave should send exited executor message when the executor is never launched.
Benjamin Mahler created MESOS-1720:
--

Summary: Slave should send exited executor message when the executor is never launched.
Key: MESOS-1720
URL: https://issues.apache.org/jira/browse/MESOS-1720
Project: Mesos
Issue Type: Bug
Components: master, slave
Reporter: Benjamin Mahler

When the slave sends TASK_LOST before launching an executor for a task, the slave does not send an exited executor message to the master.

Since the master receives no exited executor message, it still thinks the executor's resources are consumed on the slave.

One possible fix for this would be to send the exited executor message to the master in these cases.
[jira] [Resolved] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone resolved MESOS-1219.
---

Resolution: Fixed
Fix Version/s: 0.21.0

commit e9643f47720d8c5c9e6278ec068e771b94e94fbc
Author: Vinod Kone vi...@twitter.com
Date: Fri Apr 18 15:31:59 2014 -0700

Fixed master to reject completed frameworks from re-registering.

Review: https://reviews.apache.org/r/20507

Master should disallow frameworks that reconnect after failover timeout.

Key: MESOS-1219
URL: https://issues.apache.org/jira/browse/MESOS-1219
Project: Mesos
Issue Type: Bug
Components: master, webui
Reporter: Robert Lacroix
Assignee: Vinod Kone
Fix For: 0.21.0

When a scheduler reconnects after the failover timeout has been exceeded, the framework id is usually reused because the scheduler doesn't know that the timeout was exceeded, and it is actually handled as a new framework.

The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one.

Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique?
[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101757#comment-14101757 ]

Benjamin Mahler commented on MESOS-1466:
---

We're going to proceed with a mitigation of this by rejecting tasks once the slave is overcommitted:
https://issues.apache.org/jira/browse/MESOS-1721

However, we would also like to ensure that this kind of race is not possible. One solution is to use master acknowledgments for executor exits:

(1) When an executor terminates (or the executor could not be launched: MESOS-1720), we send an exited executor message.
(2) The master acknowledges these messages.
(3) The slave will not accept tasks for unacknowledged terminal executors (this must include those executors that could not be launched, per MESOS-1720).

The result of this is that a new executor cannot be launched until the master is aware of the old executor exiting.

Race between executor exited event and launch task can cause overcommit of resources

Key: MESOS-1466
URL: https://issues.apache.org/jira/browse/MESOS-1466
Project: Mesos
Issue Type: Bug
Components: allocation, master
Reporter: Vinod Kone
Assignee: Benjamin Mahler
Labels: reliability

The following sequence of events can cause an overcommit:
-- Launch task is called for a task whose executor is already running
-- Executor's resources are not accounted for on the master
-- Executor exits and the event is enqueued behind launch tasks on the master
-- Master sends the task to the slave, which needs to commit resources for the task and the (new) executor.
-- Master processes the executor exited event and re-offers the executor's resources, causing an overcommit of resources.
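The three-step acknowledgment idea from the comment above can be sketched as a tiny piece of slave-side bookkeeping. SlaveSketch and its methods are hypothetical names used only for illustration; real message passing, retries and persistence are left out.

```cpp
#include <cassert>
#include <set>
#include <string>

// Tracks executors whose exit the master has not yet acknowledged.
class SlaveSketch
{
public:
  // Step (1): the executor terminated or could not be launched. A real
  // slave would also send an exited executor message to the master here.
  void executorExited(const std::string& executorId)
  {
    unacknowledged_.insert(executorId);
  }

  // Step (2): the master's acknowledgment for that exit arrived.
  void acknowledged(const std::string& executorId)
  {
    unacknowledged_.erase(executorId);
  }

  // Step (3): refuse tasks for executors with an unacknowledged exit, so
  // a new executor cannot start before the master knows the old one is gone.
  bool acceptTask(const std::string& executorId) const
  {
    return unacknowledged_.count(executorId) == 0;
  }

private:
  std::set<std::string> unacknowledged_;
};
```

In the race described in the ticket, a launch that arrives between the exit and the acknowledgment would be rejected rather than committing resources for a second executor.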
[jira] [Resolved] (MESOS-1713) Python framework test dies on OSX because of missing symbol
[ https://issues.apache.org/jira/browse/MESOS-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu resolved MESOS-1713.
---

Resolution: Fixed
Fix Version/s: 0.20.0

Python framework test dies on OSX because of missing symbol
---

Key: MESOS-1713
URL: https://issues.apache.org/jira/browse/MESOS-1713
Project: Mesos
Issue Type: Bug
Affects Versions: 0.20.0
Reporter: Thomas Rampelberg
Assignee: Jie Yu
Fix For: 0.20.0