[jira] [Commented] (MESOS-1845) CommandInfo tasks may fail when scheduled after another task with the same id has finished.
[ https://issues.apache.org/jira/browse/MESOS-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156207#comment-14156207 ]

Alexander Rukletsov commented on MESOS-1845:

My understanding is that we can reuse the task id when the task enters any of the terminal states, i.e. right after corresponding statusUpdate() invocation.

CommandInfo tasks may fail when scheduled after another task with the same id has finished.
---
Key: MESOS-1845
URL: https://issues.apache.org/jira/browse/MESOS-1845
Project: Mesos
Issue Type: Bug
Reporter: Andreas Raster

I created a little test framework to experiment with scheduling tasks where running one task relies on the results of another, previously run task. In my test framework I would first schedule a task that appends the string foo to a file, and after that one finishes I would schedule a task that appends bar to the same file.

This worked well when using ExecutorInfo, but when I switched to using CommandInfo instead (specifying commands like 'echo foo >> /share/foobar.txt' in set_value()), it would fail most of the time in the second step, when attempting to append bar. Occasionally, but very rarely, it would work.

I couldn't find any meaningful log messages indicating what exactly went wrong. The slave log would indicate that the task's status changed to TASK_FAILED and that the status update was sent correctly. The stdout log in the Sandbox would indicate that the command 'exited with status 0'.

I could work around the issue by specifying task ids that were always unique. Previously, the follow-up task that appends bar reused the id of the already finished task that appended foo.

It seems to me there might be something wrong when scheduling very short running tasks with the same id quickly after each other.

Source code for my foobar framework: http://paste.ubuntu.com/8459083
Build with: g++ -std=c++0x -g -Wall foobar_framework.cpp -I. -L/usr/local/lib -lmesos -o foobar-framework

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
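The workaround described in the report (never reusing a task id) can be sketched as follows. This is only an illustration under stated assumptions: TaskIdGenerator is a hypothetical helper, not part of Mesos, and the report's own framework code is not reproduced here. The resulting string would be handed to the task id of each new task.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of Mesos): hands out task id strings that
// are never repeated within the lifetime of one scheduler process.
class TaskIdGenerator
{
public:
  explicit TaskIdGenerator(const std::string& prefix)
    : prefix_(prefix), next_(0) {}

  // Returns "<prefix>-<n>", where n increases by one on every call.
  std::string next()
  {
    std::ostringstream out;
    out << prefix_ << "-" << next_++;
    return out.str();
  }

private:
  std::string prefix_;
  unsigned long next_;
};
```

With one generator per framework, the foo task and the follow-up bar task get different ids even when the second is launched right after the first reaches a terminal state. Ids are only unique per process; a framework that restarts would need to persist the counter or add something like a start timestamp to the prefix.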
[jira] [Comment Edited] (MESOS-1845) CommandInfo tasks may fail when scheduled after another task with the same id has finished.
[ https://issues.apache.org/jira/browse/MESOS-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156207#comment-14156207 ]

Alexander Rukletsov edited comment on MESOS-1845 at 10/2/14 8:10 AM:

My understanding is that we can reuse the task id when the task enters any of the terminal states, i.e. right after corresponding statusUpdate() invocation.

was (Author: alex-mesos):
My understanding is that we can reuse the task id when the task enters any of the terminal states, i.e. right after corresponding statusUpdate() invokation.
[jira] [Commented] (MESOS-1857) path::join() is broken
[ https://issues.apache.org/jira/browse/MESOS-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156215#comment-14156215 ]

Alexander Rukletsov commented on MESOS-1857:

I don't think we can use the C++11 versions of containers (because of gcc 4.4.something). If I remember correctly, these features were introduced after gcc 4.6 (I haven't checked it now though).

path::join() is broken
--
Key: MESOS-1857
URL: https://issues.apache.org/jira/browse/MESOS-1857
Project: Mesos
Issue Type: Bug
Affects Versions: 0.21.0
Environment: CentOS6
Reporter: Vinod Kone
Assignee: Cody Maloney

Saw this on internal CI

{code}
In file included from ./stout/include/stout/os.hpp:63,
                 from ./stout/include/stout/flags/flags.hpp:30,
                 from ./stout/include/stout/flags.hpp:17,
                 from stout/tests/flags_tests.cpp:7:
./stout/include/stout/path.hpp: In function ‘std::string path::join(const std::string, T ...)’:
./stout/include/stout/path.hpp:42: error: ‘const struct std::basic_string<char, std::char_traits<char>, std::allocator<char> >’ has no member named ‘back’
make[7]: *** [stout_tests-flags_tests.o] Error 1
{code}
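The compiler error above comes from std::string::back(), which is a C++11 addition to std::string. A minimal sketch of a join that avoids it is shown below. joinPaths is a hypothetical two-argument stand-in written for this illustration; it is not the stout implementation and says nothing about how the actual fix was made.

```cpp
#include <string>

// Hypothetical stand-in for a two-argument path join; this is NOT the stout
// implementation. It avoids std::string::back(), which only exists in C++11.
inline std::string joinPaths(const std::string& path1, const std::string& path2)
{
  if (path1.empty()) {
    return path2;
  }

  // Pre-C++11 way to read the last character of a non-empty string.
  const char last = path1[path1.size() - 1];

  // Drop one leading separator from path2 so the result has no "//".
  const std::string tail =
    (!path2.empty() && path2[0] == '/') ? path2.substr(1) : path2;

  return last == '/' ? path1 + tail : path1 + "/" + tail;
}
```

Indexing with size() - 1 is only valid on a non-empty string, which is why the empty case is handled first. *path1.rbegin() would be an equivalent C++03 spelling.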
[jira] [Commented] (MESOS-898) Transform and audit mesos build process
[ https://issues.apache.org/jira/browse/MESOS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156613#comment-14156613 ]

Timothy St. Clair commented on MESOS-898:

Cody, I do, but it's been idle for a while. If I have a willing partner, I'd be happy to resurrect it.

Transform and audit mesos build process
---
Key: MESOS-898
URL: https://issues.apache.org/jira/browse/MESOS-898
Project: Mesos
Issue Type: Improvement
Components: build
Reporter: Timothy St. Clair
Labels: build

This is a rather substantial undertaking, so I would want upstream debate+buy-in prior to full commitment.

The basic premise is: upstream rebundles several of its dependencies, in part to tightly control its stack. This is not out of the norm, but in order to be picked up by distribution channels it needs to be built against system dependencies, and rebundling is strictly forbidden. Given that the primary target platforms for mesos are data-center distributions such as RHEL/CENTOS/SL, it makes sense to still have bundling support for those who do not have the dependencies in their channels yet.

This is where cmake can be a win with its uber macros (http://www.cmake.org/cmake/help/v2.8.8/cmake.html#module:ExternalProject). I do not know of any equivalent in the autotools world, other than to brew your own solution.

I've done this type of work in the past, and completely transformed condor, and would leverage a lot of the work that was done there. I currently have a tracking branch where I've started this work, but before I go off into the woods, it makes sense to have a debate in public.

The primary benefits are:
1. Enable downstream channels to easily distro without carrying large patch sets.
2. Still support existing non-proper distribution methods.
3. Harden / future proof dependent interfaces.

Side Benefits:
Audit current build mechanics.
- Presently the language specific bindings are not installed. (.py .jar)
- make -jX currently fails
- optionally look into ARM support.

Costs:
1. Time
2. Potential temporary destabilization
3. Infrastructure around build+test may need to change.
[jira] [Commented] (MESOS-898) Transform and audit mesos build process
[ https://issues.apache.org/jira/browse/MESOS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156767#comment-14156767 ]

Dominic Hamon commented on MESOS-898:

I would also like to be part of this effort, time allowing.
[jira] [Commented] (MESOS-1835) Check for IP address being localhost not platform independent
[ https://issues.apache.org/jira/browse/MESOS-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156823#comment-14156823 ]

Vinod Kone commented on MESOS-1835:

__ip__ is in network order, so I don't think the endianness of the machine matters?
http://en.wikipedia.org/wiki/Endianness#Endianness_in_networking

Check for IP address being localhost not platform independent
-
Key: MESOS-1835
URL: https://issues.apache.org/jira/browse/MESOS-1835
Project: Mesos
Issue Type: Bug
Components: libprocess
Affects Versions: 0.20.1
Reporter: Anindya Sinha
Assignee: Evelina Dumitrescu

In process::initialize() [3rdparty/src/libprocess/process.cpp], the check for __ip__ being localhost (127.0.0.1) is done by checking if __ip__ == 2130706433. However, the value could be either 2130706433 or 16777343 depending on endianness. This check should succeed independent of the endianness, so it would be good to do an 'inet_ntop' and then compare against the string for 127.0.0.1.
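Two endianness-independent ways of writing the check discussed above can be sketched like this. Both function names are hypothetical, and the sketch assumes a POSIX system and that the address is held as a 32-bit IPv4 value in network byte order, as the comment says __ip__ is; it is not the libprocess code.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string>

// Assumes `ip` is an IPv4 address in network byte order. INADDR_LOOPBACK is
// defined in host byte order, so it is converted before comparing.
inline bool isLoopbackNetworkOrder(uint32_t ip)
{
  return ip == htonl(INADDR_LOOPBACK);
}

// The alternative proposed in the report: format the address as text with
// inet_ntop and compare against the dotted-quad string.
inline bool isLoopbackByString(uint32_t ip)
{
  char buffer[INET_ADDRSTRLEN];
  struct in_addr addr;
  addr.s_addr = ip;
  if (inet_ntop(AF_INET, &addr, buffer, sizeof(buffer)) == NULL) {
    return false;
  }
  return std::string(buffer) == "127.0.0.1";
}
```

Both variants only match 127.0.0.1 exactly, not the whole 127.0.0.0/8 range, and neither covers IPv6. The htonl comparison avoids the string formatting and gives the same answer on big-endian and little-endian hosts.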
[jira] [Commented] (MESOS-898) Transform and audit mesos build process
[ https://issues.apache.org/jira/browse/MESOS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156878#comment-14156878 ]

Cody Maloney commented on MESOS-898:

I'd definitely like to help work on it. In trying to get build time down with ccache and other tricks, I've gotten about as far as I can without lots of changes and fixes inside the autotools/autoconf stuff, and I'd much rather just plug in a cleaner, simpler system than do that.

I'd also like to get it so that our bundled libraries don't have to live in the git repo itself (there isn't a good reason why the mesos repository is as large, or takes as long to clone, as it currently does), as well as simplify updating to newer versions of things (there are quite a few little things that would be fixed by newer GLog, GMock, GTest).
[jira] [Commented] (MESOS-1857) path::join() is broken
[ https://issues.apache.org/jira/browse/MESOS-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156880#comment-14156880 ]

Cody Maloney commented on MESOS-1857:

[~alex-mesos] A whole bunch of them are around, it is very hit or miss and there isn't a good list online at all. The answer is we need a better bot (preferably public, but at least internal). I'm working on that, and hopefully will have something before too much longer.

path::join() is broken
--
Key: MESOS-1857
URL: https://issues.apache.org/jira/browse/MESOS-1857
Project: Mesos
Issue Type: Bug
Affects Versions: 0.21.0
Environment: CentOS6
Reporter: Vinod Kone
Assignee: Cody Maloney
Fix For: 0.21.0
[jira] [Commented] (MESOS-1853) Remove /proc and /sys remounts from port_mapping isolator
[ https://issues.apache.org/jira/browse/MESOS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156928#comment-14156928 ]

Jie Yu commented on MESOS-1853:

What about /sys? Do we need to mount it inside the container?

Remove /proc and /sys remounts from port_mapping isolator
-
Key: MESOS-1853
URL: https://issues.apache.org/jira/browse/MESOS-1853
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.20.0, 0.20.1
Reporter: Ian Downes
Assignee: Ian Downes

/proc/net reflects a new network namespace regardless and remount doesn't actually do what we expected anyway, i.e., it's not sufficient for a new pid namespace and a new mount is required.
[jira] [Commented] (MESOS-1858) Leaked file descriptors in StatusUpdateStream.
[ https://issues.apache.org/jira/browse/MESOS-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156982#comment-14156982 ]

Benjamin Mahler commented on MESOS-1858:

Linking in MESOS-1432.

Leaked file descriptors in StatusUpdateStream.
--
Key: MESOS-1858
URL: https://issues.apache.org/jira/browse/MESOS-1858
Project: Mesos
Issue Type: Bug
Reporter: Jie Yu

https://github.com/apache/mesos/blob/master/src/slave/status_update_manager.hpp#L180

We should set cloexec for 'fd'.
[jira] [Commented] (MESOS-898) Transform and audit mesos build process
[ https://issues.apache.org/jira/browse/MESOS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157053#comment-14157053 ]

Timothy St. Clair commented on MESOS-898:

1st pass, just throwing things over the master fence: https://github.com/timothysc/mesos/tree/cmake_round2.

I'll try to clean up some of the other files and move them over as time permits. I'd be happy to add collaborators on the repo, where needed.
[jira] [Commented] (MESOS-444) Remove --checkpoint flag in the slave once checkpointing is stable.
[ https://issues.apache.org/jira/browse/MESOS-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157092#comment-14157092 ]

Cody Maloney commented on MESOS-444:

First patch to work on this: https://reviews.apache.org/r/26275
Email to mesos-dev: http://www.mail-archive.com/dev@mesos.apache.org/msg20729.html

The general plan I have for implementing this:
1) Make it so the flag can't be changed at the command line
2) Remove the checkpoint variable entirely from slave/flags.hpp. This is a fairly involved change since a number of unit tests depend on manually setting the flag, as well as the default being non-checkpointing.
3) Remove logic around checkpointing in the slave
4) Drop the flag from the SlaveInfo struct, remove logic inside the master (will require a deprecation cycle).

Remove --checkpoint flag in the slave once checkpointing is stable.
---
Key: MESOS-444
URL: https://issues.apache.org/jira/browse/MESOS-444
Project: Mesos
Issue Type: Task
Reporter: Benjamin Mahler
Assignee: Cody Maloney
Labels: newbie

In the interim of slave recovery being worked on (see: MESOS-110), we've added a --checkpoint flag to the slave to enable or disable the feature. Prior to releasing this feature, we need to remove this flag so that all slaves have checkpointing available, and frameworks can choose to use it. There's no need to keep this flag around and add configuration complexity.
[jira] [Updated] (MESOS-444) Remove --checkpoint flag in the slave once checkpointing is stable.
[ https://issues.apache.org/jira/browse/MESOS-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cody Maloney updated MESOS-444:

Shepherd: Vinod Kone (was: Vinod Kone)
[jira] [Updated] (MESOS-444) Remove --checkpoint flag in the slave once checkpointing is stable.
[ https://issues.apache.org/jira/browse/MESOS-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cody Maloney updated MESOS-444:

Shepherd: Vinod Kone
[jira] [Commented] (MESOS-1835) Check for IP address being localhost not platform independent
[ https://issues.apache.org/jira/browse/MESOS-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157192#comment-14157192 ]

Timothy St. Clair commented on MESOS-1835:

Is this an actual issue or a NIT? Also I'm uncertain how this change would be agnostic to the whole IPv6 work.
[jira] [Commented] (MESOS-1835) Check for IP address being localhost not platform independent
[ https://issues.apache.org/jira/browse/MESOS-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157344#comment-14157344 ]

Anindya Sinha commented on MESOS-1835:

inet_pton returns the address in network order, which is big endian. So 127.0.0.1 reads as 0x0100007f (16777343) when stored in __ip__ on little endian machines. However, the check for __ip__ is being done against 2130706433, so that check fails.

On big endian machines, __ip__ should be stored as 0x7f000001 (2130706433), although I do not have the means to verify this, and hence the check for __ip__ == 2130706433 would succeed.
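The two integers quoted in this thread can be reproduced with plain arithmetic on the four address bytes, without needing a big-endian machine. The sketch below is only an illustration of that arithmetic; asBigEndian and asLittleEndian are hypothetical names and nothing here is taken from libprocess.

```cpp
#include <stdint.h>

// Interprets four address bytes, given in wire (network) order, as the
// 32-bit integer a big-endian host would read from that memory.
inline uint32_t asBigEndian(const unsigned char b[4])
{
  return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
         (uint32_t(b[2]) << 8) | uint32_t(b[3]);
}

// Interprets the same four bytes as a little-endian host would read them.
inline uint32_t asLittleEndian(const unsigned char b[4])
{
  return (uint32_t(b[3]) << 24) | (uint32_t(b[2]) << 16) |
         (uint32_t(b[1]) << 8) | uint32_t(b[0]);
}
```

For the bytes 127, 0, 0, 1 the big-endian reading is 0x7F000001 = 2130706433 and the little-endian reading is 0x0100007F = 16777343, which are the two values under discussion. A comparison against a fixed host-order constant can therefore only match on one kind of host.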