[jira] [Updated] (MESOS-6917) Segfault when the executor sets an invalid UUID when sending a status update.

2017-01-17 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6917:
--
Shepherd: Vinod Kone  (was: Anand Mazumdar)

> Segfault when the executor sets an invalid UUID  when sending a status update.
> --
>
> Key: MESOS-6917
> URL: https://issues.apache.org/jira/browse/MESOS-6917
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.1.1, 1.2.0
>
>
> A segfault occurs when an executor sets a UUID that's not a valid v4 UUID and 
> sends it off to the agent:
> {code}
> ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state 
> == ERROR: Not a valid UUID
> *** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
> using GNU date ***
> PC: @ 0x7efeb6101428 (unknown)
> *** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
> 14007; stack trace: ***
> @ 0x7efeb64a6390 (unknown)
> @ 0x7efeb6101428 (unknown)
> @ 0x7efeb610302a (unknown)
> @ 0x560df739fa6e _Abort()
> @ 0x560df739fa9c _Abort()
> @ 0x7efebb53a5ad Try<>::get()
> @ 0x7efebb5363d6 Try<>::get()
> @ 0x7efebbd84809 
> mesos::internal::slave::validation::executor::call::validate()
> @ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
> @ 0x7efebbc773b8 
> _ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
> @ 0x7efebbcb5808 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
> @ 0x7efebbfb2aea std::function<>::operator()()
> @ 0x7efebcb158b8 
> _ZZZN7process11ProcessBase6_visitERKNS0_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSD_14authentication20AuthenticationResultEEE0_clESN_ENKUlbE1_clEb
> @ 0x7efebcb1a10a 
> _ZZZNK7process9_DeferredIZZNS_11ProcessBase6_visitERKNS1_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSE_14authentication20AuthenticationResultEEE0_clESO_EUlbE1_EcvSt8functionIFT_T0_EEINS_6FutureINSE_8ResponseEEERKbEEvENKUlS12_E_clES12_ENKUlvE_clEv
> @ 0x7efebcb1c5f8 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestNKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7efebb5ce8ca std::function<>::operator()()
> @ 0x7efebb5c4b27 
> _ZZN7process8internal8DispatchINS_6FutureINS_4http8ResponseclIRSt8functionIFS5_vS5_RKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESI_
> @ 0x7efebb5d4e1e 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseclIRSt8functionIFS9_vS9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7efebcb30baf std::function<>::operator()()
> @ 0x7efebcb13fd6 process::ProcessBase::visit()
> @ 0x7efebcb1f3c8 process::DispatchEvent::visit()
> @ 0x7efebb3ab2ea process::ProcessBase::serve()
> @ 0x7efebcb0fe8a process::ProcessManager::resume()
> @ 0x7efebcb0c5a3 
> _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> @ 0x7efebcb1ea34 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7efebcb1e98a 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> @ 0x7efebcb1e91a 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7efeb6980c80 (unknown)
> @ 0x7efeb649c6ba start_thread
> @ 0x7efeb61d282d (unknown)
> Aborted (core dumped)
> {code}
> https://reviews.apache.org/r/55480/
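> A minimal sketch of the kind of check the validation path needs, assuming 
> {{UUID::fromBytes}} returns a {{Try}} (the helper name below is hypothetical; 
> the actual patch is in the review above):
> {code}
> #include <string>
>
> #include <stout/error.hpp>
> #include <stout/none.hpp>
> #include <stout/option.hpp>
> #include <stout/try.hpp>
> #include <stout/uuid.hpp>
>
> // Return an Error for a malformed UUID instead of calling Try::get()
> // on an errored result, which aborts the agent.
> Option<Error> validateStatusUpdateUUID(const std::string& bytes)
> {
>   Try<UUID> uuid = UUID::fromBytes(bytes);
>   if (uuid.isError()) {
>     return Error("Not a valid UUID: " + uuid.error());
>   }
>
>   return None();
> }
> {code}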



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6936) Add support for media types needed for streaming request/responses.

2017-01-17 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6936:
-

 Summary: Add support for media types needed for streaming 
request/responses.
 Key: MESOS-6936
 URL: https://issues.apache.org/jira/browse/MESOS-6936
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar
Priority: Blocker


As per the design document created as part of MESOS-3601, we need to add 
support for the additional media types proposed to our API handlers for 
supporting request streaming. These headers would also be used by the server in 
the future for streaming responses.

The following media types need to be added:

{{RecordIO-Accept}}: Enables the client to perform content negotiation for the 
contents of the stream. The supported values for this header would be 
{{application/json}} and {{application/x-protobuf}}.
{{RecordIO-Content-Type}}: The content type of the RecordIO stream sent by the 
server. The supported values for this header would be {{application/json}} and 
{{application/x-protobuf}}.

The {{Content-Type}} for the response would be {{application/recordio}}. For 
more details/examples see the alternate proposal section of the design doc:

https://docs.google.com/document/d/1OV1D5uUmWNvTaX3qEO9fZGo4FRlCSqrx0IHq5GuLAk8/edit#
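
A hypothetical exchange using the proposed headers (header names as described 
above; the endpoint, host, and values are illustrative only):

{code:title=Request (illustrative)}
POST /api/v1/scheduler HTTP/1.1
Host: master.example.com:5050
Content-Type: application/json
Accept: application/recordio
RecordIO-Accept: application/json
{code}

{code:title=Response (illustrative)}
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/recordio
RecordIO-Content-Type: application/json
{code}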



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream

2017-01-17 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826427#comment-15826427
 ] 

Anand Mazumdar commented on MESOS-3601:
---

The implementation of the headers proposed in the design doc is being tracked 
in MESOS-6936.

Resolving this for now. 

> Formalize all headers and metadata for HTTP API Event Stream
> 
>
> Key: MESOS-3601
> URL: https://issues.apache.org/jira/browse/MESOS-3601
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.24.0
> Environment: Mesos 0.24.0
>Reporter: Ben Whitehead
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: api, http, mesosphere, wireprotocol
> Fix For: 1.2.0
>
>
> From an HTTP standpoint, the current set of headers returned when connecting 
> to the HTTP scheduler API is insufficient. 
> {code:title=current headers}
> HTTP/1.1 200 OK
> Transfer-Encoding: chunked
> Date: Wed, 30 Sep 2015 21:07:16 GMT
> Content-Type: application/json
> {code}
> Since the response from mesos is intended to function as a stream, 
> {{Connection: keep-alive}} should be specified so that the connection can 
> remain open.
> If RecordIO is going to be applied to the messages, the headers should 
> include the information necessary for a client to be able to detect RecordIO 
> and set up its response handlers appropriately.
> How RecordIO is expressed will come down to the semantics of what is actually 
> "Returned" as the response from {{POST /api/v1/scheduler}}.
> h4. Proposal
> One approach would be to leverage http as much as possible, having a client 
> specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate 
> that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} 
> messages.  (This approach allows for things like gzip to be woven in fairly 
> easily in the future)
> For this approach I would expect the following:
> {code:title=Request}
> POST /api/v1/scheduler HTTP/1.1
> Host: localhost:5050
> Accept: application/x-protobuf
> Accept-Encoding: recordio
> Content-Type: application/x-protobuf
> Content-Length: 35
> User-Agent: RxNetty Client
> {code}
> {code:title=Response}
> HTTP/1.1 200 OK
> Connection: keep-alive
> Transfer-Encoding: chunked
> Content-Type: application/x-protobuf
> Content-Encoding: recordio
> Cache-Control: no-transform
> {code}
> When {{Content-Encoding}} is used, it is recommended to set {{Cache-Control: 
> no-transform}} to signal to any proxies that no transformation should be 
> applied to the content encoding [Section 14.11 RFC 
> 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11].
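> For reference, RecordIO framing prefixes each record with its length in bytes 
> followed by a newline ("\n" below denotes that newline byte); an illustrative 
> frame carrying a heartbeat event, not taken from an actual capture:
> {code}
> 20\n{"type":"HEARTBEAT"}
> {code}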



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6936) Add support for media types needed for streaming request/responses.

2017-01-17 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6936:
--
Sprint: Mesosphere Sprint 49

> Add support for media types needed for streaming request/responses.
> ---
>
> Key: MESOS-6936
> URL: https://issues.apache.org/jira/browse/MESOS-6936
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> As per the design document created as part of MESOS-3601, we need to add 
> support for the additional media types proposed to our API handlers for 
> supporting request streaming. These headers would also be used by the server 
> in the future for streaming responses.
> The following media types need to be added:
> {{RecordIO-Accept}}: Enables the client to perform content negotiation for 
> the contents of the stream. The supported values for this header would be 
> {{application/json}} and {{application/x-protobuf}}.
> {{RecordIO-Content-Type}}: The content type of the RecordIO stream sent by 
> the server. The supported values for this header would be 
> {{application/json}} and {{application/x-protobuf}}.
> The {{Content-Type}} for the response would be {{application/recordio}}. For 
> more details/examples see the alternate proposal section of the design doc:
> https://docs.google.com/document/d/1OV1D5uUmWNvTaX3qEO9fZGo4FRlCSqrx0IHq5GuLAk8/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6937) ContentType/MasterAPITest.ReserveResources/1 fails during Writer close

2017-01-17 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6937:
--
Target Version/s: 1.2.0
Priority: Blocker  (was: Major)

> ContentType/MasterAPITest.ReserveResources/1 fails during Writer close
> --
>
> Key: MESOS-6937
> URL: https://issues.apache.org/jira/browse/MESOS-6937
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: ASF CI, Ubuntu 14.04, libevent and SSL enabled
>Reporter: Greg Mann
>Priority: Blocker
>  Labels: tests
> Attachments: MasterAPITest.ReserveResources.txt
>
>
> This was observed on ASF CI. Libevent was enabled, but the test in question 
> was not running in SSL-enabled mode. We see the following stack trace:
> {code}
> *** Error in `src/mesos-tests': double free or corruption (fasttop): 
> 0x2b4f7001bf70 ***
> *** Aborted at 1484691168 (unix time) try "date -d @1484691168" if you are 
> using GNU date ***
> PC: @ 0x2b4f2bc9ac37 (unknown)
> *** SIGABRT (@0x3e869c7) received by PID 27079 (TID 0x2b4f35be5700) from 
> PID 27079; stack trace: ***
> @ 0x2b4f2b236330 (unknown)
> @ 0x2b4f2bc9ac37 (unknown)
> @ 0x2b4f2bc9e028 (unknown)
> @ 0x2b4f2bcd72a4 (unknown)
> @ 0x2b4f2bce355e (unknown)
> @ 0x2b4f299e98a0 
> _ZNSt14_Function_base13_Base_managerIZN7process8internal4LoopIZNS1_4http4Pipe6Reader7readAllEvEUlvE_ZNS6_7readAllEvEUlRKSsE0_SsSsE3runENS1_6FutureISsEEEUlvE3_E10_M_managerERSt9_Any_dataRKSG_St18_Manager_operation
> @ 0x2b4f299fadb9 
> _ZN7process8internal4LoopIZNS_4http4Pipe6Reader7readAllEvEUlvE_ZNS4_7readAllEvEUlRKSsE0_SsSsE3runENS_6FutureISsEE
> @ 0x2b4f299fca57 
> _ZNSt17_Function_handlerIFvRKN7process6FutureISsEEEZNKS2_5onAnyIRZNS0_8internal4LoopIZNS0_4http4Pipe6Reader7readAllEvEUlvE_ZNSB_7readAllEvEUlRKSsE0_SsSsE3runES2_EUlS4_E2_vEES4_OT_NS2_6PreferEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
> @ 0x2b4f28a4cc16 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureISsJRS4_EEEvRKSt6vectorIT_SaISB_EEDpOT0_
> @ 0x2b4f29a2479f process::Future<>::_set<>()
> @ 0x2b4f299f46a9 process::http::Pipe::Writer::close()
> @ 0x2b4f29a24d32 
> process::StreamingRequestDecoder::on_message_complete()
> @ 0x2b4f29b0641d http_parser_execute
> @ 0x2b4f29aaeafe process::internal::decode_recv()
> @ 0x2b4f29abc44b 
> _ZNSt17_Function_handlerIFvRKN7process6FutureImEEEZNKS2_5onAnyISt5_BindIFPFvS4_PcmNS0_7network8internal6SocketINS9_4inet7AddressEEEPNS0_23StreamingRequestDecoderEESt12_PlaceholderILi1EES8_mSE_SG_EEvEES4_OT_NS2_6PreferEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
> @  0x14e136e process::internal::run<>()
> @  0x14e5d9f process::Future<>::_set<>()
> @ 0x2b4f29a4c23d 
> _ZN7process8internal4LoopIZNS_2io8internal4readEiPvmEUlvE_ZNS3_4readEiS4_mEUlRK6OptionImEE0_S7_mE3runENS_6FutureIS7_EE
> @ 0x2b4f29a4dc6f 
> _ZNSt17_Function_handlerIFvRKN7process6FutureINS0_11ControlFlowImEZNKS4_5onAnyIRZNS0_8internal4LoopIZNS0_2io8internal4readEiPvmEUlvE_ZNSC_4readEiSD_mEUlRK6OptionImEE0_SG_mE3runENS1_ISG_EEEUlS6_E0_vEES6_OT_NS4_6PreferEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
> @ 0x2b4f29a5bec6 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureINS_11ControlFlowImEEJRS6_EEEvRKSt6vectorIT_SaISD_EEDpOT0_
> @ 0x2b4f29a5d971 process::Future<>::_set<>()
> @ 0x2b4f29a600a1 process::Promise<>::associate()
> @ 0x2b4f29a608da process::internal::thenf<>()
> @ 0x2b4f29b0170e 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureIsJRS4_EEEvRKSt6vectorIT_SaISB_EEDpOT0_
> @ 0x2b4f29b01cd1 process::Future<>::_set<>()
> @ 0x2b4f29b00b36 process::io::internal::pollCallback()
> @ 0x2b4f29b0b990 event_process_active_single_queue
> @ 0x2b4f29b0bf06 event_process_active
> @ 0x2b4f29b0c662 event_base_loop
> @ 0x2b4f29aff96d process::EventLoop::run()
> @ 0x2b4f2b4f5a60 (unknown)
> @ 0x2b4f2b22e184 start_thread
> {code}
> Find the log from the failed run attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6948) AgentAPITest.LaunchNestedContainerSession is flaky

2017-01-18 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6948:
--
Target Version/s: 1.2.0
Priority: Blocker  (was: Major)

From initial investigations, it seemed like the container exited before the 
API handler could attach to it. Marking it as a blocker for 1.2 pending 
further investigations.

cc: [~klueska][~greggomann]

> AgentAPITest.LaunchNestedContainerSession is flaky
> --
>
> Key: MESOS-6948
> URL: https://issues.apache.org/jira/browse/MESOS-6948
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: CentOS 7 VM, libevent and SSL enabled
>Reporter: Greg Mann
>Priority: Blocker
>  Labels: debugging, tests
> Attachments: AgentAPITest.LaunchNestedContainerSession.txt
>
>
> This was observed in a CentOS 7 VM, with libevent and SSL enabled:
> {code}
> I0118 22:17:23.528846  2887 http.cpp:464] Processing call 
> LAUNCH_NESTED_CONTAINER_SESSION
> I0118 22:17:23.530452  2887 containerizer.cpp:1807] Starting nested container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e
> I0118 22:17:23.532265  2887 containerizer.cpp:1831] Trying to chown 
> '/tmp/ContentType_AgentAPITest_LaunchNestedContainerSession_0_ykIax9/slaves/707fd1a2-1a93-4e9f-a9b2-5453a207b4c5-S0/frameworks/707fd1a2-1a93-4e9f-a9b2-5453a207b4c5-/executors/14a26e2a-58b7-4166-909c-c90787d84fcb/runs/492a5d0a-0060-416c-ad80-dd0441f558dc/containers/62c170bb-7298-4209-b797-80d7ca73353e'
>  to user 'vagrant'
> I0118 22:17:23.535213  2887 switchboard.cpp:570] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-5a08fbd5-0d70-411e-8389-ac115a5f6430"
>  --stderr_from_fd="15" --stderr_to_fd="2" --stdin_to_fd="12" 
> --stdout_from_fd="13" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e
> I0118 22:17:23.537210  2887 switchboard.cpp:600] Created I/O switchboard 
> server (pid: 3335) listening on socket file 
> '/tmp/mesos-io-switchboard-5a08fbd5-0d70-411e-8389-ac115a5f6430' for 
> container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e
> I0118 22:17:23.543665  2887 containerizer.cpp:1540] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"printf output && printf 
> error 
> 1>&2"},"environment":{},"err":{"fd":16,"type":"FD"},"in":{"fd":11,"type":"FD"},"out":{"fd":14,"type":"FD"},"user":"vagrant"}"
>  --pipe_read="12" --pipe_write="13" 
> --runtime_directory="/tmp/ContentType_AgentAPITest_LaunchNestedContainerSession_0_QVZGrY/containers/492a5d0a-0060-416c-ad80-dd0441f558dc/containers/62c170bb-7298-4209-b797-80d7ca73353e"
>  --unshare_namespace_mnt="false"'
> I0118 22:17:23.556032  2887 launcher.cpp:133] Forked child with pid '3337' 
> for container 
> '492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e'
> I0118 22:17:23.563900  2887 fetcher.cpp:349] Starting to fetch URIs for 
> container: 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e, 
> directory: 
> /tmp/ContentType_AgentAPITest_LaunchNestedContainerSession_0_ykIax9/slaves/707fd1a2-1a93-4e9f-a9b2-5453a207b4c5-S0/frameworks/707fd1a2-1a93-4e9f-a9b2-5453a207b4c5-/executors/14a26e2a-58b7-4166-909c-c90787d84fcb/runs/492a5d0a-0060-416c-ad80-dd0441f558dc/containers/62c170bb-7298-4209-b797-80d7ca73353e
> I0118 22:17:23.962441  2887 containerizer.cpp:2481] Container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e has 
> exited
> I0118 22:17:23.962484  2887 containerizer.cpp:2118] Destroying container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e in 
> RUNNING state
> I0118 22:17:23.962715  2887 launcher.cpp:149] Asked to destroy container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e
> I0118 22:17:23.977562  2887 process.cpp:3733] Failed to process request for 
> '/slave(69)/api/v1': Container has or is being destroyed
> W0118 22:17:23.978216  2887 http.cpp:2734] Failed to attach to nested 
> container 
> 492a5d0a-0060-416c-ad80-dd0441f558dc.62c170bb-7298-4209-b797-80d7ca73353e: 
> Container has or is being destroyed
> I0118 22:17:23.978330  2887 process.cpp:1435] Returning '500 Internal Server 
> Error' for '/slave(69)/api/v1' (Container has or is being destroyed)
> ../../src/tests/api_tests.cpp:3960: Failure
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> {code}
> Find attached the full log from a failed run.



--
This message was sent by Atlassian JIRA

[jira] [Updated] (MESOS-6864) Container Exec should be possible with tasks belonging to a task group

2017-01-19 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6864:
--
Story Points: 5

> Container Exec should be possible with tasks belonging to a task group
> --
>
> Key: MESOS-6864
> URL: https://issues.apache.org/jira/browse/MESOS-6864
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: debugging, mesosphere
>
> {{LaunchNestedContainerSession}} currently requires the parent container to 
> be an Executor 
> (https://github.com/apache/mesos/blob/f89f28724f5837ff414dc6cc84e1afb63f3306e5/src/slave/http.cpp#L2189-L2211).
> This works for command tasks, because the task container id is the same as 
> the executor container id.
> But it won't work for pod tasks, whose container id is different from the 
> executor's container id.
> In order to resolve this ticket, we need to allow launching a child container 
> at an arbitrary level.
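> For illustration, the kind of call this would enable once arbitrary nesting 
> levels are allowed: a nested {{ContainerID}} whose parent chain goes task -> 
> executor (all ids and the command are placeholders):
> {code}
> {
>   "type": "LAUNCH_NESTED_CONTAINER_SESSION",
>   "launch_nested_container_session": {
>     "container_id": {
>       "value": "<new-debug-container-id>",
>       "parent": {
>         "value": "<task-container-id>",
>         "parent": {"value": "<executor-container-id>"}
>       }
>     },
>     "command": {"shell": true, "value": "ls"}
>   }
> }
> {code}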



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6948) AgentAPITest.LaunchNestedContainerSession is flaky

2017-01-20 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832393#comment-15832393
 ] 

Anand Mazumdar commented on MESOS-6948:
---

This showed up with an identical stack trace on ASF CI.
{code}
[ RUN  ] ContentType/AgentAPITest.LaunchNestedContainerSession/1
I0120 20:51:26.939275 26158 cluster.cpp:160] Creating default 'local' authorizer
I0120 20:51:26.940529 26164 master.cpp:383] Master 
2b435d8e-3792-458e-96ff-0ecff4b4aa54 (9a52de5d9dcd) started on 172.17.0.3:38943
I0120 20:51:26.940562 26164 master.cpp:385] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/rLq2TT/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/rLq2TT/master" 
--zk_session_timeout="10secs"
I0120 20:51:26.940747 26164 master.cpp:435] Master only allowing authenticated 
frameworks to register
I0120 20:51:26.940754 26164 master.cpp:449] Master only allowing authenticated 
agents to register
I0120 20:51:26.940757 26164 master.cpp:462] Master only allowing authenticated 
HTTP frameworks to register
I0120 20:51:26.940762 26164 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/rLq2TT/credentials'
I0120 20:51:26.940891 26164 master.cpp:507] Using default 'crammd5' 
authenticator
I0120 20:51:26.940939 26164 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0120 20:51:26.940978 26164 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0120 20:51:26.941035 26164 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0120 20:51:26.941082 26164 master.cpp:587] Authorization enabled
I0120 20:51:26.941186 26167 hierarchical.cpp:151] Initialized hierarchical 
allocator process
I0120 20:51:26.941215 26172 whitelist_watcher.cpp:77] No whitelist given
I0120 20:51:26.941815 26164 master.cpp:2119] Elected as the leading master!
I0120 20:51:26.941829 26164 master.cpp:1641] Recovering from registrar
I0120 20:51:26.941910 26170 registrar.cpp:329] Recovering registrar
I0120 20:51:26.942194 26173 registrar.cpp:362] Successfully fetched the 
registry (0B) in 175872ns
I0120 20:51:26.942227 26173 registrar.cpp:461] Applied 1 operations in 5497ns; 
attempting to update the registry
I0120 20:51:26.942468 26172 registrar.cpp:506] Successfully updated the 
registry in 222976ns
I0120 20:51:26.942519 26172 registrar.cpp:392] Successfully recovered registrar
I0120 20:51:26.942750 26164 hierarchical.cpp:178] Skipping recovery of 
hierarchical allocator: nothing to recover
I0120 20:51:26.942751 26165 master.cpp:1757] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0120 20:51:26.943758 26158 containerizer.cpp:220] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni
W0120 20:51:26.944063 26158 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges, but is running as user mesos
W0120 20:51:26.944141 26158 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I0120 20:51:26.945397 26158 cluster.cpp:446] Creating default 'local' authorizer
I0120 20:51:26.945864 26163 slave.cpp:209] Mesos agent started on 
(579)@172.17.0.3:38943
I0120 20:51:26.945879 26163 slave.cpp:210] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" 
--credential="/tmp/ContentType_AgentAPITest_LaunchNestedContainerSession_1

[jira] [Assigned] (MESOS-6296) Default executor should be able to launch multiple task groups

2017-01-25 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-6296:
-

Assignee: Anand Mazumdar

> Default executor should be able to launch multiple task groups
> --
>
> Key: MESOS-6296
> URL: https://issues.apache.org/jira/browse/MESOS-6296
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Anand Mazumdar
>
> This gives more flexibility for schedulers that do not know all the tasks 
> that they want to launch up front. For example a backup task that needs to be 
> launched regularly next to a main task in the same executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6296) Default executor should be able to launch multiple task groups

2017-01-25 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6296:
--
  Sprint: Mesosphere Sprint 50
Story Points: 5

> Default executor should be able to launch multiple task groups
> --
>
> Key: MESOS-6296
> URL: https://issues.apache.org/jira/browse/MESOS-6296
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Anand Mazumdar
>
> This gives more flexibility for schedulers that do not know all the tasks 
> that they want to launch up front. For example a backup task that needs to be 
> launched regularly next to a main task in the same executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6989) Docker executor segfaults in ~MesosExecutorDriver()

2017-01-26 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6989:
--
Priority: Blocker  (was: Major)

Moving it to blocker since it does result in a stack trace in the task's 
stdout. Note that our existing tests might not be catching this because they 
might not be checking that the executor's exit status code is non-zero for the 
docker/default executor.

> Docker executor segfaults in ~MesosExecutorDriver()
> ---
>
> Key: MESOS-6989
> URL: https://issues.apache.org/jira/browse/MESOS-6989
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Reporter: Jan-Philip Gehrcke
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: mesosphere
>
> With the current Mesos master state (commit 
> 42e515bc5c175a318e914d34473016feda4db6ff), the Docker executor segfaults 
> during shutdown. 
> Steps to reproduce:
> 1) Start master:
> {code}
> $ ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/tmp/jp/mesos
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0125 13:41:15.963775 14744 main.cpp:278] Build: 2017-01-25 13:37:42 by jp
> I0125 13:41:15.963868 14744 main.cpp:279] Version: 1.2.0
> I0125 13:41:15.963877 14744 main.cpp:286] Git SHA: 
> 42e515bc5c175a318e914d34473016feda4db6ff
> {code}
> (note that building it at 13:37 is not part of the repro)
> 2) Start agent:
> {code}
> $ ./bin/mesos-slave.sh --containerizers=mesos,docker --master=127.0.0.1:5050 
> --work_dir=/tmp/jp/mesos
> {code}
> 3) Run {{mesos-execute}} with the Docker containerizer:
> {code}
> $ ./src/mesos-execute --master=127.0.0.1:5050 --name=testcommand 
> --containerizer=docker --docker_image=debian --command=env
> I0125 13:43:59.704973 14951 scheduler.cpp:184] Version: 1.2.0
> I0125 13:43:59.706425 14952 scheduler.cpp:470] New master detected at 
> master@127.0.0.1:5050
> Subscribed with ID 57596743-06f4-45f1-a975-348cf70589b1-
> Submitted task 'testcommand' to agent 
> '57596743-06f4-45f1-a975-348cf70589b1-S0'
> Received status update TASK_RUNNING for task 'testcommand'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FINISHED for task 'testcommand'
>   message: 'Container exited with status 0'
>   source: SOURCE_EXECUTOR
> {code}
> Relevant agent output that shows the executor segfault:
> {code}
> [...]
> I0125 13:44:16.249191 14823 slave.cpp:4328] Got exited event for 
> executor(1)@192.99.40.208:33529
> I0125 13:44:16.347095 14830 docker.cpp:2358] Executor for container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53 has exited
> I0125 13:44:16.347127 14830 docker.cpp:2052] Destroying container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.347439 14830 docker.cpp:2179] Running docker stop on container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.349215 14826 slave.cpp:4691] Executor 'testcommand' of 
> framework 57596743-06f4-45f1-a975-348cf70589b1- terminated with signal 
> Segmentation fault (core dumped)
> [...]
> {code}
> The complete task stderr:
> {code}
> $ cat 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/latest/stderr
>  
> I0125 13:44:12.850073 15030 exec.cpp:162] Version: 1.2.0
> I0125 13:44:12.864229 15050 exec.cpp:237] Executor registered on agent 
> 57596743-06f4-45f1-a975-348cf70589b1-S0
> I0125 13:44:12.865842 15054 docker.cpp:850] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 
> --env-file /tmp/xFZ8G9 -v 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/396282a9-7bf0-48ee-ba07-3ff2ca801d53:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-57596743-06f4-45f1-a975-348cf70589b1-S0.396282a9-7bf0-48ee-ba07-3ff2ca801d53
>  debian -c env
> I0125 13:44:15.248721 15064 exec.cpp:410] Executor asked to shutdown
> *** Aborted at 1485369856 (unix time) try "date -d @1485369856" if you are 
> using GNU date ***
> PC: @ 0x7fb38f153dd0 (unknown)
> *** SIGSEGV (@0x68) received by PID 15030 (TID 0x7fb3961a88c0) from PID 104; 
> stack trace: ***
> @ 0x7fb38f15b5c0 (unknown)
> @ 0x7fb38f153dd0 (unknown)
> @ 0x7fb39332c607 __gthread_mutex_lock()
> @ 0x7fb39332c657 __gthread_recursive_mutex_lock()
> @ 0x7fb39332edca std::recursive_mutex::lock()
> @ 0x7fb393337bd8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENKUlPS0_E_clES5_
> @ 0x7fb393337bf8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENUlPS0_E_4_FUNES5_
> @ 0x7fb39333ba6b Synchronized<>::Synchronized()
> @ 0x7fb393337cac synchronize<>()
> @ 0x7fb39492f15c process::ProcessManager::wait()
> @ 0

[jira] [Updated] (MESOS-7017) HTTP API responses can crash the master.

2017-01-27 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7017:
--
Priority: Critical  (was: Major)

> HTTP API responses can crash the master.
> 
>
> Key: MESOS-7017
> URL: https://issues.apache.org/jira/browse/MESOS-7017
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: James Peach
>Priority: Critical
>
> The master can crash when generating large responses to small API requests. 
> One manifestation of this is querying the tasks.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
> F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
> t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
> evolving from mesos.master.Response
> {noformat}
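> For reference, a minimal sketch of the knob the protobuf warning points at, 
> i.e. parsing through a {{CodedInputStream}} with a raised limit (the 
> two-argument {{SetTotalBytesLimit}} overload is an assumption about the 
> protobuf version in use; this is a sketch, not the actual fix):
> {code}
> #include <stdint.h>
>
> #include <string>
>
> #include <google/protobuf/io/coded_stream.h>
> #include <google/protobuf/message.h>
>
> // Parse a (possibly larger than 64MB) serialized message with a raised
> // total bytes limit and warning threshold.
> bool parseLarge(const std::string& data, google::protobuf::Message* message)
> {
>   google::protobuf::io::CodedInputStream stream(
>       reinterpret_cast<const uint8_t*>(data.data()),
>       static_cast<int>(data.size()));
>
>   stream.SetTotalBytesLimit(512 * 1024 * 1024, 128 * 1024 * 1024);
>
>   return message->ParsePartialFromCodedStream(&stream);
> }
> {code}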



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] (MESOS-6936) Add support for media types needed for streaming request/responses.

2017-01-29 Thread Anand Mazumdar (JIRA)
Anand Mazumdar updated MESOS-6936:
--
Description: The proposed header names were changed from {{RecordIO-Accept}} 
and {{RecordIO-Content-Type}} to {{Message-Accept}} and 
{{Message-Content-Type}}; the rest of the description (supported values of 
{{application/json}} and {{application/x-protobuf}} for both headers, a 
response {{Content-Type}} of {{application/recordio}}, and the design-doc link 
https://docs.google.com/document/d/1OV1D5uUmWNvTaX3qEO9fZGo4FRlCSqrx0IHq5GuLAk8/edit#) 
remains the same.



--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)

[jira] [Updated] (MESOS-7053) Support multiple challenges for WWW-Authencate http header.

2017-02-02 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7053:
--
Summary: Support multiple challenges for WWW-Authencate http header.  (was: 
Support multiple challenges WWW-Authencate http heade.)

> Support multiple challenges for WWW-Authencate http header.
> ---
>
> Key: MESOS-7053
> URL: https://issues.apache.org/jira/browse/MESOS-7053
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Gilbert Song
>  Labels: authentication, http, libprocess
>
> According to RFC, duplicate http headers are not allowed:
> https://tools.ietf.org/html/rfc7230#section-3.2.2
> However, multiple headers can be appended as a comma separated list for one 
> single header section. This is also true for multiple challenges in 
> Www-Authenticate with a 401 Unauthorized response:
> https://tools.ietf.org/html/rfc2617#section-4.6
> We should support the multiple-challenges case and figure out which one is 
> the strongest auth-scheme to go with.
> A simple proposal might be selecting an auth-scheme by defining a priority, 
> e.g.,
> 1. Bearer
> 2. Basic
> ...
> For sure, more discussion is needed.
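> For illustration, the kind of response this is about, with two challenges 
> folded into a single {{WWW-Authenticate}} header (scheme order and realm 
> values are placeholders):
> {code}
> HTTP/1.1 401 Unauthorized
> WWW-Authenticate: Bearer realm="mesos", Basic realm="mesos"
> Content-Length: 0
> {code}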



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7053) Support multiple challenges for WWW-Authenticate http header.

2017-02-02 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7053:
--
Summary: Support multiple challenges for WWW-Authenticate http header.  
(was: Support multiple challenges for WWW-Authencate http header.)

> Support multiple challenges for WWW-Authenticate http header.
> -
>
> Key: MESOS-7053
> URL: https://issues.apache.org/jira/browse/MESOS-7053
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Gilbert Song
>  Labels: authentication, http, libprocess
>
> According to RFC, duplicate http headers are not allowed:
> https://tools.ietf.org/html/rfc7230#section-3.2.2
> However, multiple headers can be appended as a comma separated list for one 
> single header section. This is also true for multiple challenges in 
> Www-Authenticate with a 401 Unauthorized response:
> https://tools.ietf.org/html/rfc2617#section-4.6
> We should support the multiple-challenges case and figure out which one is 
> the strongest auth-scheme to go with.
> A simple proposal might be selecting an auth-scheme by defining a priority, 
> e.g.,
> 1. Bearer
> 2. Basic
> ...
> For sure, more discussion is needed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7053) Consider supporting multiple challenges for WWW-Authenticate http header.

2017-02-02 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7053:
--
Summary: Consider supporting multiple challenges for WWW-Authenticate http 
header.  (was: Support multiple challenges for WWW-Authenticate http header.)

> Consider supporting multiple challenges for WWW-Authenticate http header.
> -
>
> Key: MESOS-7053
> URL: https://issues.apache.org/jira/browse/MESOS-7053
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Gilbert Song
>  Labels: authentication, http, libprocess
>
> According to RFC, duplicate http headers are not allowed:
> https://tools.ietf.org/html/rfc7230#section-3.2.2
> However, multiple headers can be appended as a comma separated list for one 
> single header section. This is also true for multiple challenges in 
> Www-Authenticate with a 401 Unauthorized response:
> https://tools.ietf.org/html/rfc2617#section-4.6
> We should support the multiple-challenges case and figure out which one is 
> the strongest auth-scheme to go with.
> A simple proposal might be selecting an auth-scheme by defining a priority, 
> e.g.,
> 1. Bearer
> 2. Basic
> ...
> For sure, more discussion is needed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5186) mesos.interface: Allow using protobuf 3.x

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851752#comment-15851752
 ] 

Anand Mazumdar commented on MESOS-5186:
---

[~mcypark] [~jieyu] Would either of you want to shepherd these patches?

> mesos.interface: Allow using protobuf 3.x
> -
>
> Key: MESOS-5186
> URL: https://issues.apache.org/jira/browse/MESOS-5186
> Project: Mesos
>  Issue Type: Improvement
>  Components: python api
>Reporter: Myautsai PAN
>Assignee: Yong Tang
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We're working on integrating TensorFlow (https://www.tensorflow.org) with 
> mesos. Both require {{protobuf}}. The python package {{mesos.interface}} 
> requires {{protobuf>=2.6.1,<3}}, but {{tensorflow}} requires 
> {{protobuf>=3.0.0}}. Although protobuf 3.x is not compatible with protobuf 
> 2.x, we modified the {{setup.py}} 
> (https://github.com/apache/mesos/blob/66cddaf/src/python/interface/setup.py.in#L29)
> from {{'install_requires': [ 'google-common>=0.0.1', 'protobuf>=2.6.1,<3' 
> ],}} to {{'install_requires': [ 'google-common>=0.0.1', 'protobuf>=2.6.1' ],}}
> and it works fine. Would you please consider supporting protobuf 3.x 
> officially in the next release? Maybe just removing the {{,<3}} restriction 
> is enough.
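> For clarity, the relaxed constraint being asked for would look roughly like 
> this in {{setup.py.in}} (a sketch, not a tested patch):
> {code}
> 'install_requires': ['google-common>=0.0.1', 'protobuf>=2.6.1'],
> {code}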



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-3901) Enable Mesos to be able know when it is hosted behind a proxy with a URL prefix

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851769#comment-15851769
 ] 

Anand Mazumdar commented on MESOS-3901:
---

[~haosd...@gmail.com] Are we recommending the above workaround to everyone or 
have we considered fixing the problem in the Mesos Web UI itself if possible?

> Enable Mesos to be able know when it is hosted behind a proxy with a URL 
> prefix
> ---
>
> Key: MESOS-3901
> URL: https://issues.apache.org/jira/browse/MESOS-3901
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Harpreet
>Assignee: haosdent
>  Labels: mesosphere
>
> If Mesos is run behind a proxy with a URL prefix e.g.  
> https://:/services/mesos (`/services/mesos` being the URL 
> prefix), sandboxes in mesos don't load. This happens because when
>   Mesos is accessed through a proxy at 
> https://:/services/mesos, Mesos tries to request slave state 
> from 
> https://:/slave/20151110-232502-218431498-5050-1234-S1/slave(1)/state.json?jsonp=angular.callbacks._4.
>  This URL is missing the /services/mesos path prefix, so the request fails. 
> Fixing this by rewriting URLs in the body of every response, would not be a 
> clean solution and can be error prone.
> After searching around a bit we've learned that this is apparently a common 
> issue with webapps, because there is no standard specification for making 
> them aware of their base URL path. Some will allow you to specify a base path 
> in configuration[1], others will respect an X-Forwarded-Path header if a 
> proxy provides it[2], and others don't handle this at all. 
> It would be great to have explicit support for this in Mesos.
> [1] 
> http://search.cpan.org/~bobtfish/Catalyst-TraitFor-Request-ProxyBase-0.05/lib/Catalyst/TraitFor/Request/ProxyBase.pm
> [2] https://github.com/mattkenney/feedsquish/blob/master/rupta.py#L94



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7017) HTTP API responses can crash the master.

2017-02-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7017:
--
Target Version/s: 1.3.0

> HTTP API responses can crash the master.
> 
>
> Key: MESOS-7017
> URL: https://issues.apache.org/jira/browse/MESOS-7017
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: James Peach
>Priority: Critical
>
> The master can crash when generating large responses to small API requests. 
> One manifestation of this is querying the tasks.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
> F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
> t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
> evolving from mesos.master.Response
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851779#comment-15851779
 ] 

Anand Mazumdar commented on MESOS-7007:
---

[~pierrecdn] Did the above suggested workaround resolve the issue?

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0
>Reporter: Pierre Cheynier
>
> I am facing this issue, which prevents me from upgrading to 1.1.0 (the change 
> was introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms 
> further.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6914) Command 'hadoop version 2>&1' failed

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851791#comment-15851791
 ] 

Anand Mazumdar commented on MESOS-6914:
---

Did you get a chance to see this [~kingpoker]?

> Command 'hadoop version 2>&1' failed
> 
>
> Key: MESOS-6914
> URL: https://issues.apache.org/jira/browse/MESOS-6914
> Project: Mesos
>  Issue Type: Bug
>Reporter: yangjunfeng
>
> I am new to Spark on Mesos.
> When I run spark-shell on Mesos, I get the error below:
> Command 'hadoop version 2>&1' failed; this is the output:
> sh: hadoop: command not found
> Failed to fetch 
> 'hdfs://188.188.0.189:9000/usr/yjf/spark-2.1.0-bin-hadoop2.7.tgz': Failed to 
> create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was 
> either not found or exited with a non-zero exit status: 127
> Failed to synchronize with agent (it's probably exited)
> How can I fix this problem?
> Thanks a lot!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6915) Encountered a problem while starting mesos-master

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851795#comment-15851795
 ] 

Anand Mazumdar commented on MESOS-6915:
---

[~jijoj] Can you upload your master logs and add more context to help us 
triage the issue?

> Encountered a problem while starting mesos-master
> -
>
> Key: MESOS-6915
> URL: https://issues.apache.org/jira/browse/MESOS-6915
> Project: Mesos
>  Issue Type: Wish
>  Components: agent, master
>Affects Versions: 1.1.0
>Reporter: Jijo Joy
>Assignee: Kevin Klues
>
> I0112 17:23:43.639902 17432 http.cpp:391] HTTP GET for /master/state from 
> 192.168.10.35:44407 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
> I0112 17:23:51.350908 17432 http.cpp:391] HTTP GET for /master/state from 
> 192.168.10.35:29323 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
> I0112 17:23:52.892664 17430 http.cpp:391] HTTP GET for /master/state from 
> 192.168.10.35:29323 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64; 
> Trident/7.0; rv:11.0) like Gecko'
> I am getting the above notification while running mesos-master.sh, but I am 
> still able to get the Java/Python example executed successfully.
> I am new to Apache Mesos and the clustering environment. Kindly help!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7006) Launch docker containers with --cpus instead of cpu-shares

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851846#comment-15851846
 ] 

Anand Mazumdar commented on MESOS-7006:
---

[~jieyu] Would you be able to shepherd this change?

> Launch docker containers with --cpus instead of cpu-shares
> --
>
> Key: MESOS-7006
> URL: https://issues.apache.org/jira/browse/MESOS-7006
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Craig W
>Assignee: Tomasz Janiszewski
>
> docker 1.13 was recently released and it now has a new --cpus flag which 
> allows a user to specify how many cpus a container should have. This is much 
> simpler for users to reason about.
> mesos should switch to starting a container with --cpus instead of 
> --cpu-shares, or at least make it configurable.
> https://blog.docker.com/2017/01/cpu-management-docker-1-13/
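> For illustration, the difference between the two flags (image, command, and 
> values are placeholders):
> {code}
> # Relative weight only; no hard cap on CPU time:
> docker run --cpu-shares 1536 <image> <command>
>
> # Docker 1.13+: hard-limits the container to 1.5 CPUs:
> docker run --cpus 1.5 <image> <command>
> {code}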



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-3901) Enable Mesos to be able know when it is hosted behind a proxy with a URL prefix

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851850#comment-15851850
 ] 

Anand Mazumdar commented on MESOS-3901:
---

gotcha, thanks for the update [~haosd...@gmail.com]!

> Enable Mesos to be able know when it is hosted behind a proxy with a URL 
> prefix
> ---
>
> Key: MESOS-3901
> URL: https://issues.apache.org/jira/browse/MESOS-3901
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Harpreet
>Assignee: haosdent
>  Labels: mesosphere
>
> If Mesos is run behind a proxy with a URL prefix e.g.  
> https://:/services/mesos (`/services/mesos` being the URL 
> prefix), sandboxes in mesos don't load. This happens because when
>   Mesos is accessed through a proxy at 
> https://:/services/mesos, Mesos tries to request slave state 
> from 
> https://:/slave/20151110-232502-218431498-5050-1234-S1/slave(1)/state.json?jsonp=angular.callbacks._4.
>  This URL is missing the /services/mesos path prefix, so the request fails. 
> Fixing this by rewriting URLs in the body of every response, would not be a 
> clean solution and can be error prone.
> After searching around a bit we've learned that this is apparently a common 
> issue with webapps, because there is no standard specification for making 
> them aware of their base URL path. Some will allow you to specify a base path 
> in configuration[1], others will respect an X-Forwarded-Path header if a 
> proxy provides it[2], and others don't handle this at all. 
> It would be great to have explicit support for this in Mesos.
> [1] 
> http://search.cpan.org/~bobtfish/Catalyst-TraitFor-Request-ProxyBase-0.05/lib/Catalyst/TraitFor/Request/ProxyBase.pm
> [2] https://github.com/mattkenney/feedsquish/blob/master/rupta.py#L94



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-3901) Enable Mesos to be able know when it is hosted behind a proxy with a URL prefix

2017-02-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-3901:
--
Priority: Critical  (was: Major)

> Enable Mesos to be able know when it is hosted behind a proxy with a URL 
> prefix
> ---
>
> Key: MESOS-3901
> URL: https://issues.apache.org/jira/browse/MESOS-3901
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Harpreet
>Assignee: haosdent
>Priority: Critical
>  Labels: mesosphere
>
> If Mesos is run behind a proxy with a URL prefix e.g.  
> https://:/services/mesos (`/services/mesos` being the URL 
> prefix), sandboxes in mesos don't load. This happens because when
>   Mesos is accessed through a proxy at 
> https://:/services/mesos, Mesos tries to request slave state 
> from 
> https://:/slave/20151110-232502-218431498-5050-1234-S1/slave(1)/state.json?jsonp=angular.callbacks._4.
>  This URL is missing the /services/mesos path prefix, so the request fails. 
> Fixing this by rewriting URLs in the body of every response, would not be a 
> clean solution and can be error prone.
> After searching around a bit we've learned that this is apparently a common 
> issue with webapps, because there is no standard specification for making 
> them aware of their base URL path. Some will allow you to specify a base path 
> in configuration[1], others will respect an X-Forwarded-Path header if a 
> proxy provides it[2], and others don't handle this at all. 
> It would be great to have explicit support for this in Mesos.
> [1] 
> http://search.cpan.org/~bobtfish/Catalyst-TraitFor-Request-ProxyBase-0.05/lib/Catalyst/TraitFor/Request/ProxyBase.pm
> [2] https://github.com/mattkenney/feedsquish/blob/master/rupta.py#L94



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5364) Consider adding `unlink` functionality to libprocess

2017-02-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851917#comment-15851917
 ] 

Anand Mazumdar commented on MESOS-5364:
---

Resolving this issue since {{relink}} has been implemented. I will file another 
issue to add functionality in the executor driver to use {{relink}}.

> Consider adding `unlink` functionality to libprocess
> 
>
> Key: MESOS-5364
> URL: https://issues.apache.org/jira/browse/MESOS-5364
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>  Labels: libprocess, mesosphere
>
> Currently we don't have {{unlink}} functionality in libprocess, i.e., the 
> equivalent of Erlang's http://erlang.org/doc/man/erlang.html#unlink-1. We 
> have a lot of places in our current code with {{TODO}}s for implementing it.
> It can benefit us in a couple of ways:
> - Based on the business logic of the actor, it would want to authoritatively 
> communicate that it is no longer interested in {{ExitedEvent}} for the 
> external remote link.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
> the remote instance being unavailable (e.g., partition, network 
> intermediaries not sending RST's etc). 
> I did not find any old JIRAs pertaining to this, but I did come across an 
> initial attempt to add it, albeit for injecting {{exited}} events, as part of 
> the initial review for MESOS-1059.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7057) Consider using the relink in the executor driver.

2017-02-03 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7057:
-

 Summary: Consider using the relink in the executor driver.
 Key: MESOS-7057
 URL: https://issues.apache.org/jira/browse/MESOS-7057
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.1.0, 1.0.2
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


As outlined in the root cause analysis for MESOS-5332, it is possible for an 
iptables firewall to terminate an idle connection after a timeout (the default 
is 5 days). Once this happens, the executor driver is not notified of the 
disconnection and keeps thinking that it is still connected to the agent.

When the agent process is restarted, the executor still tries to re-use the old 
broken connection to send the re-register message to the agent. This is when it 
eventually realizes that the connection is broken (due to the nature of TCP) 
and calls the {{exited}} callback and commits suicide in 15 minutes upon the 
recovery timeout.

To offset this, an executor should always {{relink}} when it receives a 
reconnect request from the agent.
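
A rough sketch of the idea, assuming the {{RemoteConnection::RECONNECT}} 
semantics added by the relink work; the class and member names below are 
illustrative, not the actual executor driver code:

{code}
#include <process/process.hpp>

using process::ProcessBase;
using process::RemoteConnection;
using process::UPID;

// Illustrative only: a process that re-establishes its persistent link to a
// remote peer instead of reusing a possibly-dead socket.
class RelinkingProcess : public ProcessBase
{
public:
  explicit RelinkingProcess(const UPID& _peer) : peer(_peer) {}

  // Invoked when the remote peer asks us to reconnect.
  void reconnect()
  {
    // Forces a new socket for the link rather than reusing the old one.
    link(peer, RemoteConnection::RECONNECT);
  }

private:
  const UPID peer;
};
{code}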



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7057) Consider using the relink functionality of libprocess in the executor driver.

2017-02-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7057:
--
Summary: Consider using the relink functionality of libprocess in the 
executor driver.  (was: Consider using the relink in the executor driver.)

> Consider using the relink functionality of libprocess in the executor driver.
> -
>
> Key: MESOS-7057
> URL: https://issues.apache.org/jira/browse/MESOS-7057
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an 
> iptables firewall to terminate an idle connection after a timeout (the 
> default is 5 days). Once this happens, the executor driver is not notified of 
> the disconnection. It keeps thinking that it is still connected to the agent.
> When the agent process is restarted, the executor still tries to re-use the 
> old broken connection to send the re-register message to the agent. This is 
> when it eventually realizes that the connection is broken (due to the nature 
> of TCP), calls the {{exited}} callback, and commits suicide after the 15 
> minute recovery timeout.
> To offset this, an executor should always {{relink}} when it receives a 
> reconnect request from the agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7057) Consider using the relink in the executor driver.

2017-02-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7057:
--
Shepherd: Vinod Kone

> Consider using the relink in the executor driver.
> -
>
> Key: MESOS-7057
> URL: https://issues.apache.org/jira/browse/MESOS-7057
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an 
> iptables firewall to terminate an idle connection after a timeout (the 
> default is 5 days). Once this happens, the executor driver is not notified of 
> the disconnection. It keeps thinking that it is still connected to the agent.
> When the agent process is restarted, the executor still tries to re-use the 
> old broken connection to send the re-register message to the agent. This is 
> when it eventually realizes that the connection is broken (due to the nature 
> of TCP), calls the {{exited}} callback, and commits suicide after the 15 
> minute recovery timeout.
> To offset this, an executor should always {{relink}} when it receives a 
> reconnect request from the agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-3901) Enable Mesos to be able know when it is hosted behind a proxy with a URL prefix

2017-02-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-3901:
--
Target Version/s: 1.3.0

> Enable Mesos to be able know when it is hosted behind a proxy with a URL 
> prefix
> ---
>
> Key: MESOS-3901
> URL: https://issues.apache.org/jira/browse/MESOS-3901
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Harpreet
>Assignee: haosdent
>Priority: Critical
>  Labels: mesosphere
>
> If Mesos is run behind a proxy with a URL prefix e.g.  
> https://:/services/mesos (`/services/mesos` being the URL 
> prefix), sandboxes in mesos don't load. This happens because when
>   Mesos is accessed through a proxy at 
> https://:/services/mesos, Mesos tries to request slave state 
> from 
> https://:/slave/20151110-232502-218431498-5050-1234-S1/slave(1)/state.json?jsonp=angular.callbacks._4.
>  This URL is missing the /services/mesos path prefix, so the request fails. 
> Fixing this by rewriting URLs in the body of every response would not be a 
> clean solution and could be error-prone.
> After searching around a bit we've learned that this is apparently a common 
> issue with webapps, because there is no standard specification for making 
> them aware of their base URL path. Some will allow you to specify a base path 
> in configuration[1], others will respect an X-Forwarded-Path header if a 
> proxy provides it[2], and others don't handle this at all. 
> It would be great to have explicit support for this in Mesos.
> [1] 
> http://search.cpan.org/~bobtfish/Catalyst-TraitFor-Request-ProxyBase-0.05/lib/Catalyst/TraitFor/Request/ProxyBase.pm
> [2] https://github.com/mattkenney/feedsquish/blob/master/rupta.py#L94
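
A minimal sketch of option [2] above, assuming the proxy injects an 
{{X-Forwarded-Path}} header; the helper name and the header are illustrative, 
not existing Mesos behaviour:

{code}
#include <map>
#include <string>

// Returns the external base path under which Mesos is served, or "" when no
// proxy prefix applies. Links generated by the master/webui would then be
// prefixed with this value, e.g. basePath + "/slave/<id>/state.json".
std::string basePath(const std::map<std::string, std::string>& headers)
{
  auto it = headers.find("X-Forwarded-Path");
  return it != headers.end() ? it->second   // e.g. "/services/mesos"
                             : "";
}
{code}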



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7067) Add the `OnTerminationPolicy` to the TaskInfo protobuf.

2017-02-06 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7067:
-

 Summary: Add the `OnTerminationPolicy` to the TaskInfo protobuf.
 Key: MESOS-7067
 URL: https://issues.apache.org/jira/browse/MESOS-7067
 Project: Mesos
  Issue Type: Task
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


As outlined in the [design doc | 
https://docs.google.com/document/d/1VxfoZ-DzMHnKY0gzoccHEhx1rvdC2-RATJfJUfiAwGY/edit?usp=sharing],
we need to introduce the {{OnTerminationPolicy}} to the {{TaskInfo}} protobuf, 
allowing every task to specify what an executor should do upon task 
termination.

Note that this issue won't introduce the {{RestartPolicy}} message; that will 
be added via a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7068) Add OnTerminationPolicy handling to the default executor.

2017-02-06 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7068:
-

 Summary: Add OnTerminationPolicy handling to the default executor.
 Key: MESOS-7068
 URL: https://issues.apache.org/jira/browse/MESOS-7068
 Project: Mesos
  Issue Type: Task
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


We should support handling the {{OnTerminationPolicy}} specified in {{TaskInfo}} 
in the default executor. Currently, the default executor's policy is to kill 
the entire task group when a task in the task group fails. Supporting this 
would allow framework developers to specify a custom policy, e.g., keep the 
executor alive when a backup task in the task group fails.
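
A rough sketch of the intended behaviour; {{OnTerminationPolicy}} does not 
exist yet (MESOS-7067 proposes it), so the enum and helper below are purely 
hypothetical:

{code}
// Hypothetical policy values; the real protobuf may look different.
enum class OnTermination
{
  KILL_TASK_GROUP, // Today's hard-coded behaviour of the default executor.
  LEAVE_RUNNING    // Keep the executor and the remaining tasks alive.
};

// Decides whether the default executor should kill the whole task group
// when one task in the group terminates.
bool shouldKillTaskGroup(OnTermination policy, bool taskFailed)
{
  if (!taskFailed) {
    return false; // Per the description, only a failed task tears down the group.
  }
  return policy == OnTermination::KILL_TASK_GROUP;
}
{code}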



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7075) mesos-execute rejects all offers

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-7075:
-

Assignee: Benjamin Mahler
Target Version/s: 1.2.0
Priority: Blocker  (was: Major)

Assigning to [~bmahler] as per offline discussion.

> mesos-execute rejects all offers
> 
>
> Key: MESOS-7075
> URL: https://issues.apache.org/jira/browse/MESOS-7075
> Project: Mesos
>  Issue Type: Bug
>  Components: framework
>Affects Versions: 1.2.0
>Reporter: Gastón Kleiman
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: resources
>
> Mesos now includes {{Resource.AllocationInfo}} in the resources sent in an 
> offer.
> A {{Resources}} instance without {{Resource.AllocationInfo}} will not be 
> contained in one that has it set. The subtraction operator will also treat 
> those instances differently.
> This makes {{mesos-execute}} reject all offers.
> We need to update {{mesos-execute}} and probably other C++ frameworks in our 
> repo that use the Resources class.
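
A rough sketch of the kind of fix this implies for a v0 C++ framework such as 
{{mesos-execute}}: stamp the allocation role onto the resources the framework 
constructs itself so that containment and subtraction against offered resources 
work again. Exact helper names may differ; this just sets the protobuf field 
directly.

{code}
#include <string>

#include <mesos/resources.hpp>

using mesos::Resource;
using mesos::Resources;

// Copies `wanted` with Resource.AllocationInfo.role set on every resource,
// mirroring what the master now sets on the resources inside an offer.
Resources allocate(const Resources& wanted, const std::string& role)
{
  Resources result;

  for (Resource resource : wanted) {          // Copy each Resource.
    resource.mutable_allocation_info()->set_role(role);
    result += resource;
  }

  return result;
}

// Usage sketch: offered.contains(allocate(taskResources, frameworkRole)).
{code}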



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6033) Scheduler library's reconnection logic should be teardown-aware

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6033:
--
Labels: mesosphere newbie  (was: mesosphere)

> Scheduler library's reconnection logic should be teardown-aware
> ---
>
> Key: MESOS-6033
> URL: https://issues.apache.org/jira/browse/MESOS-6033
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Greg Mann
>  Labels: mesosphere, newbie
>
> The reconnection logic in the scheduler library is currently not aware of any 
> explicit teardown calls sent by the scheduler to the master. This means that 
> after a scheduler has explicitly torn itself down, it will attempt a 
> reconnection to the master under the covers, and if this succeeds the 
> scheduler's {{connected}} callback will be invoked. This forces frameworks to 
> implement their own teardown-awareness logic in their callbacks.
> We should add teardown-awareness to the scheduler library's reconnection code 
> so that framework authors don't have to worry about this.
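
A minimal sketch of the proposed fix inside the scheduler library (names are 
illustrative, not the actual {{src/scheduler/scheduler.cpp}} code): remember 
that a TEARDOWN was sent and skip the automatic reconnection afterwards.

{code}
// Sketch only: the real library would keep this state inside its
// scheduler process alongside the existing reconnection logic.
class SchedulerConnectionState
{
public:
  // Called when the scheduler sends an explicit TEARDOWN call.
  void teardown() { tornDown = true; }

  // Called when the master connection is lost.
  void disconnected()
  {
    if (tornDown) {
      return;      // Torn down on purpose: no reconnect, no connected().
    }
    reconnect();   // Existing backoff + re-subscribe logic.
  }

private:
  void reconnect() { /* ... existing reconnection logic ... */ }

  bool tornDown = false;
};
{code}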



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6007) Operator API v1 Improvements

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6007:
--
Labels: mesosphere  (was: )

> Operator API v1 Improvements
> 
>
> Key: MESOS-6007
> URL: https://issues.apache.org/jira/browse/MESOS-6007
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>  Labels: mesosphere
>
> This is a follow-up epic to track the improvement work from MESOS-4791.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5735) Update WebUI to use v1 operator API

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5735:
--
Labels: mesosphere  (was: )

> Update WebUI to use v1 operator API
> ---
>
> Key: MESOS-5735
> URL: https://issues.apache.org/jira/browse/MESOS-5735
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Jay Guo
>  Labels: mesosphere
>
> Having the WebUI use the v1 API would be a good validation of its usefulness 
> and correctness.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5726) Benchmark the v1 Operator API

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5726:
--
Labels: mesosphere  (was: )

> Benchmark the v1 Operator API
> -
>
> Key: MESOS-5726
> URL: https://issues.apache.org/jira/browse/MESOS-5726
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: haosdent
>  Labels: mesosphere
>
> Just like we did with the v1 framework API, we need to benchmark the 
> performance of the v1 operator API.
> As part of this benchmarking, we should evaluate whether evolving 
> un-versioned protos to versioned protos in some of the API handlers (e.g., 
> getFrameworks) is expensive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5858) Operator API should accept the request with charset specified in Content-Type

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5858:
--
Labels: mesosphere  (was: )

> Operator API should accept the request with charset specified in Content-Type
> -
>
> Key: MESOS-5858
> URL: https://issues.apache.org/jira/browse/MESOS-5858
> Project: Mesos
>  Issue Type: Bug
>Reporter: zhou xing
>Assignee: Abhishek Dasgupta
>Priority: Minor
>  Labels: mesosphere
>
> When requesting from a client like the WebUI, the Content-Type of the request 
> may be set to {{application/json; charset=utf-8}}. In that case the request 
> gets a 415 response code with the message
> {{Expecting 'Content-Type' of application/json or application/x-protobuf}}.
> The following code in http.cpp just compares the content type of the request 
> directly with "application/json" or "application/x-protobuf":
> {code}
> ...
>   if (contentType.get() == APPLICATION_PROTOBUF) {
>     if (!v1Call.ParseFromString(request.body)) {
>       return BadRequest("Failed to parse body into Call protobuf");
>     }
>   } else if (contentType.get() == APPLICATION_JSON) {
> ...
> {code}
> We need to accept a request with a charset set in Content-Type.
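
A minimal sketch of the fix, assuming we only need to compare the media type 
and can ignore any parameters such as {{charset}}:

{code}
#include <string>

// Returns the media type portion of a Content-Type value, e.g.
// "application/json; charset=utf-8" -> "application/json".
std::string mediaType(const std::string& contentType)
{
  std::string type = contentType.substr(0, contentType.find(';'));

  // Trim trailing whitespace left before the ';'.
  const size_t end = type.find_last_not_of(" \t");
  return end == std::string::npos ? "" : type.substr(0, end + 1);
}

// The handler would then compare:
//   mediaType(contentType.get()) == APPLICATION_JSON
// instead of comparing the raw header value.
{code}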



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6082) Add scheduler Call and Event based metrics to the master.

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6082:
--
Labels: mesosphere  (was: )

> Add scheduler Call and Event based metrics to the master.
> -
>
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Abhishek Dasgupta
>Priority: Critical
>  Labels: mesosphere
>
> Currently, the master only has metrics for the old-style messages and these 
> are re-used for calls unfortunately:
> {code}
>   // Messages from schedulers.
>   process::metrics::Counter messages_register_framework;
>   process::metrics::Counter messages_reregister_framework;
>   process::metrics::Counter messages_unregister_framework;
>   process::metrics::Counter messages_deactivate_framework;
>   process::metrics::Counter messages_kill_task;
>   process::metrics::Counter messages_status_update_acknowledgement;
>   process::metrics::Counter messages_resource_request;
>   process::metrics::Counter messages_launch_tasks;
>   process::metrics::Counter messages_decline_offers;
>   process::metrics::Counter messages_revive_offers;
>   process::metrics::Counter messages_suppress_offers;
>   process::metrics::Counter messages_reconcile_tasks;
>   process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics 
> that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}
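
A rough sketch of how such counters could be wired up with libprocess metrics 
(illustrative only; the real change would cover every call and event type and 
live in the master's metrics struct):

{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Per-call counters following the naming scheme proposed above.
struct SchedulerCallMetrics
{
  SchedulerCallMetrics()
    : calls("scheduler/calls"),
      calls_accept("scheduler/calls/accept"),
      calls_decline("scheduler/calls/decline")
  {
    process::metrics::add(calls);
    process::metrics::add(calls_accept);
    process::metrics::add(calls_decline);
  }

  ~SchedulerCallMetrics()
  {
    process::metrics::remove(calls);
    process::metrics::remove(calls_accept);
    process::metrics::remove(calls_decline);
  }

  process::metrics::Counter calls;
  process::metrics::Counter calls_accept;
  process::metrics::Counter calls_decline;
};

// The scheduler call handler would then do, e.g.:
//   ++metrics.calls; ++metrics.calls_decline;
{code}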



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6016) Expose the unversioned Call and Event Scheduler/Executor Protobufs.

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6016:
--
Labels: mesosphere  (was: mesos)

> Expose the unversioned Call and Event Scheduler/Executor Protobufs.
> ---
>
> Key: MESOS-6016
> URL: https://issues.apache.org/jira/browse/MESOS-6016
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> Currently, we don't expose the un-versioned (v0) {{Call}}/{{Event}} 
> scheduler/executor protobufs externally to framework authors. This is a bit 
> disjoint since we already expose the unversioned Mesos protos. The reasoning 
> for not doing so earlier was that Mesos would use the v0 protobufs as an 
> alternative to having separate internal protobufs.
> However, that is not going to work. Eventually, when we introduce a 
> backward-incompatible change in the {{v1}} protobufs, we would create new 
> {{v2}} protobufs. But we would need to ensure that {{v2}} protobufs can 
> somehow be translated to {{v0}} without breaking existing users. That's a 
> pretty hard thing to do! In the interim, to help framework authors migrate 
> their frameworks (they might be storing old protobufs in ZK or other reliable 
> storage), we should expose the v0 scheduler/executor protobufs too and create 
> another internal translation layer for Mesos.
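
For illustration, a sketch of what such a translation layer boils down to: v0 
and v1 messages share the same wire format, so conversion can go through 
serialization (this mirrors the approach of the internal {{evolve()}}/{{devolve()}} 
helpers, though those are implementation details and may change):

{code}
#include <string>

#include <google/protobuf/message.h>

#include <glog/logging.h>

// Converts between two protobuf types that share the same wire format,
// e.g. an unversioned scheduler::Call and a v1::scheduler::Call.
template <typename To, typename From>
To convert(const From& from)
{
  std::string data;
  CHECK(from.SerializeToString(&data));

  To to;
  CHECK(to.ParsePartialFromString(data))
    << "Failed to convert " << from.GetTypeName()
    << " to " << to.GetTypeName();

  return to;
}
{code}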



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6846) Support `teardown` in the v1 operator API.

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6846:
--
Labels: mesosphere  (was: )

> Support `teardown` in the v1 operator API.
> --
>
> Key: MESOS-6846
> URL: https://issues.apache.org/jira/browse/MESOS-6846
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Joerg Schad
>  Labels: mesosphere
>
> Currently, the v1 operator API does not support teardown of frameworks.
> The semantics should be similar to the old HTTP endpoint: 
> http://mesos.apache.org/documentation/latest/endpoints/master/teardown/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6773) Provide REST-style endpoints that map to v1 master/agent Calls.

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6773:
--
Labels: mesosphere  (was: )

> Provide REST-style endpoints that map to v1 master/agent Calls.
> ---
>
> Key: MESOS-6773
> URL: https://issues.apache.org/jira/browse/MESOS-6773
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>  Labels: mesosphere
>
> With the addition of V1 {{master::Call}} and {{agent::Call}} to replace the 
> V0 REST-style endpoints (e.g. /state, /metrics/snapshot, etc), users can no 
> longer hit these endpoints in their browser or use query parameters. Also, 
> tooling has to send POST data, which is a bit more onerous in most libraries 
> than simply using a URL with query parameters.
> Per the [design 
> doc|https://docs.google.com/document/d/1XfgF4jDXZDVIEWQPx6Y4glgeTTswAAxw6j8dPDAtoeI],
>  we can add a mapping to REST-style endpoints to provide users with a means 
> to hit these endpoints without POST data.
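
A rough sketch of the mapping idea (the URL scheme below is illustrative, not 
the one settled on in the design doc): a read-only GET is translated by the 
handler into the equivalent v1 call type, so no POST body is needed.

{code}
#include <map>
#include <string>

#include <stout/none.hpp>
#include <stout/option.hpp>

// Maps a REST-style path to the name of the v1 master::Call it stands for.
Option<std::string> restToCall(const std::string& path)
{
  static const std::map<std::string, std::string> mapping = {
    {"/master/api/v1/get_health",     "GET_HEALTH"},
    {"/master/api/v1/get_frameworks", "GET_FRAMEWORKS"},
    {"/master/api/v1/get_metrics",    "GET_METRICS"},
  };

  auto it = mapping.find(path);
  if (it == mapping.end()) {
    return None();
  }

  return it->second;
}
{code}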



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7071) Agent State Lacks Framework Principal

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7071:
--
Description: The agent's {{/state}} endpoint needs to have framework 
principal present so that agent-side services do not need to query master state 
for this one piece of information.  (was: The agent state needs to have 
framework principal present so agent-side services do not need to query master 
state for this one piece of information.)

> Agent State Lacks Framework Principal 
> --
>
> Key: MESOS-7071
> URL: https://issues.apache.org/jira/browse/MESOS-7071
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.0
>Reporter: Jeff Malnick
>Assignee: Jeff Malnick
>  Labels: mesosphere
>
> The agent's {{/state}} endpoint needs to have framework principal present so 
> that agent-side services do not need to query master state for this one piece 
> of information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky

2017-02-07 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7082:
-

 Summary: 
ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky
 Key: MESOS-7082
 URL: https://issues.apache.org/jira/browse/MESOS-7082
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: ubuntu 16.04 with/without SSL
Reporter: Anand Mazumdar


Showed up on our internal CI

{noformat}
07:00:17 [ RUN  ] 
ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 'local' 
authorizer
07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started on 
10.153.254.29:38570
07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
--acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
--zk_session_timeout="10secs"
07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
authenticated frameworks to register
07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
authenticated agents to register
07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
authenticated HTTP frameworks to register
07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/ZROfJk/credentials'
07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
authenticator
07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
hierarchical allocator process
07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
given
07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
master!
07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from registrar
07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
the registry (0B) in 176384ns
07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations in 
7573ns; attempting to update the registry
07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
the registry in 210944ns
07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully recovered 
registrar
07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
the registry (172B); allowing 10mins for agents to re-register
07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery of 
hierarchical allocator: nothing to recover
07:00:17 I0207 07:00:17.869381  2952 containerizer.cpp:220] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni
07:00:17 I0207 07:00:17.872557  2952 linux_launcher.cpp:150] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
07:00:17 I0207 07:00:17.872915  2952 provisioner.cpp:249] Using default backend 
'overlay'
07:00:17 I0207 07:00:17.873425  2952 cluster.cpp:446] Creating default 'local' 
authorizer
07:00:17 I0207 07:00:17.873791  2974 slave.cpp:211] Mesos agent started on 
(716)@10.153.254.29:38570
07:00:17 I0207 07:00:17.874034  2952 scheduler.cpp:184] Version: 1.2.0
07:00:17 I0207 07:00:17.873829  2974 slave.cpp:212] Flags at startup: --acls="" 

[jira] [Commented] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky

2017-02-07 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857444#comment-15857444
 ] 

Anand Mazumdar commented on MESOS-7082:
---

[~gilbert] [~jieyu] Any insights on what might be going wrong here? It looks 
like the default executor exited fine (from the logs), but the executor 
container was not destroyed, leading to the failed assertion later.

> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is 
> flaky
> 
>
> Key: MESOS-7082
> URL: https://issues.apache.org/jira/browse/MESOS-7082
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04 with/without SSL
>Reporter: Anand Mazumdar
>  Labels: flaky, flaky-test, mesosphere
>
> Showed up on our internal CI
> {noformat}
> 07:00:17 [ RUN  ] 
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> 07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 
> 'local' authorizer
> 07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
> fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started 
> on 10.153.254.29:38570
> 07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
> --zk_session_timeout="10secs"
> 07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> 07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
> authenticated agents to register
> 07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> 07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/ZROfJk/credentials'
> 07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
> authenticator
> 07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
> 07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> 07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
> given
> 07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
> master!
> 07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from 
> registrar
> 07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
> 07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
> the registry (0B) in 176384ns
> 07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations 
> in 7573ns; attempting to update the registry
> 07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
> the registry in 210944ns
> 07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully 
> recovered registrar
> 07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
> the registry (172B); allowing 10mins for agents to re-register
> 07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery 
> of hierarchical allocator: nothing to recover
> 07:00:17 I0207 07:00:17.869381  2952 co

[jira] [Updated] (MESOS-7081) Mesos copy backend fails on specific docker images

2017-02-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7081:
--
Description: 
Filing this on behalf of a customer who reported (I will pass this along to 
them so that they can add more context, if needed):

It appears that docker images doing very specific things trigger the following 
error:

{noformat}
I0207 22:25:23.212878 14465 executor.cpp:189] Version: 1.1.0
I0207 22:25:23.338201 14464 default_executor.cpp:123] Received SUBSCRIBED event
I0207 22:25:23.338567 14464 default_executor.cpp:127] Subscribed executor on 
10.190.112.153
I0207 22:25:23.338830 14461 default_executor.cpp:123] Received LAUNCH_GROUP 
event
E0207 22:25:23.710819 14461 default_executor.cpp:366] Received '500 Internal 
Server Error' (Collect failed: Failed to copy layer: cp: not writing through 
dangling symlink 
'/var/lib/mesos/slave/provisioner/containers/4e6bdf2b-6a9f-4893-bbb9-8355e4863d22/containers/80a88749-8f79-4544-b7e9-185592f3593d/backends/copy/rootfses/bea1dd9d-a643-48a6-bf75-f6e9e893dbc4/bin/tar'
) while launching child container
I0207 22:25:23.710849 14461 default_executor.cpp:760] Terminating after 1secs
{noformat}

  was:
Filing this on behalf of a customer who reported (I will pass this along to 
them so that they can add more context, if needed):

It appears that docker images doing very specific things trigger the following 
error:

I0207 22:25:23.212878 14465 executor.cpp:189] Version: 1.1.0
I0207 22:25:23.338201 14464 default_executor.cpp:123] Received SUBSCRIBED event
I0207 22:25:23.338567 14464 default_executor.cpp:127] Subscribed executor on 
10.190.112.153
I0207 22:25:23.338830 14461 default_executor.cpp:123] Received LAUNCH_GROUP 
event
E0207 22:25:23.710819 14461 default_executor.cpp:366] Received '500 Internal 
Server Error' (Collect failed: Failed to copy layer: cp: not writing through 
dangling symlink 
'/var/lib/mesos/slave/provisioner/containers/4e6bdf2b-6a9f-4893-bbb9-8355e4863d22/containers/80a88749-8f79-4544-b7e9-185592f3593d/backends/copy/rootfses/bea1dd9d-a643-48a6-bf75-f6e9e893dbc4/bin/tar'
) while launching child container
I0207 22:25:23.710849 14461 default_executor.cpp:760] Terminating after 1secs


> Mesos copy backend fails on specific docker images
> --
>
> Key: MESOS-7081
> URL: https://issues.apache.org/jira/browse/MESOS-7081
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
>Reporter: Harpreet
>
> Filing this on behalf of a customer who reported (I will pass this along to 
> them so that they can add more context, if needed):
> It appears that docker images doing very specific things trigger the 
> following error:
> {noformat}
> I0207 22:25:23.212878 14465 executor.cpp:189] Version: 1.1.0
> I0207 22:25:23.338201 14464 default_executor.cpp:123] Received SUBSCRIBED 
> event
> I0207 22:25:23.338567 14464 default_executor.cpp:127] Subscribed executor on 
> 10.190.112.153
> I0207 22:25:23.338830 14461 default_executor.cpp:123] Received LAUNCH_GROUP 
> event
> E0207 22:25:23.710819 14461 default_executor.cpp:366] Received '500 Internal 
> Server Error' (Collect failed: Failed to copy layer: cp: not writing through 
> dangling symlink 
> '/var/lib/mesos/slave/provisioner/containers/4e6bdf2b-6a9f-4893-bbb9-8355e4863d22/containers/80a88749-8f79-4544-b7e9-185592f3593d/backends/copy/rootfses/bea1dd9d-a643-48a6-bf75-f6e9e893dbc4/bin/tar'
> ) while launching child container
> I0207 22:25:23.710849 14461 default_executor.cpp:760] Terminating after 1secs
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7083) No master is currently leading

2017-02-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7083:
--
Component/s: webui

> No master is currently leading
> --
>
> Key: MESOS-7083
> URL: https://issues.apache.org/jira/browse/MESOS-7083
> Project: Mesos
>  Issue Type: Bug
>  Components: master, webui
>Affects Versions: 1.1.0
>Reporter: hemanth makaraju
>
> When I open http://127.0.0.1:5050 in a web browser I see "No master is 
> currently leading", but the mesos-resolve command detects a master:
> mesos-resolve zk://172.17.0.2:2181/mesos
> I0208 11:17:33.489379 24715 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> This is the command I used to run mesos-master:
> mesos-master --zk=zk://127.0.0.1:2181/mesos --quorum=1 
> --advertise_ip=127.0.0.1 --advertise_port=5050 --work_dir=/mesos/master



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7077) Check failed: resource.has_allocation_info().

2017-02-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7077:
--
Shepherd: Michael Park

> Check failed: resource.has_allocation_info().
> -
>
> Key: MESOS-7077
> URL: https://issues.apache.org/jira/browse/MESOS-7077
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: James Peach
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> Seeing this {{CHECK}} fail with top-of-tree master:
> {noformat}
> F0207 16:00:44.657328 3351272 master.cpp:8980] Check failed: 
> resource.has_allocation_info()
> {noformat}
> The symbolicated backtrace is:
> {noformat}
> (gdb) where
> #0  0x7f009f1315e5 in raise () from /lib64/libc.so.6
> #1  0x7f009f132dc5 in abort () from /lib64/libc.so.6
> #2  0x7f00a168e496 in google::DumpStackTraceAndExit () at 
> src/utilities.cc:147
> #3  0x7f00a1685e7d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x7f00a1687c0d in google::LogMessage::SendToLog (this=Unhandled dwarf 
> expression opcode 0xf3
> ) at src/logging.cc:1412
> #5  0x7f00a1685a02 in google::LogMessage::Flush (this=0x7f00917ef560) at 
> src/logging.cc:1281
> #6  0x7f00a16885e9 in google::LogMessageFatal::~LogMessageFatal 
> (this=Unhandled dwarf expression opcode 0xf3
> ) at src/logging.cc:1984
> #7  0x7f00a0a1184c in mesos::internal::master::Slave::addTask 
> (this=0x7f007c830280, task=0x7f0080835340)
> at ../../src/master/master.cpp:8980
> #8  0x7f00a0a18b53 in mesos::internal::master::Slave::Slave 
> (this=0x7f007c830280, _master=Unhandled dwarf expression opcode 0xf3
> )
> at ../../src/master/master.cpp:8947
> #9  0x7f00a0a19c57 in mesos::internal::master::Master::_reregisterSlave 
> (this=0x7f00990bf000,
> slaveInfo=..., pid=..., checkpointedResources=Unhandled dwarf expression 
> opcode 0xf3
> ) at ../../src/master/master.cpp:5759
> #10 0x7f00a0a1cb22 in operator() (__functor=Unhandled dwarf expression 
> opcode 0xf3
> )
> at ../../3rdparty/libprocess/include/process/dispatch.hpp:229
> #11 std::_Function_handler process::dispatch(const process::PID&, void (T::*)(P0, P1, P2, P3, P4, P5, 
> P6, P7, P8, P9), A0, A1, A2, A3, A4, A5, A6, A7, A8, A9) [with T = 
> mesos::internal::master::Master; P0 = const mesos::SlaveInfo&; P1 = const 
> process::UPID&; P2 = const std::vector&; P3 = const 
> std::vector&; P4 = const std::vector&; P5 = 
> const std::vector&; P6 = const 
> std::vector&; P7 = const 
> std::basic_string&; P8 = const 
> std::vector&; P9 = const process::Future&; 
> A0 = mesos::SlaveInfo; A1 = process::UPID; A2 = std::vector; 
> A3 = std::vector; A4 = std::vector; A5 = 
> std::vector; A6 = 
> std::vector; A7 = 
> std::basic_string; A8 = std::vector; A9 = 
> process::Future]:: >::_M_invoke(const 
> std::_Any_data &, process::ProcessBase *) (
> __functor=Unhandled dwarf expression opcode 0xf3
> {noformat}
> I expect that this happened because the master moved to the latest version 
> before all the agents had moved.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7079) Consider supporting CNI version 0.3.

2017-02-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7079:
--
Summary: Consider supporting CNI version 0.3.  (was: Support CNI version 
0.3.)

> Consider supporting CNI version 0.3.
> 
>
> Key: MESOS-7079
> URL: https://issues.apache.org/jira/browse/MESOS-7079
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> The CNI isolator currently only supports v0.2.
> CNI spec v0.3 added support for:
> 1) configuration lists
> https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration-lists
> 2) interfaces in Results
> https://github.com/containernetworking/cni/blob/master/SPEC.md#result
> We should try to support CNI v0.3 in the CNI isolator.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7079) Consider supporting CNI version 0.3.

2017-02-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7079:
--
Issue Type: Improvement  (was: Bug)

> Consider supporting CNI version 0.3.
> 
>
> Key: MESOS-7079
> URL: https://issues.apache.org/jira/browse/MESOS-7079
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>
> The CNI isolator currently only supports v0.2.
> CNI spec v0.3 added support for:
> 1) configuration lists
> https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration-lists
> 2) interfaces in Results
> https://github.com/containernetworking/cni/blob/master/SPEC.md#result
> We should try to support CNI v0.3 in the CNI isolator.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7075) mesos-execute rejects all offers

2017-02-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7075:
--
Shepherd: Anand Mazumdar

> mesos-execute rejects all offers
> 
>
> Key: MESOS-7075
> URL: https://issues.apache.org/jira/browse/MESOS-7075
> Project: Mesos
>  Issue Type: Bug
>  Components: framework
>Affects Versions: 1.2.0
>Reporter: Gastón Kleiman
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: resources
> Fix For: 1.2.0
>
>
> Mesos now includes {{Resource.AllocationInfo}} in the resources sent in an 
> offer.
> A {{Resources}} instance without {{Resource.AllocationInfo}} will not be 
> contained in one that has it set. The subtraction operator will also treat 
> those instances differently.
> This makes {{mesos-execute}} reject all offers.
> We need to update {{mesos-execute}} and probably other C++ frameworks in our 
> repo that use the Resources class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7095) Basic make check from getting started link fails

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859775#comment-15859775
 ] 

Anand Mazumdar edited comment on MESOS-7095 at 2/9/17 4:47 PM:
---

[~AlecBr] Thanks for reporting this. For triaging, can you add more details 
about the platform you were trying to build on? Also, you can enclose the stack 
trace in \{code\}\{code\} blocks.


was (Author: anandmazumdar):
[~AlecBr] Thanks for reporting this. For triaging, can you add more details 
about the platform you were trying to build on? Also, you can enclose the stack 
trace in {code}..{code} blocks.

> Basic make check from getting started link fails
> 
>
> Key: MESOS-7095
> URL: https://issues.apache.org/jira/browse/MESOS-7095
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Alec Bruns
>
> *** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are 
> using GNU date ***
> PC: @ 0x1080b7367 apr_pool_create_ex
> *** SIGSEGV (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: ***
> @ 0x7fffb50c7bba _sigtramp
> @ 0x72c0517 (unknown)
> @ 0x107eaa13a svn_pool_create_ex
> @ 0x107691d6e svn::diff()
> @ 0x107691042 SVNTest_DiffPatch_Test::TestBody()
> @ 0x1077026ba testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076b3ad7 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076b3985 testing::Test::Run()
> @ 0x1076b54f8 testing::TestInfo::Run()
> @ 0x1076b6867 testing::TestCase::Run()
> @ 0x1076c65dc testing::internal::UnitTestImpl::RunAllTests()
> @ 0x1077033da testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076c6007 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076c5ed8 testing::UnitTest::Run()
> @ 0x1074d55c1 RUN_ALL_TESTS()
> @ 0x1074d5580 main
> @ 0x7fffb4eba255 start
> make[6]: *** [check-local] Segmentation fault: 11
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7095) Basic make check from getting started link fails

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859775#comment-15859775
 ] 

Anand Mazumdar commented on MESOS-7095:
---

[~AlecBr] Thanks for reporting this. For triaging, can you add more details 
about the platform you were trying to build on? Also, you can enclose the stack 
trace in {code}..{code} blocks.

> Basic make check from getting started link fails
> 
>
> Key: MESOS-7095
> URL: https://issues.apache.org/jira/browse/MESOS-7095
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Alec Bruns
>
> *** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are 
> using GNU date ***
> PC: @ 0x1080b7367 apr_pool_create_ex
> *** SIGSEGV (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: ***
> @ 0x7fffb50c7bba _sigtramp
> @ 0x72c0517 (unknown)
> @ 0x107eaa13a svn_pool_create_ex
> @ 0x107691d6e svn::diff()
> @ 0x107691042 SVNTest_DiffPatch_Test::TestBody()
> @ 0x1077026ba testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076b3ad7 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076b3985 testing::Test::Run()
> @ 0x1076b54f8 testing::TestInfo::Run()
> @ 0x1076b6867 testing::TestCase::Run()
> @ 0x1076c65dc testing::internal::UnitTestImpl::RunAllTests()
> @ 0x1077033da testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076c6007 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076c5ed8 testing::UnitTest::Run()
> @ 0x1074d55c1 RUN_ALL_TESTS()
> @ 0x1074d5580 main
> @ 0x7fffb4eba255 start
> make[6]: *** [check-local] Segmentation fault: 11
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7074) port_mapping isolator: do not depend on /sys/class/net//speed

2017-02-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-7074:
-

Assignee: Pierre Cheynier

> port_mapping isolator: do not depend on /sys/class/net//speed
> -
>
> Key: MESOS-7074
> URL: https://issues.apache.org/jira/browse/MESOS-7074
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Pierre Cheynier
>Assignee: Pierre Cheynier
>Priority: Minor
>
> I tried to use the network/port_mapping isolator and faced this issue: 
> {{/sys/class/net//speed}} is unreadable because it is not set by the 
> underlying network driver:
> * on virtualized environments (KVM, VirtualBox, AMIs, etc.)
> * on CentOS 7 with the teaming driver enabled, it doesn't report the speed.
> {noformat}
> # cat /sys/class/net/team0/speed 
> cat: /sys/class/net/team0/speed: Invalid argument
> {noformat}
> Here is a pointer to the code blocks that perform those tests: 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/slave/containerizer/mesos/isolators/network/port_mapping.cpp;h=f6f2bfe1d5d3c113036ad44a480f97bbb462a269;hb=HEAD#l1588
> In my opinion, to solve most of these use cases, we could make those checks 
> non-blocking, i.e., replace the errors with log warnings.
> It could result in an inconsistency between the interface speed and the 
> configured bw_limit, but I think there is more benefit in making it usable in 
> more environments.
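
A minimal sketch of the suggested non-blocking behaviour (names are 
illustrative; the real isolator reads the file via its own helpers): treat an 
unreadable or unset speed as unknown and log a warning instead of failing.

{code}
#include <cstdint>
#include <fstream>
#include <string>

#include <glog/logging.h>

#include <stout/none.hpp>
#include <stout/option.hpp>

// Reads /sys/class/net/<interface>/speed, returning None() instead of an
// error when the driver does not report a speed (virtualized NICs, teaming).
Option<uint64_t> linkSpeed(const std::string& interface)
{
  const std::string path = "/sys/class/net/" + interface + "/speed";

  std::ifstream file(path);
  uint64_t speed = 0;

  if (!file.is_open() || !(file >> speed)) {
    LOG(WARNING) << "Cannot determine link speed from '" << path << "'; "
                 << "falling back to the configured egress limits";
    return None();
  }

  return speed; // In Mbit/s, as reported by the kernel.
}
{code}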



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7074) port_mapping isolator: do not depend on /sys/class/net//speed

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859781#comment-15859781
 ] 

Anand Mazumdar commented on MESOS-7074:
---

[~jieyu] I took the liberty of adding you as the shepherd here as the review 
was addressed to you.

> port_mapping isolator: do not depend on /sys/class/net//speed
> -
>
> Key: MESOS-7074
> URL: https://issues.apache.org/jira/browse/MESOS-7074
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Pierre Cheynier
>Assignee: Pierre Cheynier
>Priority: Minor
>
> I tried to use the network/port_mapping isolator and faced this issue: 
> {{/sys/class/net//speed}} is unreadable because it is not set by the 
> underlying network driver:
> * on virtualized environments (KVM, VirtualBox, AMIs, etc.)
> * on CentOS 7 with the teaming driver enabled, it doesn't report the speed.
> {noformat}
> # cat /sys/class/net/team0/speed 
> cat: /sys/class/net/team0/speed: Invalid argument
> {noformat}
> Here is a pointer to the code blocks that perform those tests: 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/slave/containerizer/mesos/isolators/network/port_mapping.cpp;h=f6f2bfe1d5d3c113036ad44a480f97bbb462a269;hb=HEAD#l1588
> In my opinion, to solve most of these use cases, we could make those checks 
> non-blocking, i.e., replace the errors with log warnings.
> It could result in an inconsistency between the interface speed and the 
> configured bw_limit, but I think there is more benefit in making it usable in 
> more environments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7074) port_mapping isolator: do not depend on /sys/class/net//speed

2017-02-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7074:
--
Shepherd: Jie Yu

> port_mapping isolator: do not depend on /sys/class/net//speed
> -
>
> Key: MESOS-7074
> URL: https://issues.apache.org/jira/browse/MESOS-7074
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Pierre Cheynier
>Assignee: Pierre Cheynier
>Priority: Minor
>
> I tried to use the network/port_mapping isolator and faced this issue: 
> {{/sys/class/net//speed}} is unreadable because it is not set by the 
> underlying network driver:
> * on virtualized environments (KVM, VirtualBox, AMIs, etc.)
> * on CentOS 7 with the teaming driver enabled, it doesn't report the speed.
> {noformat}
> # cat /sys/class/net/team0/speed 
> cat: /sys/class/net/team0/speed: Invalid argument
> {noformat}
> Here is a pointer to the code blocks that perform those tests: 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/slave/containerizer/mesos/isolators/network/port_mapping.cpp;h=f6f2bfe1d5d3c113036ad44a480f97bbb462a269;hb=HEAD#l1588
> In my opinion, to solve most of these use cases, we could make those checks 
> non-blocking, i.e., replace the errors with log warnings.
> It could result in an inconsistency between the interface speed and the 
> configured bw_limit, but I think there is more benefit in making it usable in 
> more environments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7094) Slave not displaying correctly in the Mesos Web UI

2017-02-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7094:
--
Summary: Slave not displaying correctly in the Mesos Web UI  (was: Slave 
not displaying correctly in Mesos Web Ui)

> Slave not displaying correctly in the Mesos Web UI
> --
>
> Key: MESOS-7094
> URL: https://issues.apache.org/jira/browse/MESOS-7094
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.1
>Reporter: DuMont DevOps
>Priority: Minor
>  Labels: webui
> Attachments: mesos-slaves-chrome-console.png, mesos-webui.png
>
>
> We're currently experiencing issues with the Mesos web UI.
> We recently added 2 new nodes to the cluster, which are now shown in the 
> Mesos web UI.
> If we click on one of the new nodes, we get to the slave overview. Instead 
> of showing the slave's stats, we see stats for the master node (see attached 
> picture).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7094) Slave not displaying correctly in the Mesos Web UI

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859792#comment-15859792
 ] 

Anand Mazumdar commented on MESOS-7094:
---

[~haosd...@gmail.com] Can you take a look at this user issue when you get a 
chance?

> Slave not displaying correctly in the Mesos Web UI
> --
>
> Key: MESOS-7094
> URL: https://issues.apache.org/jira/browse/MESOS-7094
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.1
>Reporter: DuMont DevOps
>Priority: Minor
>  Labels: webui
> Attachments: mesos-slaves-chrome-console.png, mesos-webui.png
>
>
> We're currently experiencing issues with the Mesos web UI.
> We recently added 2 new nodes to the cluster, which are now shown in the 
> Mesos web UI.
> If we click on one of the new nodes, we get to the slave overview. Instead 
> of showing the slave's stats, we see stats for the master node (see attached 
> picture).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7095) Basic make check from getting started link fails

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859805#comment-15859805
 ] 

Anand Mazumdar commented on MESOS-7095:
---

[~AlecBr] From the stack trace, this is likely due to the incompatibility 
between the system-installed svn and apr headers on Sierra. Can you confirm 
whether you performed the following steps from the Getting Started guide?

{noformat}
# There is an incompatibility with the system installed svn and apr headers.
# We need the svn and apr headers from a brew installation of subversion.
# You may need to unlink the existing version of subversion installed via
# brew in order to configure correctly.
$ brew unlink subversion # (If already installed)
$ brew install subversion

# When configuring, the svn and apr headers from brew will be automatically
# detected, so no need to explicitly point to them. Also,
# `-Wno-deprecated-declarations` is needed to suppress warnings.
$ ../configure CXXFLAGS=-Wno-deprecated-declarations
{noformat}

> Basic make check from getting started link fails
> 
>
> Key: MESOS-7095
> URL: https://issues.apache.org/jira/browse/MESOS-7095
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Alec Bruns
>
> *** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are 
> using GNU date ***
> PC: @ 0x1080b7367 apr_pool_create_ex
> *** SIGSEGV (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: ***
> @ 0x7fffb50c7bba _sigtramp
> @ 0x72c0517 (unknown)
> @ 0x107eaa13a svn_pool_create_ex
> @ 0x107691d6e svn::diff()
> @ 0x107691042 SVNTest_DiffPatch_Test::TestBody()
> @ 0x1077026ba testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076b3ad7 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076b3985 testing::Test::Run()
> @ 0x1076b54f8 testing::TestInfo::Run()
> @ 0x1076b6867 testing::TestCase::Run()
> @ 0x1076c65dc testing::internal::UnitTestImpl::RunAllTests()
> @ 0x1077033da testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076c6007 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076c5ed8 testing::UnitTest::Run()
> @ 0x1074d55c1 RUN_ALL_TESTS()
> @ 0x1074d5580 main
> @ 0x7fffb4eba255 start
> make[6]: *** [check-local] Segmentation fault: 11
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6913) AgentAPIStreamingTest.AttachInputToNestedContainerSession fails on Mac OS.

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860339#comment-15860339
 ] 

Anand Mazumdar commented on MESOS-6913:
---

^^ This is pretty bizarre. It looks like the command executor itself exited 
with status 1 upon launch in this case. This doesn't look related to the 
{{loop}} stack smashing issue since that comes into play later in the test 
lifecycle. I'll keep digging.

> AgentAPIStreamingTest.AttachInputToNestedContainerSession fails on Mac OS.
> --
>
> Key: MESOS-6913
> URL: https://issues.apache.org/jira/browse/MESOS-6913
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Mac OS 10.11.6 with Apple clang-703.0.31
>Reporter: Alexander Rukletsov
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.2.0
>
>
> {noformat}
> [ RUN  ] 
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> make[3]: *** [check-local] Illegal instruction: 4
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7100) Missing AGENT_REMOVED event in event stream

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860442#comment-15860442
 ] 

Anand Mazumdar commented on MESOS-7100:
---

Thanks for reporting the issue.

{{AGENT_REMOVED}} is only sent when the agent is explicitly shut down for 
[maintenance|http://mesos.apache.org/documentation/latest/maintenance] or when 
it is explicitly shut down using the {{SIGUSR1}} signal.

As you alluded to, we want to introduce the following events in the future:
- {{AGENT_UNREACHABLE}}: Sent when the agent gets partitioned away from the 
master. In the example you mentioned above, the event would be sent by the 
master when the agent fails health checks. (Default: 75 seconds)
- {{AGENT_UPDATED}}: Sent when the partitioned agent re-registers with the 
master.


> Missing AGENT_REMOVED event in event stream
> ---
>
> Key: MESOS-7100
> URL: https://issues.apache.org/jira/browse/MESOS-7100
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.1.0
>Reporter: Haralds Ulmanis
>Priority: Minor
>
> I'm playing with the event stream via HTTP endpoints.
> So far I've received all of the events - SUBSCRIBED, TASK_ADDED, TASK_UPDATED, 
> AGENT_ADDED - except AGENT_REMOVED.
> What I do:
> Just stop the agent or terminate the server (if in the cloud).
> What I expect:
> Once the agent disappears from the agent list (in the Mesos UI), to get an 
> AGENT_REMOVED event.
> Not sure about the internals; maybe that is not the correct event and agents 
> only get removed after some period if they do not come back up. But in general, 
> some event indicating that an agent went offline and is not available would be 
> good.
> If AGENT_REMOVED and AGENT_ADDED are kind of one-time events, maybe something 
> like AGENT_CONNECTED/RECONNECTED and AGENT_LEAVE events would be great.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7100) Missing AGENT_REMOVED event in event stream

2017-02-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7100:
--
Target Version/s: 1.3.0
  Labels: mesosphere  (was: )

> Missing AGENT_REMOVED event in event stream
> ---
>
> Key: MESOS-7100
> URL: https://issues.apache.org/jira/browse/MESOS-7100
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.1.0
>Reporter: Haralds Ulmanis
>Priority: Minor
>  Labels: mesosphere
>
> I'm playing with the event stream via HTTP endpoints.
> So far I've received all of the events - SUBSCRIBED, TASK_ADDED, TASK_UPDATED, 
> AGENT_ADDED - except AGENT_REMOVED.
> What I do:
> Just stop the agent or terminate the server (if in the cloud).
> What I expect:
> Once the agent disappears from the agent list (in the Mesos UI), to get an 
> AGENT_REMOVED event.
> Not sure about the internals; maybe that is not the correct event and agents 
> only get removed after some period if they do not come back up. But in general, 
> some event indicating that an agent went offline and is not available would be 
> good.
> If AGENT_REMOVED and AGENT_ADDED are kind of one-time events, maybe something 
> like AGENT_CONNECTED/RECONNECTED and AGENT_LEAVE events would be great.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-09 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7102:
-

 Summary: Crash when sending a SIGUSR1 signal to the agent.
 Key: MESOS-7102
 URL: https://issues.apache.org/jira/browse/MESOS-7102
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.2.0
 Environment: ubuntu 16.04
Reporter: Anand Mazumdar
Priority: Critical


Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a regression; 
it used to work fine in the 1.1 release.

Steps to reproduce:
- Start the agent.
- Send it a {{SIGUSR1}} signal.

The agent should crash with a stack trace similar to this:
{noformat}
I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from user 
gmann; unregistering and shutting down
I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
*** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
using GNU date ***
PC: @ 0x7fffbc4904fc _pthread_key_global_init
*** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: ***
@ 0x7fffbc488bba _sigtramp
@ 0x7fe8a5d03f38 (unknown)
@0x10b6d67d9 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
@0x10b6d67b8 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
@0x10b6d6889 Synchronized<>::Synchronized()
@0x10b6d678d Synchronized<>::Synchronized()
@0x10b6a708a synchronize<>()
@0x10e2f148d process::ProcessManager::wait()
@0x10e2e9a78 process::wait()
@0x10b30614f process::wait()
@0x10c9619dc 
mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
@0x10c961a55 
mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
@0x10b1ab035 main
@ 0x7fffbc27b255 start
[1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7102:
--
Description: 
Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a regression; 
it used to work fine in the 1.1 release. Note that the agent does unregister 
with the master, and the crash happens after that.

Steps to reproduce:
- Start the agent.
- Send it a {{SIGUSR1}} signal.

The agent should crash with a stack trace similar to this:
{noformat}
I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from user 
gmann; unregistering and shutting down
I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
*** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
using GNU date ***
PC: @ 0x7fffbc4904fc _pthread_key_global_init
*** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: ***
@ 0x7fffbc488bba _sigtramp
@ 0x7fe8a5d03f38 (unknown)
@0x10b6d67d9 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
@0x10b6d67b8 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
@0x10b6d6889 Synchronized<>::Synchronized()
@0x10b6d678d Synchronized<>::Synchronized()
@0x10b6a708a synchronize<>()
@0x10e2f148d process::ProcessManager::wait()
@0x10e2e9a78 process::wait()
@0x10b30614f process::wait()
@0x10c9619dc 
mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
@0x10c961a55 
mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
@0x10b1ab035 main
@ 0x7fffbc27b255 start
[1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
{noformat}

  was:
Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a regression; 
it used to work fine in the 1.1 release.

Steps to reproduce:
- Start the agent.
- Send it a {{SIGUSR1}} signal.

The agent should crash with a stack trace similar to this:
{noformat}
I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from user 
gmann; unregistering and shutting down
I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
*** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
using GNU date ***
PC: @ 0x7fffbc4904fc _pthread_key_global_init
*** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: ***
@ 0x7fffbc488bba _sigtramp
@ 0x7fe8a5d03f38 (unknown)
@0x10b6d67d9 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
@0x10b6d67b8 
_ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
@0x10b6d6889 Synchronized<>::Synchronized()
@0x10b6d678d Synchronized<>::Synchronized()
@0x10b6a708a synchronize<>()
@0x10e2f148d process::ProcessManager::wait()
@0x10e2e9a78 process::wait()
@0x10b30614f process::wait()
@0x10c9619dc 
mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
@0x10c961a55 
mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
@0x10b1ab035 main
@ 0x7fffbc27b255 start
[1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
{noformat}


> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeI

[jira] [Commented] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860487#comment-15860487
 ] 

Anand Mazumdar commented on MESOS-7102:
---

The root cause might be that we don't delete the {{StatusUpdateManager}} actor 
before invoking {{finalize}} on the agent. 
https://github.com/apache/mesos/blob/4844353847657e9449de433172905a8659033d0e/src/slave/main.cpp#L446-L465

Note that this would also happen when the master sends the agent an explicit 
shutdown message.
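
As a loose analogy (pure Python, nothing to do with libprocess internals; all 
names below are invented): cleaning up a component that still needs the shared 
runtime *after* that runtime has been finalized fails, which mirrors the 
{{~StatusUpdateManager()}}-after-finalize ordering in the trace above.
{code}
# Pure-Python analogy, not Mesos/libprocess code: a member whose cleanup
# still talks to a shared runtime must be torn down *before* that runtime
# is finalized; the reverse order blows up, mirroring ~StatusUpdateManager()
# running after the agent's finalize step.
from concurrent.futures import ThreadPoolExecutor

runtime = ThreadPoolExecutor(max_workers=1)  # stands in for the libprocess runtime


class Manager:
    """Stands in for a component that owns an actor (all names invented)."""

    def close(self):
        # Cleanup still needs the runtime, much like process::wait() does.
        runtime.submit(lambda: None).result()


if __name__ == "__main__":
    manager = Manager()
    runtime.shutdown(wait=True)  # "finalize" the runtime first...
    try:
        manager.close()          # ...then tear down the member: too late.
    except RuntimeError as exc:
        # The fix is the reverse order: close the manager, then the runtime.
        print("cleanup after finalize fails:", exc)
{code}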

cc: [~kaysoky]

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-7102:
-

Assignee: Anand Mazumdar

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861444#comment-15861444
 ] 

Anand Mazumdar commented on MESOS-2369:
---

Thanks [~bmerry] for the reproduction steps. I am assigning this to myself to 
carry out further root cause analysis.

> Segfault when mesos-slave tries to clean up docker containers on startup
> 
>
> Key: MESOS-2369
> URL: https://issues.apache.org/jira/browse/MESOS-2369
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.1, 1.2.0, 1.3.0
> Environment: Debian Jessie, mesos package 0.21.1-1.2.debian77 
> docker 1.3.2 build 39fa2fa
>Reporter: Pas
>
> I did a gdb backtrace; it seems like a stack overflow due to a bit too much 
> recursion.
> The interesting aspect is that after running mesos-slave with strace -f -b 
> execve, it successfully proceeded with the docker cleanup. However, there were 
> a few strace sessions (on other slaves) where I was able to observe the 
> SIGSEGV, and it was around (or a bit before) the "docker ps -a" call, because 
> docker got a broken pipe shortly afterwards and then got killed by the 
> propagating SIGSEGV signal.
> {code}
> 
> #59296 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59297 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59298 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59299 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59300 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59301 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59302 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59303 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59304 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59305 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59306 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59307 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59308 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59309 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59310 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59311 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long

[jira] [Assigned] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-2369:
-

Assignee: Anand Mazumdar

> Segfault when mesos-slave tries to clean up docker containers on startup
> 
>
> Key: MESOS-2369
> URL: https://issues.apache.org/jira/browse/MESOS-2369
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.1, 1.2.0, 1.3.0
> Environment: Debian Jessie, mesos package 0.21.1-1.2.debian77 
> docker 1.3.2 build 39fa2fa
>Reporter: Pas
>Assignee: Anand Mazumdar
>
> I did a gdb backtrace; it seems like a stack overflow due to a bit too much 
> recursion.
> The interesting aspect is that after running mesos-slave with strace -f -b 
> execve, it successfully proceeded with the docker cleanup. However, there were 
> a few strace sessions (on other slaves) where I was able to observe the 
> SIGSEGV, and it was around (or a bit before) the "docker ps -a" call, because 
> docker got a broken pipe shortly afterwards and then got killed by the 
> propagating SIGSEGV signal.
> {code}
> 
> #59296 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59297 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59298 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59299 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59300 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59301 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59302 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59303 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59304 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59305 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59306 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59307 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59308 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59309 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59310 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59311 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0

[jira] [Updated] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7102:
--
Shepherd: Joseph Wu
  Sprint: Mesosphere Sprint 51
Story Points: 2

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.3.0
>
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7102:
--
Target Version/s: 1.2.1

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.3.0
>
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862082#comment-15862082
 ] 

Anand Mazumdar commented on MESOS-2369:
---

[~bmerry] I gave it a try on a couple of Ubuntu 16.04 VMs but couldn't 
reproduce it with the steps you mentioned (this was with {{ulimit -s 4096}}). 
Would it be possible for you to give us a stack trace to help us debug the 
issue further, or could you double-check whether anything is missing from the 
steps to reproduce?

[~bbannier] managed to reproduce the issue on {{HEAD}} today, but the stack 
trace turned out to be the one from MESOS-7102, which has already been fixed.
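
For context on the stack-overflow theory: the repeated {{_read}}/{{__read}}/{{then}} 
frames in the reported backtrace suggest one callback frame per chunk read. Below 
is a tiny Python sketch (unrelated to the actual libmesos code; the chunk count 
and function names are made up) of why that pattern exhausts a bounded stack 
while an iterative loop does not:
{code}
# Illustrative sketch only, not libmesos code: recursing once per chunk makes
# stack depth grow with the size of the read, so a very long read (e.g. huge
# `docker ps -a` output) can overflow a bounded stack; a loop keeps the
# depth constant.
CHUNKS = 50_000  # made-up number of chunks in a very long read


def read_recursively(remaining):
    """One stack frame per chunk: depth grows with the input."""
    if remaining == 0:
        return
    read_recursively(remaining - 1)


def read_iteratively(remaining):
    """Constant stack depth regardless of the input size."""
    while remaining:
        remaining -= 1


if __name__ == "__main__":
    read_iteratively(CHUNKS)      # fine
    try:
        read_recursively(CHUNKS)  # exceeds Python's default recursion limit
    except RecursionError as exc:
        print("per-chunk recursion blows the stack:", exc)
{code}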


> Segfault when mesos-slave tries to clean up docker containers on startup
> 
>
> Key: MESOS-2369
> URL: https://issues.apache.org/jira/browse/MESOS-2369
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.1, 1.2.0, 1.3.0
> Environment: Debian Jessie, mesos package 0.21.1-1.2.debian77 
> docker 1.3.2 build 39fa2fa
>Reporter: Pas
>Assignee: Anand Mazumdar
>
> I did a gdb backtrace; it seems like a stack overflow due to a bit too much 
> recursion.
> The interesting aspect is that after running mesos-slave with strace -f -b 
> execve, it successfully proceeded with the docker cleanup. However, there were 
> a few strace sessions (on other slaves) where I was able to observe the 
> SIGSEGV, and it was around (or a bit before) the "docker ps -a" call, because 
> docker got a broken pipe shortly afterwards and then got killed by the 
> propagating SIGSEGV signal.
> {code}
> 
> #59296 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59297 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59298 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59299 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59300 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59301 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59302 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59303 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59304 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59305 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59306 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59307 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59308 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59309 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59310 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr

[jira] [Updated] (MESOS-7057) Consider using the relink functionality of libprocess in the executor driver.

2017-02-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7057:
--
Target Version/s: 1.1.2, 1.3.0, 1.2.1  (was: 1.1.2, 1.3.0)

> Consider using the relink functionality of libprocess in the executor driver.
> -
>
> Key: MESOS-7057
> URL: https://issues.apache.org/jira/browse/MESOS-7057
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an 
> iptables firewall to terminate an idle connection after a timeout (the 
> default is 5 days). Once this happens, the executor driver is not notified of 
> the disconnection. It keeps thinking that it is still connected to the 
> agent.
> When the agent process is restarted, the executor still tries to re-use the 
> old broken connection to send the re-register message to the agent. This is 
> when it eventually realizes that the connection is broken (due to the nature 
> of TCP); it then calls the {{exited}} callback and commits suicide after the 
> 15-minute recovery timeout.
> To address this, an executor should always {{relink}} when it receives a 
> reconnect request from the agent.
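
To make the proposed behaviour concrete, here is a minimal Python sketch of the 
idea (this is not the executor driver; the address and message below are 
placeholders): on a reconnect request, drop the old, possibly firewall-killed 
socket and dial a fresh connection before re-registering, instead of trusting 
the idle link.
{code}
# Illustrative sketch only, not the Mesos executor driver. The address and
# message contents are placeholders.
import socket
from typing import Optional

AGENT_ADDR = ("127.0.0.1", 5051)  # placeholder agent address


def handle_reconnect_request(old_sock: Optional[socket.socket]) -> socket.socket:
    """Relink: never reuse the existing link once the agent asks us to reconnect."""
    if old_sock is not None:
        try:
            old_sock.close()  # the idle connection may already be dead
        except OSError:
            pass
    new_sock = socket.create_connection(AGENT_ADDR, timeout=10)  # fresh link
    new_sock.sendall(b"RE-REGISTER\n")  # placeholder for the real message
    return new_sock
{code}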



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7057) Consider using the relink functionality of libprocess in the executor driver.

2017-02-11 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7057:
--
  Sprint: Mesosphere Sprint 51
Story Points: 2

> Consider using the relink functionality of libprocess in the executor driver.
> -
>
> Key: MESOS-7057
> URL: https://issues.apache.org/jira/browse/MESOS-7057
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
> Fix For: 1.3.0
>
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an 
> iptables firewall to terminate an idle connection after a timeout (the 
> default is 5 days). Once this happens, the executor driver is not notified of 
> the disconnection. It keeps thinking that it is still connected to the 
> agent.
> When the agent process is restarted, the executor still tries to re-use the 
> old broken connection to send the re-register message to the agent. This is 
> when it eventually realizes that the connection is broken (due to the nature 
> of TCP); it then calls the {{exited}} callback and commits suicide after the 
> 15-minute recovery timeout.
> To address this, an executor should always {{relink}} when it receives a 
> reconnect request from the agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7119:
--
Attachment: crash-log-master.gz

> Mesos master crash while accepting inverse offer.
> -
>
> Key: MESOS-7119
> URL: https://issues.apache.org/jira/browse/MESOS-7119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Priority: Critical
> Attachments: crash-log-master.gz
>
>
> We noticed a Mesos master invariant check failing leading to a crash while 
> accepting an inverse offer.
> {noformat}
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 
> offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
> /master/api/v1/scheduler from 10.10.0.68:41428
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring 
> accept of inverse offer 01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: F0211 17:00:41.584259 27369 master.cpp:4605] 
> CHECK_SOME(slaveId): is NONE
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: *** Check failure stack trace: ***
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da91ad  google::LogMessage::Fail()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0daafdd  google::LogMessage::SendToLog()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da8d9c  google::LogMessage::Flush()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0dab8d9  
> google::LogMessageFatal::~LogMessageFatal()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af005d4a9  _CheckFatal::~_CheckFatal()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0285235  
> mesos::internal::ma

[jira] [Created] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-12 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7119:
-

 Summary: Mesos master crash while accepting inverse offer.
 Key: MESOS-7119
 URL: https://issues.apache.org/jira/browse/MESOS-7119
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar
Priority: Critical
 Attachments: crash-log-master.gz

We noticed a Mesos master invariant check failing leading to a crash while 
accepting an inverse offer.

{noformat}
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 offers to framework 
2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
/master/api/v1/scheduler from 10.10.0.68:41428
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring accept of inverse offer 
01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
F0211 17:00:41.584259 27369 master.cpp:4605] CHECK_SOME(slaveId): is NONE
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
*** Check failure stack trace: ***
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0da91ad  google::LogMessage::Fail()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0daafdd  google::LogMessage::SendToLog()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0da8d9c  google::LogMessage::Flush()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0dab8d9  google::LogMessageFatal::~LogMessageFatal()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af005d4a9  _CheckFatal::~_CheckFatal()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0285235  mesos::internal::master::Master::acceptInverseOffers()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af01f5bc9  mesos::internal::master::Master::Http::scheduler()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af024aa77  
_ZNSt17_Function_handlerIFN7process6FutureINS0_4h

[jira] [Updated] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7119:
--
Labels: maintenance mesosphere  (was: )

> Mesos master crash while accepting inverse offer.
> -
>
> Key: MESOS-7119
> URL: https://issues.apache.org/jira/browse/MESOS-7119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: maintenance, mesosphere
> Attachments: crash-log-master.gz
>
>
> We noticed a Mesos master invariant check failing leading to a crash while 
> accepting an inverse offer.
> {noformat}
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 
> offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
> /master/api/v1/scheduler from 10.10.0.68:41428
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring 
> accept of inverse offer 01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: F0211 17:00:41.584259 27369 master.cpp:4605] 
> CHECK_SOME(slaveId): is NONE
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: *** Check failure stack trace: ***
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da91ad  google::LogMessage::Fail()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0daafdd  google::LogMessage::SendToLog()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da8d9c  google::LogMessage::Flush()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0dab8d9  
> google::LogMessageFatal::~LogMessageFatal()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af005d4a9  _CheckFatal::~_CheckFatal()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-mast

[jira] [Updated] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7119:
--
Description: 
We noticed a Mesos master invariant check failing leading to a crash while 
accepting an inverse offer. The {{HEAD}} is: 
{{c7fc1377b33c4eb83a01167bdb53c102c06b9a99}} from Jan 11. 
https://github.com/apache/mesos/commit/c7fc1377b33c4eb83a01167bdb53c102c06b9a99

{noformat}
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 offers to framework 
2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 inverse offers to 
framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
/master/api/v1/scheduler from 10.10.0.68:41428
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring accept of inverse offer 
01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
F0211 17:00:41.584259 27369 master.cpp:4605] CHECK_SOME(slaveId): is NONE
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
*** Check failure stack trace: ***
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0da91ad  google::LogMessage::Fail()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0daafdd  google::LogMessage::SendToLog()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0da8d9c  google::LogMessage::Flush()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0dab8d9  google::LogMessageFatal::~LogMessageFatal()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af005d4a9  _CheckFatal::~_CheckFatal()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af0285235  mesos::internal::master::Master::acceptInverseOffers()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af01f5bc9  mesos::internal::master::Master::Http::scheduler()
Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal mesos-master[27357]: 
@ 0x7f9af024aa77  
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionIS

[jira] [Assigned] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-7119:
-

Assignee: Anand Mazumdar

> Mesos master crash while accepting inverse offer.
> -
>
> Key: MESOS-7119
> URL: https://issues.apache.org/jira/browse/MESOS-7119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: maintenance, mesosphere
> Attachments: crash-log-master.gz
>
>
> We noticed a Mesos master invariant check failing leading to a crash while 
> accepting an inverse offer. The {{HEAD}} is: 
> {{c7fc1377b33c4eb83a01167bdb53c102c06b9a99}} from Jan 11. 
> https://github.com/apache/mesos/commit/c7fc1377b33c4eb83a01167bdb53c102c06b9a99
> {noformat}
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 
> offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
> /master/api/v1/scheduler from 10.10.0.68:41428
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring 
> accept of inverse offer 01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: F0211 17:00:41.584259 27369 master.cpp:4605] 
> CHECK_SOME(slaveId): is NONE
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: *** Check failure stack trace: ***
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da91ad  google::LogMessage::Fail()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0daafdd  google::LogMessage::SendToLog()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da8d9c  google::LogMessage::Flush()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0dab8d9  
> google::LogMessageFatal::~LogMessageFatal()
> Feb 11 17:00:

[jira] [Updated] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7119:
--
Target Version/s: 1.1.2, 1.3.0, 1.2.1

> Mesos master crash while accepting inverse offer.
> -
>
> Key: MESOS-7119
> URL: https://issues.apache.org/jira/browse/MESOS-7119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: maintenance, mesosphere
> Attachments: crash-log-master.gz
>
>
> We noticed a Mesos master invariant check failing leading to a crash while 
> accepting an inverse offer. The {{HEAD}} is : 
> {{c7fc1377b33c4eb83a01167bdb53c102c06b9a99}} from Jan 11. 
> https://github.com/apache/mesos/commit/c7fc1377b33c4eb83a01167bdb53c102c06b9a99
> {noformat}
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 
> offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
> /master/api/v1/scheduler from 10.10.0.68:41428
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring 
> accept of inverse offer 01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: F0211 17:00:41.584259 27369 master.cpp:4605] 
> CHECK_SOME(slaveId): is NONE
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: *** Check failure stack trace: ***
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da91ad  google::LogMessage::Fail()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0daafdd  google::LogMessage::SendToLog()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da8d9c  google::LogMessage::Flush()
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0dab8d9  
> google::LogMessageFatal::~LogMessageFatal()
> Feb 11 

[jira] [Commented] (MESOS-2369) Segfault when mesos-slave tries to clean up docker containers on startup

2017-02-13 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864130#comment-15864130
 ] 

Anand Mazumdar commented on MESOS-2369:
---

Thanks for all the info! Let me try to reproduce the OOM behavior on my end.

> Segfault when mesos-slave tries to clean up docker containers on startup
> 
>
> Key: MESOS-2369
> URL: https://issues.apache.org/jira/browse/MESOS-2369
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.21.1, 1.2.0, 1.3.0
> Environment: Debian Jessie, mesos package 0.21.1-1.2.debian77 
> docker 1.3.2 build 39fa2fa
>Reporter: Pas
>Assignee: Anand Mazumdar
> Attachments: playbook.yml, Vagrantfile
>
>
> I did a gdb backtrace; it looks like a stack overflow due to a bit too much 
> recursion.
> The interesting aspect is that after running mesos-slave with strace -f -b 
> execve, it successfully proceeded with the docker cleanup. However, there were 
> a few strace sessions (on other slaves) where I was able to observe the 
> SIGSEGV, and it happened around (or a bit before) the "docker ps -a" call: 
> docker briefly got a broken pipe and was then killed by the propagating SIGSEGV 
> signal.
> {code}
> 
> #59296 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59297 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59298 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59299 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59300 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59301 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59302 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59303 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59304 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59305 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long)> >::_M_invoke(std::tr1::_Any_data const&, unsigned long 
> const&) () from /usr/local/lib/libmesos-0.21.1.so
> #59306 0x76e7d23b in void process::internal::thenf std::string>(std::tr1::shared_ptr > const&, 
> std::tr1::function (unsigned long const&)> 
> const&, process::Future const&) ()
>from /usr/local/lib/libmesos-0.21.1.so
> #59307 0x7689ee60 in process::Future long>::onAny(std::tr1::function const&)> 
> const&) const () from /usr/local/lib/libmesos-0.21.1.so
> #59308 0x76e7cd98 in process::Future 
> process::Future long>::then(std::tr1::function 
> (unsigned long const&)> const&) const () from 
> /usr/local/lib/libmesos-0.21.1.so
> #59309 0x76e4f5d3 in process::io::internal::_read(int, 
> std::tr1::shared_ptr const&, boost::shared_array const&, 
> unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59310 0x76e5012c in process::io::internal::__read(unsigned long, 
> int, std::tr1::shared_ptr const&, boost::shared_array 
> const&, unsigned long) () from /usr/local/lib/libmesos-0.21.1.so
> #59311 0x76e53000 in 
> std::tr1::_Function_handler (unsigned long 
> const&), std::tr1::_Bind 
> (*(std::tr1::_Placeholder<1>, int, std::tr1::shared_ptr, 
> boost::shared_array, unsigned long))(unsigned long, int, 
> std::tr1::shared_ptr cons

[jira] [Updated] (MESOS-7114) GroupTest.GroupCancelWithDisconnect fails on Mac OS.

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7114:
--
Labels: flaky flaky-test mesosphere  (was: mesosphere)

> GroupTest.GroupCancelWithDisconnect fails on Mac OS.
> 
>
> Key: MESOS-7114
> URL: https://issues.apache.org/jira/browse/MESOS-7114
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: OS X
>Reporter: Benjamin Bannier
>  Labels: flaky, flaky-test, mesosphere
>
> We saw {{GroupTest.GroupCancelWithDisconnect}} fail on a recent OS X in a SSL 
> build in our internal CI recently:
> {code}
> [ RUN  ] GroupTest.GroupCancelWithDisconnect
> I0209 19:22:17.574175 1985630208 zookeeper_test_server.cpp:156] Started 
> ZooKeeperTestServer on port 55440
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@730: Client 
> environment:host.name=Jenkinss-Mac-mini.local
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@737: Client 
> environment:os.name=Darwin
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@738: Client 
> environment:os.arch=15.6.0
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@739: Client 
> environment:os.version=Darwin Kernel Version 15.6.0: Mon Jan  9 23:07:29 PST 
> 2017; root:xnu-3248.60.11.2.1~1/RELEASE_X86_64
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@747: Client 
> environment:user.name=jenkins
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@755: Client 
> environment:user.home=/Users/jenkins
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@log_env@767: Client 
> environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
> 2017-02-09 19:22:17,574:84405(0x70416000):ZOO_INFO@zookeeper_init@800: 
> Initiating client connection, host=127.0.0.1:55440 sessionTimeout=1 
> watcher=0x10eba31e0 sessionId=0 sessionPasswd= context=0x7f8edb6dee50 
> flags=0
> 2017-02-09 19:22:17,574:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
> initiated connection to server [127.0.0.1:55440]
> 2017-02-09 19:22:17,578:84405(0x70cb4000):ZOO_INFO@check_events@1775: 
> session establishment complete on server [127.0.0.1:55440], 
> sessionId=0x15a260af865, negotiated timeout=1
> I0209 19:22:17.578824 3211264 group.cpp:340] Group process 
> (zookeeper-group(44)@10.0.90.182:54133) connected to ZooKeeper
> I0209 19:22:17.578876 3211264 group.cpp:830] Syncing group operations: queue 
> size (joins, cancels, datas) = (1, 0, 0)
> I0209 19:22:17.578893 3211264 group.cpp:418] Trying to create path '/test' in 
> ZooKeeper
> I0209 19:22:17.582217 2674688 group.cpp:699] Trying to get '/test/00' 
> in ZooKeeper
> I0209 19:22:17.582960 1985630208 zookeeper_test_server.cpp:116] Shutting down 
> ZooKeeperTestServer on port 55440
> 2017-02-09 
> 19:22:17,583:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1746: 
> Socket [127.0.0.1:55440] zk retcode=-4, errno=64(Host is down): failed while 
> receiving a server response
> I0209 19:22:17.583799 1601536 group.cpp:451] Lost connection to ZooKeeper, 
> attempting to reconnect ...
> I0209 19:22:17.584373 1601536 group.cpp:656] Trying to remove 
> '/test/00' in ZooKeeper
> 2017-02-09 19:22:17,584:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
> initiated connection to server [127.0.0.1:55440]
> 2017-02-09 
> 19:22:17,585:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1746: 
> Socket [127.0.0.1:55440] zk retcode=-4, errno=64(Host is down): failed while 
> receiving a server response
> I0209 19:22:17.586333 1985630208 zookeeper_test_server.cpp:156] Started 
> ZooKeeperTestServer on port 55440
> 2017-02-09 
> 19:23:05,168:84405(0x70cb4000):ZOO_WARN@zookeeper_interest@1570: Exceeded 
> deadline by 44249ms
> 2017-02-09 19:23:05,196:84405(0x70cb4000):ZOO_INFO@check_events@1728: 
> initiated connection to server [127.0.0.1:55440]
> 2017-02-09 
> 19:23:05,232:84405(0x70cb4000):ZOO_ERROR@handle_socket_error_msg@1764: 
> Socket [127.0.0.1:55440] zk retcode=-112, errno=70(Stale NFS file handle): 
> sessionId=0x15a260af865 has expired.
> I0209 19:23:05.232564 2138112 group.cpp:830] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 1, 0)
> I0209 19:23:05.243890 2138112 group.cpp:656] Trying to remove 
> '/test/00' in ZooKeeper
> W0209 19:23:05.257310 2138112 group.cpp:494] Timed out waiting to connect to 
> ZooKeeper. Forcing ZooKeeper session (sessionId=15a260af865) expiration
> I0209 19:23:05.258072 2138112 group.cpp:510] ZooKeeper session expired
> ../../src/tests/group_tests.cpp:183: Failure
> Fail

[jira] [Commented] (MESOS-7023) IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky

2017-02-13 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864628#comment-15864628
 ] 

Anand Mazumdar commented on MESOS-7023:
---

{noformat}
commit 17394a11e68679992c7bd955d269bf8ae7897200
Author: Anand Mazumdar 
Date:   Mon Feb 13 14:55:46 2017 -0800

Temporarily disabled a consistently failing test.

The issue is tracked via MESOS-7023.
{noformat}

> IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky
> ---
>
> Key: MESOS-7023
> URL: https://issues.apache.org/jira/browse/MESOS-7023
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, test
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, without libevent/SSL
>Reporter: Greg Mann
>Assignee: Kevin Klues
>  Labels: debugging, flaky
> Attachments: IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/containerizer/io_switchboard_tests.cpp:1052: Failure
> Value of: statusFailed->reason()
>   Actual: 1
> Expected: TaskStatus::REASON_IO_SWITCHBOARD_EXITED
> Which is: 27
> {code}
> Find full log attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7123) Investigate splitting offer messages instead of sending a giant single resource offer message.

2017-02-13 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7123:
-

 Summary: Investigate splitting offer messages instead of sending a 
giant single resource offer message.
 Key: MESOS-7123
 URL: https://issues.apache.org/jira/browse/MESOS-7123
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar
Priority: Critical


Currently, the Mesos master batches all the resource offers into a single 
message and then sends it to the scheduler. However, for large clusters this 
can be problematic as this message can exceed the maximum allowed default 
protobuf message size (~64mb). When such a message reaches the scheduler, it's 
dropped with a warning followed by a failed invariant check.

{noformat}
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
was rejected because it was too big (more than 67108864 bytes).  To increase 
the limit (or to disable these warnings), see 
CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stre
am.h.
F0213 21:33:57.658892 60996 sched.cpp:895] Check failed: offers.size() == 
pids.size() (32664 vs. 0)
*** Check failure stack trace: ***
@ 0x7f8d1b4d69bd  (unknown)
@ 0x7f8d1b4d8750  (unknown)
@ 0x7f8d1b4d6582  (unknown)
@ 0x7f8d1b4d90e9  (unknown)
@ 0x7f8d1aaa646c  (unknown)
@ 0x7f8d1aaa7df7  (unknown)
@ 0x7f8d1aa8ee4a  (unknown)
@ 0x7f8d1aa9d109  (unknown)
@ 0x7f8d1b46e4e4  (unknown)
@ 0x7f8d1b46e827  (unknown)
@ 0x7f8e319b0220  (unknown)
@ 0x7f8e3355ddc5  start_thread
@ 0x7f8e32c62ced  __clone
@  (nil)  (unknown)
{noformat}

Possible solutions are for the Mesos master to either batch the offers (e.g., 
100 offers per message) or use a 1:1 mapping, i.e., one offer per message.
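
To make the batching idea concrete, here is a minimal sketch (a hypothetical 
helper, not existing master code) of splitting a list of offers into chunks of 
at most {{maxPerMessage}} entries, one chunk per outgoing message:

{code}
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): split `offers` into chunks of at
// most `maxPerMessage` entries so that each chunk can be sent in its own
// message instead of one giant ResourceOffersMessage. `Offer` stands in for
// the real protobuf type.
template <typename Offer>
std::vector<std::vector<Offer>> chunkOffers(
    const std::vector<Offer>& offers,
    std::size_t maxPerMessage)
{
  std::vector<std::vector<Offer>> chunks;
  for (std::size_t i = 0; i < offers.size(); i += maxPerMessage) {
    const std::size_t end = std::min(offers.size(), i + maxPerMessage);
    chunks.emplace_back(offers.begin() + i, offers.begin() + end);
  }
  return chunks;
}
{code}

The master would then send one message per chunk, keeping every message 
comfortably below the protobuf size limit.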



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7123) Investigate splitting offer messages instead of sending a giant single resource offer message.

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7123:
--
Description: 
Currently, the Mesos master batches all the resource offers into a single 
message and then sends it to the scheduler. However, for large clusters this 
can be problematic as this message can exceed the maximum allowed default 
protobuf message size (~64mb). When such a message reaches the scheduler, it's 
dropped with a warning followed by a failed invariant check.

{noformat}
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
was rejected because it was too big (more than 67108864 bytes).  To increase 
the limit (or to disable these warnings), see 
CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stre
am.h.
F0213 21:33:57.658892 60996 sched.cpp:895] Check failed: offers.size() == 
pids.size() (32664 vs. 0)
*** Check failure stack trace: ***
@ 0x7f8d1b4d69bd  (unknown)
@ 0x7f8d1b4d8750  (unknown)
@ 0x7f8d1b4d6582  (unknown)
@ 0x7f8d1b4d90e9  (unknown)
@ 0x7f8d1aaa646c  (unknown)
@ 0x7f8d1aaa7df7  (unknown)
@ 0x7f8d1aa8ee4a  (unknown)
@ 0x7f8d1aa9d109  (unknown)
@ 0x7f8d1b46e4e4  (unknown)
@ 0x7f8d1b46e827  (unknown)
@ 0x7f8e319b0220  (unknown)
@ 0x7f8e3355ddc5  start_thread
@ 0x7f8e32c62ced  __clone
@  (nil)  (unknown)
{noformat}

Possible solutions are for the Mesos master to either batch the offers (e.g., 
100 offers per message) or use a 1:1 mapping, i.e., one offer per message. The 
batch size could be set via a master flag at startup, with a reasonable default 
value.

  was:
Currently, the Mesos master batches all the resource offers into a single 
message and then sends it to the scheduler. However, for large clusters this 
can be problematic as this message can exceed the maximum allowed default 
protobuf message size (~64mb). When such a message reaches the scheduler, it's 
dropped with a warning followed by a failed invariant check.

{noformat}
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
was rejected because it was too big (more than 67108864 bytes).  To increase 
the limit (or to disable these warnings), see 
CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stre
am.h.
F0213 21:33:57.658892 60996 sched.cpp:895] Check failed: offers.size() == 
pids.size() (32664 vs. 0)
*** Check failure stack trace: ***
@ 0x7f8d1b4d69bd  (unknown)
@ 0x7f8d1b4d8750  (unknown)
@ 0x7f8d1b4d6582  (unknown)
@ 0x7f8d1b4d90e9  (unknown)
@ 0x7f8d1aaa646c  (unknown)
@ 0x7f8d1aaa7df7  (unknown)
@ 0x7f8d1aa8ee4a  (unknown)
@ 0x7f8d1aa9d109  (unknown)
@ 0x7f8d1b46e4e4  (unknown)
@ 0x7f8d1b46e827  (unknown)
@ 0x7f8e319b0220  (unknown)
@ 0x7f8e3355ddc5  start_thread
@ 0x7f8e32c62ced  __clone
@  (nil)  (unknown)
{noformat}

Possible solutions are for the Mesos master to either batch the offers (e.g., 
100 offers per message) or use a 1:1 mapping, i.e., one offer per message.


> Investigate splitting offer messages instead of sending a giant single 
> resource offer message.
> --
>
> Key: MESOS-7123
> URL: https://issues.apache.org/jira/browse/MESOS-7123
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> Currently, the Mesos master batches all the resource offers into a single 
> message and then sends it to the scheduler. However, for large clusters this 
> can be problematic as this message can exceed the maximum allowed default 
> protobuf message size (~64mb). When such a message reaches the scheduler, 
> it's dropped with a warning followed by a failed invariant check.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stre
> am.h.
> F0213 21:33:57.658892 60996 sched.cpp:895] Check failed: offers.size() == 
> pids.size() (32664 vs. 0)
> *** Check failure stack trace: ***
> @ 0x7f8d1b4d69bd  (unknown)
> @ 0x7f8d1b4d8750  (unknown)
> @ 0x7f8d1b4d6582  (unknown)
> @ 0x7f8d1b4d90e9  (unknown)
> @ 0x7f8d1aaa646c  (unknown)
> @ 0x7f8d1aaa7df7  (unknown)
> @ 0x7f8d1aa8ee4a  (unknown)
> @ 0x7f8d1aa9d109  (unknown)
> @ 0x7f8d1b46e4e4  (unknown)
> @ 0x7f8d1b46e827  (unknown)
> @ 0x7f8e319b0220  (unknown)
> @ 0x7f8e3355ddc5  st

[jira] [Updated] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6784:
--
Target Version/s: 1.3.0  (was: 1.2.0)

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Priority: Critical
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c0f9b1 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c0a9f2 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1bf18ee testing::UnitTest::Run()
> @  0x11bc9e3 RUN_ALL_TESTS()
> @  0x11bc599 main
> @ 0x7faece663b15 __libc_start_main
> @   0xa9c219 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-6784:
-

Assignee: (was: Anand Mazumdar)

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c0f9b1 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c0a9f2 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1bf18ee testing::UnitTest::Run()
> @  0x11bc9e3 RUN_ALL_TESTS()
> @  0x11bc599 main
> @ 0x7faece663b15 __libc_start_main
> @   0xa9c219 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2017-02-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6784:
--
Priority: Critical  (was: Major)

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Priority: Critical
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c0f9b1 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c0a9f2 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1bf18ee testing::UnitTest::Run()
> @  0x11bc9e3 RUN_ALL_TESTS()
> @  0x11bc599 main
> @ 0x7faece663b15 __libc_start_main
> @   0xa9c219 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7126) configure fails (without flags) on CentOS 7.3

2017-02-14 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866199#comment-15866199
 ] 

Anand Mazumdar commented on MESOS-7126:
---

[~rharnasch] As per instructions on the [Getting 
Started|https://mesos.apache.org/gettingstarted/] page, did you install the 
{{cyrus-sasl-devel cyrus-sasl-md5}} packages?

> configure fails (without flags) on CentOS 7.3
> -
>
> Key: MESOS-7126
> URL: https://issues.apache.org/jira/browse/MESOS-7126
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: CentOS 7.3
> Linux Kernel 3.10.0
>Reporter: Raul Harnasch
>
> ../configure produces CRAM-MD5 error:
> {quote}
> ' We need CRAM-MD5 support for SASL authentication. '
> {quote}
> libnl3-devel has been installed as per 
> https://issues.apache.org/jira/browse/MESOS-6649 (even though I am not 
> running configure with network-isolator)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7125) ./configure does not run ./config.status

2017-02-14 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7125:
--
Component/s: cmake

> ./configure does not run ./config.status
> 
>
> Key: MESOS-7125
> URL: https://issues.apache.org/jira/browse/MESOS-7125
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake, general
>Affects Versions: 1.3.0
>Reporter: Will Rouesnel
>Priority: Minor
>
> When building a fresh checkout with make, ./bootstrap && ./configure && make 
> will fail because ./config.status is not run by the ./configure script.
> This is not major, but is surprising and not included in the current build 
> recipes provided by the documentation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7125) ./configure does not run ./config.status

2017-02-14 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7125:
--
Component/s: (was: cmake)
 build

> ./configure does not run ./config.status
> 
>
> Key: MESOS-7125
> URL: https://issues.apache.org/jira/browse/MESOS-7125
> Project: Mesos
>  Issue Type: Bug
>  Components: build, general
>Affects Versions: 1.3.0
>Reporter: Will Rouesnel
>Priority: Minor
>
> When building a fresh checkout with make, ./bootstrap && ./configure && make 
> will fail because ./config.status is not run by the ./configure script.
> This is not major, but is surprising and not included in the current build 
> recipes provided by the documentation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7126) configure fails (without flags) on CentOS 7.3

2017-02-14 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866404#comment-15866404
 ] 

Anand Mazumdar commented on MESOS-7126:
---

That's strange. Can you check whether {{libsasl2.so}} is present, just to be 
sure? Here is the relevant check that throws this error in our {{configure}} 
script: https://github.com/apache/mesos/blob/master/configure.ac#L1638. I can't 
think of anything other than the library itself being missing.

FWIW, I just used Vagrant to boot up the {{bento/centos-7.3}} image, followed 
the instructions for CentOS 7.1, and {{configure}} with no arguments 
succeeded.

{noformat}
$ yum list installed cyrus*
cyrus-sasl.x86_64          2.1.26-20.el7_2   @base
cyrus-sasl-devel.x86_64    2.1.26-20.el7_2   @base
cyrus-sasl-lib.x86_64      2.1.26-20.el7_2   @anaconda
cyrus-sasl-md5.x86_64      2.1.26-20.el7_2   @base
{noformat}

> configure fails (without flags) on CentOS 7.3
> -
>
> Key: MESOS-7126
> URL: https://issues.apache.org/jira/browse/MESOS-7126
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: CentOS 7.3
> Linux Kernel 3.10.0
>Reporter: Raul Harnasch
>
> ../configure produces CRAM-MD5 error:
> {quote}
> ' We need CRAM-MD5 support for SASL authentication. '
> {quote}
> libnl3-devel has been installed as per 
> https://issues.apache.org/jira/browse/MESOS-6649 (even though I am not 
> running configure with network-isolator)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7126) configure fails (without flags) on CentOS 7.3

2017-02-14 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866492#comment-15866492
 ] 

Anand Mazumdar commented on MESOS-7126:
---

Hmm, I am out of ideas. cc'ing [~tillt] [~adam-mesos] in case they have seen 
this before.

> configure fails (without flags) on CentOS 7.3
> -
>
> Key: MESOS-7126
> URL: https://issues.apache.org/jira/browse/MESOS-7126
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: CentOS 7.3
> Linux Kernel 3.10.0
>Reporter: Raul Harnasch
>
> ../configure produces CRAM-MD5 error:
> {quote}
> ' We need CRAM-MD5 support for SASL authentication. '
> {quote}
> libnl3-devel has been installed as per 
> https://issues.apache.org/jira/browse/MESOS-6649 (even though I am not 
> running configure with network-isolator)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7129) Default executor exits with a stack trace in a few scenarios.

2017-02-14 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7129:
-

 Summary: Default executor exits with a stack trace in a few 
scenarios.
 Key: MESOS-7129
 URL: https://issues.apache.org/jira/browse/MESOS-7129
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar
Priority: Blocker


This happened because MESOS-6296 accidentally made it into the 1.2 release. In 
some scenarios, the default executor commits suicide (as expected) but does so 
with a stack trace due to a failed invariant check.

The error scenarios occur when some task(s) in the task group terminate 
successfully and the remaining task(s) are subsequently either killed or 
terminate with a non-zero status code.

E.g., Task 1 terminates with status 0 and Task 2 is killed by the scheduler; in 
this scenario, the default executor commits suicide with a stack trace.
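
For illustration only (this is not the executor's actual code), a tiny 
standalone example of the kind of mixed terminal-state combination involved, 
where the exit decision should be made gracefully instead of via an aborting 
invariant check:

{code}
#include <iostream>
#include <string>
#include <vector>

// Standalone illustration: a task group that ended with mixed terminal
// states (one finished, one killed). The executor's shutdown path has to
// handle this combination and exit cleanly, rather than hitting a CHECK
// that prints a stack trace.
int main()
{
  const std::vector<std::string> terminalStates =
    {"TASK_FINISHED", "TASK_KILLED"};

  bool anyKilledOrFailed = false;
  for (const std::string& state : terminalStates) {
    if (state != "TASK_FINISHED") {
      anyKilledOrFailed = true;
    }
  }

  // Exit non-zero when any task was killed or failed, but without aborting.
  std::cout << "exiting with status " << (anyKilledOrFailed ? 1 : 0)
            << std::endl;
  return anyKilledOrFailed ? 1 : 0;
}
{code}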





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7131) Make the long lived framework use the default executor.

2017-02-15 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7131:
-

 Summary: Make the long lived framework use the default executor.
 Key: MESOS-7131
 URL: https://issues.apache.org/jira/browse/MESOS-7131
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently, the [long lived framework|
https://github.com/apache/mesos/blob/master/src/examples/long_lived_framework.cpp]
uses its own executor {{long_lived_executor}} for launching tasks. Now that 
we have added functionality to the default executor to launch multiple task 
groups (MESOS-6296), we should make the long lived framework use the [default 
executor|
https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp] 
instead.

Also, we do not have a metric for executor failures in the long lived 
framework. We should consider adding one as part of this change so that we can 
test the life cycle of the default executor.
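
As a rough sketch of what the switch involves (placeholder IDs and a single 
task; not the actual patch), building a {{LAUNCH_GROUP}} operation that uses 
the default executor looks roughly like this:

{code}
#include <mesos/v1/mesos.hpp>

// Sketch only: construct a LAUNCH_GROUP operation that runs a task group on
// the built-in default executor. The executor ID, framework ID, and the task
// are placeholders supplied by the caller.
mesos::v1::Offer::Operation buildLaunchGroup(
    const mesos::v1::FrameworkID& frameworkId,
    const mesos::v1::TaskInfo& task)
{
  mesos::v1::ExecutorInfo executor;
  executor.set_type(mesos::v1::ExecutorInfo::DEFAULT);
  executor.mutable_executor_id()->set_value("default");
  executor.mutable_framework_id()->CopyFrom(frameworkId);

  mesos::v1::TaskGroupInfo taskGroup;
  taskGroup.add_tasks()->CopyFrom(task);

  mesos::v1::Offer::Operation operation;
  operation.set_type(mesos::v1::Offer::Operation::LAUNCH_GROUP);
  operation.mutable_launch_group()->mutable_executor()->CopyFrom(executor);
  operation.mutable_launch_group()->mutable_task_group()->CopyFrom(taskGroup);

  return operation;
}
{code}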



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868475#comment-15868475
 ] 

Anand Mazumdar commented on MESOS-7119:
---

{noformat}
commit 3ade0364edb1b905c1e5ee6cb143c3f8728f8ba9
Author: Anand Mazumdar 
Date:   Wed Feb 15 11:56:37 2017 -0800

Fixed a crash on the master upon receiving an invalid inverse offer.

The erroneous invariant check for `slaveId` can be trigerred when
the master accepts an invalid inverse offer or when the inverse offer
has been already rescinded.

Review: https://reviews.apache.org/r/56587/
{noformat}
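
The general shape of the fix (a sketch under the assumption of a simple 
lookup, not the actual diff) is to treat a missing {{slaveId}} as a 
recoverable condition rather than asserting on it:

{code}
#include <map>
#include <string>

#include <glog/logging.h>

#include <stout/none.hpp>
#include <stout/option.hpp>

// Stand-in for the master's real bookkeeping of outstanding inverse offers.
static std::map<std::string, std::string> inverseOfferToSlave;

void acceptInverseOffer(const std::string& inverseOfferId)
{
  Option<std::string> slaveId = None();

  auto it = inverseOfferToSlave.find(inverseOfferId);
  if (it != inverseOfferToSlave.end()) {
    slaveId = it->second;
  }

  // Previously a `CHECK_SOME(slaveId)` aborted the master here; an unknown
  // or already rescinded inverse offer is now logged and dropped instead.
  if (slaveId.isNone()) {
    LOG(WARNING) << "Ignoring accept of unknown or already rescinded"
                 << " inverse offer " << inverseOfferId;
    return;
  }

  // ... continue processing using slaveId.get() ...
}
{code}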

> Mesos master crash while accepting inverse offer.
> -
>
> Key: MESOS-7119
> URL: https://issues.apache.org/jira/browse/MESOS-7119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: maintenance, mesosphere
> Attachments: crash-log-master.gz
>
>
> We noticed a Mesos master invariant check failing leading to a crash while 
> accepting an inverse offer. The {{HEAD}} is : 
> {{c7fc1377b33c4eb83a01167bdb53c102c06b9a99}} from Jan 11. 
> https://github.com/apache/mesos/commit/c7fc1377b33c4eb83a01167bdb53c102c06b9a99
> {noformat}
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 
> offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
> /master/api/v1/scheduler from 10.10.0.68:41428
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring 
> accept of inverse offer 01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: F0211 17:00:41.584259 27369 master.cpp:4605] 
> CHECK_SOME(slaveId): is NONE
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: *** Check failure stack trace: ***
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9af0da91ad  google::L

[jira] [Commented] (MESOS-7119) Mesos master crash while accepting inverse offer.

2017-02-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868482#comment-15868482
 ] 

Anand Mazumdar commented on MESOS-7119:
---

Commit to 1.2.x branch
{noformat}
commit e53fcaf7f7cde11f76d05492d50b6458482168e0
Author: Anand Mazumdar 
Date:   Wed Feb 15 11:56:37 2017 -0800

Fixed a crash on the master upon receiving an invalid inverse offer.

The erroneous invariant check for `slaveId` can be trigerred when
the master accepts an invalid inverse offer or when the inverse offer
has been already rescinded.

Review: https://reviews.apache.org/r/56587/
{noformat}

> Mesos master crash while accepting inverse offer.
> -
>
> Key: MESOS-7119
> URL: https://issues.apache.org/jira/browse/MESOS-7119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: maintenance, mesosphere
> Attachments: crash-log-master.gz
>
>
> We noticed a Mesos master invariant check failing leading to a crash while 
> accepting an inverse offer. The {{HEAD}} is : 
> {{c7fc1377b33c4eb83a01167bdb53c102c06b9a99}} from Jan 11. 
> https://github.com/apache/mesos/commit/c7fc1377b33c4eb83a01167bdb53c102c06b9a99
> {noformat}
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564393 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0002
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564457 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.564517 27362 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.566793 27367 master.cpp:6664] Sending 1 
> offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009 (hello-
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567001 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0001
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567091 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0009
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567168 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0018
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567234 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567322 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 2d45d0b7-0d58-43e4-9662-d876a100a055-0012
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567405 27367 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567876 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0061
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.567975 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0062
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.568056 27363 master.cpp:6754] Sending 1 
> inverse offers to framework 98b4f7a3-fc41-48c8-a37d-ed85ed371929-0003
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: I0211 17:00:41.584126 27369 http.cpp:410] HTTP POST for 
> /master/api/v1/scheduler from 10.10.0.68:41428
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: W0211 17:00:41.584228 27369 master.cpp:4601] Ignoring 
> accept of inverse offer 01021b50-55f0-420e-8744-1ba1eceb3f55-O135611 s
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: F0211 17:00:41.584259 27369 master.cpp:4605] 
> CHECK_SOME(slaveId): is NONE
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: *** Check failure stack trace: ***
> Feb 11 17:00:41 ip-10-10-0-215.us-west-2.compute.internal 
> mesos-master[27357]: @ 0x7f9

[jira] [Commented] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868485#comment-15868485
 ] 

Anand Mazumdar commented on MESOS-7102:
---

Commit to 1.2.x branch
{noformat}
commit 7e5439d55fd89cb9336220d9a1847391384ea8d5
Author: Anand Mazumdar 
Date:   Fri Feb 10 15:41:11 2017 -0800

Fixed a crash on the agent when handling the SIGUSR1 signal.

There were some actors that were not being destructed when
`finalize()` was being invoked. Also fixed the order of the
destruction of objects i.e., in the reverse order of their
creation.

Review: https://reviews.apache.org/r/56525/
{noformat}
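
The commit message above describes the fix as a pattern rather than a one-line change: destroy the remaining actors explicitly in {{finalize()}}, in the reverse order of their creation, before libprocess itself is torn down. The following is only a hedged, hypothetical C++ sketch of that pattern; the type names are stand-ins, and the real change lives in the agent code referenced by the review above.
{noformat}
#include <memory>

// Hypothetical stand-ins for libprocess-backed components; the actual fix
// touches the Mesos agent's own members, which are not reproduced here.
struct StatusUpdateManager { /* owns a libprocess actor */ };
struct Fetcher { /* owns a libprocess actor */ };

class Agent {
public:
  Agent()
    : statusUpdateManager(new StatusUpdateManager()),  // created first...
      fetcher(new Fetcher()) {}                        // ...created second

  // finalize() tears the members down explicitly, in the reverse order of
  // their creation, so nothing outlives the process library it depends on.
  void finalize() {
    fetcher.reset();              // destroyed first
    statusUpdateManager.reset();  // destroyed last
  }

private:
  std::unique_ptr<StatusUpdateManager> statusUpdateManager;
  std::unique_ptr<Fetcher> fetcher;
};

int main() {
  Agent agent;
  agent.finalize();  // e.g. invoked while handling SIGUSR1
  return 0;
}
{noformat}
In the stack trace quoted below, {{~StatusUpdateManager()}} ends up calling {{process::wait()}} very late in {{main()}}, which is consistent with destruction happening only after the rest of the runtime has already gone away; moving the destruction into {{finalize()}} avoids that ordering.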

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.3.0
>
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7102) Crash when sending a SIGUSR1 signal to the agent.

2017-02-15 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7102:
--
Target Version/s:   (was: 1.2.1)
   Fix Version/s: (was: 1.3.0)
  1.2.0

> Crash when sending a SIGUSR1 signal to the agent.
> -
>
> Key: MESOS-7102
> URL: https://issues.apache.org/jira/browse/MESOS-7102
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.2.0
>
>
> Looks like sending a {{SIGUSR1}} to the agent crashes it. This is a 
> regression; it used to work fine in the 1.1 release. Note that the agent does 
> unregister with the master, and the crash happens after that.
> Steps to reproduce:
> - Start the agent.
> - Send it a {{SIGUSR1}} signal.
> The agent should crash with a stack trace similar to this:
> {noformat}
> I0209 16:19:46.210819 31977472 slave.cpp:851] Received SIGUSR1 signal from 
> user gmann; unregistering and shutting down
> I0209 16:19:46.210960 31977472 slave.cpp:803] Agent terminating
> *** Aborted at 1486685986 (unix time) try "date -d @1486685986" if you are 
> using GNU date ***
> PC: @ 0x7fffbc4904fc _pthread_key_global_init
> *** SIGSEGV (@0x38) received by PID 88894 (TID 0x7fffc50c83c0) stack trace: 
> ***
> @ 0x7fffbc488bba _sigtramp
> @ 0x7fe8a5d03f38 (unknown)
> @0x10b6d67d9 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENKUlPS1_E_clES6_
> @0x10b6d67b8 
> _ZZ11synchronizeINSt3__115recursive_mutexEE12SynchronizedIT_EPS3_ENUlPS1_E_8__invokeES6_
> @0x10b6d6889 Synchronized<>::Synchronized()
> @0x10b6d678d Synchronized<>::Synchronized()
> @0x10b6a708a synchronize<>()
> @0x10e2f148d process::ProcessManager::wait()
> @0x10e2e9a78 process::wait()
> @0x10b30614f process::wait()
> @0x10c9619dc 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10c961a55 
> mesos::internal::slave::StatusUpdateManager::~StatusUpdateManager()
> @0x10b1ab035 main
> @ 0x7fffbc27b255 start
> [1]88894 segmentation fault  bin/mesos-agent.sh --master=127.0.0.1:5050
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-02-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868493#comment-15868493
 ] 

Anand Mazumdar commented on MESOS-7130:
---

[~gilbert] [~avinash.mesos] Do you have any insights into this?

> port_mapping isolator: executor hangs when running on EC2
> -
>
> Key: MESOS-7130
> URL: https://issues.apache.org/jira/browse/MESOS-7130
> Project: Mesos
>  Issue Type: Bug
>  Components: ec2, executor
>Reporter: Pierre Cheynier
>
> Hi,
> I'm experiencing a weird issue: I'm using a CI to test infrastructure 
> automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests for bare-metal servers 
> and VirtualBox VMs using this configuration.
> But when I try on EC2 (on which my CI pipeline relies) it systematically fails 
> to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to 
> be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent :0
> + tc filter show dev lo parent :0
> I0215 16:01:13.941375 1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and attaching 
> gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0  0x7feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /usr/lib64/libpthread.so.0
> #1  0x7feffbed69ec in 
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from 
> /usr/lib64/libstdc++.so.6
> #2  0x7ff0003dd8ec in void synchronized_wait<std::condition_variable, 
> std::mutex>(std::condition_variable*, std::mutex*) () from 
> /usr/lib64/libmesos-1.0.2.so
> #3  0x7ff0017d595d in Gate::arrive(long) () from 
> /usr/lib64/libmesos-1.0.2.so
> #4  0x7ff0017c00ed in process::ProcessManager::wait(process::UPID const&) 
> () from /usr/lib64/libmesos-1.0.2.so
> #5  0x7ff0017c5c05 in process::wait(process::UPID const&, Duration 
> const&) () from /usr/lib64/libmesos-1.0.2.so
> #6  0x004ab26f in process::wait(process::ProcessBase const*, Duration 
> const&) ()
> #7  0x004a3903 in main ()
> {noformat}
> I concluded that the underlying shell script launched by the isolator, or the 
> task itself, is simply blocked. But I don't understand why.
> Here is a process tree to show that I've no task running but the executor is:
> {noformat}
> root 28420  0.8  3.0 1061420 124940 ?  Ssl  17:56   0:25 
> /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 
> --attributes=platform:centos;platform_major_version:7;type:base 
> --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup 
> --cgroups_net_cls_primary_handle=0xC370 
> --container_logger=org_apache_mesos_LogrotateContainerLogger 
> --containerizers=mesos,docker 
> --credential=file:///etc/mesos-chef/slave-credential 
> --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
>  --default_role=default --docker_registry=/usr/share/mesos/users 
> --docker_store_dir=/var/opt/mesos/store/docker 
> --egress_unique_flow_per_container --enforce_container_disk_quota 
> --ephemeral_ports_per_container=128 
> --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
>  --image_providers=docker --image_provisioner_backend=copy 
> --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
>  --logging_level=INFO 
> --master=zk://mesos:test@localhost.localdomain:2181/mesos 
> --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 
> --recover=reconnect 
> --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict 
> --work_dir=/var/opt/mesos
> root 28484  0.0  2.3 433676 95016 ?Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
>  --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root 28485  0.0  2.3 499212 94724 ?Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group
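
The gdb stack quoted above bottoms out in {{pthread_cond_wait}} via {{process::ProcessManager::wait()}}: the executor's main thread is parked on a condition variable that is never signalled. The snippet below is only a minimal, self-contained illustration of that failure shape, not the Mesos code itself; it deliberately blocks forever when run.
{noformat}
#include <condition_variable>
#include <iostream>
#include <mutex>

int main() {
  std::mutex mutex;
  std::condition_variable cond;
  bool terminated = false;  // nothing in this program ever sets this

  std::unique_lock<std::mutex> lock(mutex);
  std::cout << "waiting for the process to terminate..." << std::endl;

  // With no other thread around to set `terminated` and call notify_one(),
  // this wait blocks indefinitely -- the same shape as the executor stuck
  // in process::ProcessManager::wait() in the stack above.
  cond.wait(lock, [&] { return terminated; });

  std::cout << "never reached" << std::endl;
  return 0;
}
{noformat}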

[jira] [Updated] (MESOS-5186) mesos.interface: Allow using protobuf 3.x

2017-02-16 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5186:
--
Shepherd: Anand Mazumdar

> mesos.interface: Allow using protobuf 3.x
> -
>
> Key: MESOS-5186
> URL: https://issues.apache.org/jira/browse/MESOS-5186
> Project: Mesos
>  Issue Type: Improvement
>  Components: python api
>Reporter: Myautsai PAN
>Assignee: Yong Tang
>  Labels: protobuf, python
>
> We're working on integrating TensorFlow (https://www.tensorflow.org) with 
> Mesos. Both require {{protobuf}}. The Python package 
> {{mesos.interface}} requires {{protobuf>=2.6.1,<3}}, but {{tensorflow}} 
> requires {{protobuf>=3.0.0}}. Although protobuf 3.x is not compatible with 
> protobuf 2.x, we modified the {{setup.py}} 
> (https://github.com/apache/mesos/blob/66cddaf/src/python/interface/setup.py.in#L29)
> from {{'install_requires': [ 'google-common>=0.0.1', 'protobuf>=2.6.1,<3' 
> ],}} to {{'install_requires': [ 'google-common>=0.0.1', 'protobuf>=2.6.1' ],}} 
> and it works fine. Would you please consider supporting protobuf 3.x officially 
> in the next release? Maybe just removing the {{,<3}} restriction is enough.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

