[jira] [Created] (MESOS-10159) Running unit test command hangs

2020-07-07 Thread Jinesh Patel (Jira)
Jinesh Patel created MESOS-10159:


 Summary: Running unit test command hangs
 Key: MESOS-10159
 URL: https://issues.apache.org/jira/browse/MESOS-10159
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: OS: Ubuntu 20.04
Arch: Intel
Reporter: Jinesh Patel


Running `make check` to execute the Mesos test cases hangs after printing the 
failed test results. The process does not hang if all test cases pass.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10158) Mesos Agent gets stuck in Draining due to pending unacknowledged status updates

2020-07-07 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10158:
-

 Summary: Mesos Agent gets stuck in Draining due to pending 
unacknowledged status updates
 Key: MESOS-10158
 URL: https://issues.apache.org/jira/browse/MESOS-10158
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Andrei Budnik


A Mesos agent can get stuck in the DRAINING state because of pending 
unacknowledged status updates. When a framework becomes disconnected, the 
agent keeps sending task status updates for that framework's terminated tasks. 
Because the master transitions the agent from DRAINING to DRAINED only after 
all task status updates have been acknowledged, the agent remains stuck in 
DRAINING.

This problem can be resolved by sending a ["Teardown" 
operation|https://github.com/apache/mesos/blob/8ce5d30808f3744eeded09d530f226079d569a94/include/mesos/v1/master/master.proto#L299-L303]
 for each lost framework. However, it would be much better if the master 
handled this situation automatically. At the very least, we should make it 
easier for an operator to find out what prevents the draining operation from 
completing.
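
As a rough illustration of the manual workaround, the sketch below builds the 
TEARDOWN call for the master's v1 operator API (a POST to /api/v1). The call 
body follows the Teardown message linked above; the master URL, the framework 
ID, and the helper names are placeholders made up for this example, and a real 
cluster may also require authentication headers.

```python
import json
import urllib.request


def build_teardown_call(framework_id: str) -> dict:
    """Build the JSON body of a v1 master TEARDOWN call for one framework."""
    return {
        "type": "TEARDOWN",
        "teardown": {"framework_id": {"value": framework_id}},
    }


def teardown_request(master_url: str, framework_id: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request to the master's /api/v1 endpoint."""
    body = json.dumps(build_teardown_call(framework_id)).encode("utf-8")
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/api/v1",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical values: replace with the real master address and the IDs
    # of the frameworks that are known to be lost.
    lost_framework_ids = ["example-framework-id"]
    for framework_id in lost_framework_ids:
        request = teardown_request("http://master.example.com:5050", framework_id)
        print(request.full_url, request.data.decode("utf-8"))
        # urllib.request.urlopen(request)  # uncomment to actually send the call
```

The request is only printed here, so an operator can review the list of 
frameworks before tearing anything down.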





[jira] [Commented] (MESOS-10140) CMake Error: Problem with archive_read_open_file(): Unrecognized archive format

2020-07-07 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152815#comment-17152815
 ] 

Greg Mann commented on MESOS-10140:
---

[~QuellaZhang] could you try building again on the latest master branch of 
Mesos? We believe the issue should be fixed now. If so, please close this 
ticket; otherwise, let us know. Thanks!

> CMake Error: Problem with archive_read_open_file(): Unrecognized archive 
> format
> ---
>
> Key: MESOS-10140
> URL: https://issues.apache.org/jira/browse/MESOS-10140
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
> Attachments: mesos_build.log
>
>
> Hi All,
> We tried to build Mesos on Windows with VS2019. The build failed with 
> "CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): 
> Unrecognized archive format 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]" when 
> using MSVC. It can be reproduced on the latest revision, d4634f4, on the 
> master branch. Could you help confirm? We use cmake version 3.17.2.
>  
> Reproduce steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos F:\gitP\apache\mesos
> 2. Open a VS 2019 x64 command prompt as admin and browse to F:\gitP\apache\mesos
> 3. mkdir build_amd64 && pushd build_amd64
> 4. cmake -G "Visual Studio 16 2019" -A x64 -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5. set _CL_=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING
> 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln /t:Rebuild
>  
> ErrorMessage:
> *manual run:*
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake --version
>  cmake version 3.17.2
> CMake suite maintained and supported by Kitware (kitware.com/cmake).
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake -E tar xjf 
> archive.tar
>  CMake Error: Problem with archive_read_open_file(): Unrecognized archive 
> format
>  CMake Error: Problem extracting tar: archive.tar
> *build log: (see attachment)*
> 59>CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): 
> Unrecognized archive format 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
>  59>CUSTOMBUILD : CMake error : Problem extracting tar: 
> F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
>  – extracting... [error clean up]
>  CMake Error at wclayer-WIP-stamp/extract-wclayer-WIP.cmake:33 (message):
>  59>CUSTOMBUILD : error : extract of 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
>  'F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar'
>  failed
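
The thread does not state the root cause, but the error says the extractor 
could not recognize archive.tar as an archive. A quick way to narrow this down 
is to check whether the downloaded file exists, is non-empty, and looks like a 
tar at all. The sketch below uses Python's standard tarfile module as an 
independent check; the helper name is made up for this example, and Python and 
CMake do not recognize exactly the same set of formats, so treat the result as 
a hint only.

```python
import tarfile
from pathlib import Path


def describe_archive(path: str) -> str:
    """Classify a file as 'missing', 'empty', 'tar', or 'unrecognized'."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    # is_tarfile() also accepts gzip-, bzip2-, and xz-compressed tar files.
    if tarfile.is_tarfile(str(p)):
        return "tar"
    return "unrecognized"


if __name__ == "__main__":
    # Path taken from the build log above; adjust for the local checkout.
    archive = r"F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src\archive.tar"
    print(archive, "->", describe_archive(archive))
```

An "empty" or "unrecognized" result would suggest that the download step 
produced a bad file rather than that the extraction step itself is at fault.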





[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating

2020-07-07 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152811#comment-17152811
 ] 

Greg Mann commented on MESOS-10143:
---

[~puneetku287] it's unclear to me from the description if this is an issue in 
Mesos or in your scheduler. A more precise description of the framework's 
behavior during the incidents would help - what does the scheduler do with the 
offers during this time? Feel free to find us on Mesos Slack; that might be an 
easier place to have a synchronous discussion about your issue.

> Outstanding Offers accumulating
> ---
>
> Key: MESOS-10143
> URL: https://issues.apache.org/jira/browse/MESOS-10143
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver
>Affects Versions: 1.7.0
> Environment: Mesos Version 1.7.0
> JDK 8.0
>Reporter: Puneet Kumar
>Priority: Minor
>
> We manage an Apache Mesos cluster running version 1.7.0. We have written a 
> framework in Java that schedules tasks on the Mesos master at a rate of 300 
> TPS. Everything works fine for almost 24 hours, but then outstanding offers 
> accumulate and saturate within 15 minutes. The outstanding offers aren't 
> reclaimed by the Mesos master. We observe "RescindOffer" messages in the 
> verbose (GLOG v=3) framework logs, but the number of outstanding offers 
> doesn't decrease. New resources aren't offered to the framework once 
> outstanding offers saturate. We have to restart the scheduler to reset the 
> outstanding offers to zero.
> Any suggestions for debugging this issue are welcome.





[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash

2020-07-07 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152809#comment-17152809
 ] 

Greg Mann commented on MESOS-10146:
---

[~sunshine123] thank you for the bug report! Would it be possible to get a full 
verbose master log from an incident? The logs surrounding the check failure may 
help us pinpoint the issue more precisely.

> Removing task from slave when framework is disconnected causes master to crash
> --
>
> Key: MESOS-10146
> URL: https://issues.apache.org/jira/browse/MESOS-10146
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, framework
>Affects Versions: 1.9.0
> Environment: Mesos master with three master nodes
>Reporter: Naveen
>Priority: Major
>
> Hello, 
> we want to report an issue we observed when removing tasks from a slave. 
> There is a condition that checks for a valid framework before tasks can be 
> removed. A framework can be disconnected for several reasons. When it is, 
> this check fails and crashes the Mesos master node. 
> [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]
> There is also an unguarded access to the internal framework state on line 
> 11853.
> Error logs - 
> {noformat}
> mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health 
> check timed out
> mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check 
> failed: framework != nullptr Framework 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 
> (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } 
> }
> mesos-master[5483]: *** Check failure stack trace: ***
> mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed 
> all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed 
> agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica 
> received learned notice for position 42070 from 
> log-network(1)@10.160.73.212:5050
> mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
> mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
> mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
> mesos-master[5483]: @ 0x7f2fdf6a8859 
> google::LogMessageFatal::~LogMessageFatal()
> mesos-master[5483]: @ 0x7f2fde2677f2 
> mesos::internal::master::Master::__removeSlave()
> mesos-master[5483]: @ 0x7f2fde267ebe 
> mesos::internal::master::Master::_markUnreachable()
> mesos-master[5483]: @ 0x7f2fde268215 
> _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbclEv
> mesos-master[5483]: @ 0x7f2fddf30688 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEclEOS3_
> mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
> mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
> mesos-master[5483]: @ 0x7f2fdf60cb36 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
> mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
> mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
> systemd[1]: mesos-master.service: main process exited, code=killed, 
> status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service failed.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopped Mesos Master.
> systemd[1]: Started Mesos Master.
> mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level 
> logging started!
> mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 
> 2020-05-09 10:42:00 by centos
> mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 
> 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat}
>  





[jira] [Created] (MESOS-10157) Add document for the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10157:
--

 Summary: Add document for the `volume/csi` isolator
 Key: MESOS-10157
 URL: https://issues.apache.org/jira/browse/MESOS-10157
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10156:
--

 Summary: Enable the `volume/csi` isolator in UCR
 Key: MESOS-10156
 URL: https://issues.apache.org/jira/browse/MESOS-10156
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10155:
--

 Summary: Implement the `recover` method of the `volume/csi` 
isolator
 Key: MESOS-10155
 URL: https://issues.apache.org/jira/browse/MESOS-10155
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10154) Implement the `cleanup` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10154:
--

 Summary: Implement the `cleanup` method of the `volume/csi` 
isolator
 Key: MESOS-10154
 URL: https://issues.apache.org/jira/browse/MESOS-10154
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10153:
--

 Summary: Implement the `prepare` method of the `volume/csi` 
isolator
 Key: MESOS-10153
 URL: https://issues.apache.org/jira/browse/MESOS-10153
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10152) Implement the `create` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10152:
--

 Summary: Implement the `create` method of the `volume/csi` isolator
 Key: MESOS-10152
 URL: https://issues.apache.org/jira/browse/MESOS-10152
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`

2020-07-07 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152554#comment-17152554
 ] 

Qian Zhang commented on MESOS-10151:


See 
[here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.iobmmefa9bop]
 for the detailed design.

> Introduce a new agent flag `--csi_plugin_config_dir`
> 
>
> Key: MESOS-10151
> URL: https://issues.apache.org/jira/browse/MESOS-10151
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Priority: Major
>






[jira] [Created] (MESOS-10151) Implement the `create` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10151:
--

 Summary: Implement the `create` method of the `volume/csi` isolator
 Key: MESOS-10151
 URL: https://issues.apache.org/jira/browse/MESOS-10151
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10150:
--

 Summary: Refactor CSI volume manager to support pre-provisioned 
CSI volumes
 Key: MESOS-10150
 URL: https://issues.apache.org/jira/browse/MESOS-10150
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


The existing 
[VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138]
 is essentially a wrapper around various CSI gRPC calls. We could leverage it 
to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` 
isolator. However, the lifecycle of the volumes managed by VolumeManager starts 
from the 
`[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]`
 CSI call, whereas the MVP is meant to support pre-provisioned volumes. We 
therefore need to refactor VolumeManager so that it supports pre-provisioned 
volumes.





[jira] [Created] (MESOS-10149) Refactor CSI service manager to support unmanaged CSI plugins

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10149:
--

 Summary: Refactor CSI service manager to support unmanaged CSI 
plugins
 Key: MESOS-10149
 URL: https://issues.apache.org/jira/browse/MESOS-10149
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


Refactor the [CSI service 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L81]
 so that it supports unmanaged plugins (i.e., plugins deployed outside of 
Mesos), and make its `getServiceEndpoint` method also return the endpoints of 
unmanaged plugins.


