from:"haosdent"

Re: New 1.5 Marathon deb package - no documentation

2018-01-10 Thread haosdent

Marathon 1.5 uses /usr/share/marathon/conf/application.ini as its configuration file.

On Wed, Jan 10, 2018 at 4:59 PM, Adam Cecile  wrote:

> Hello,
>
>
> I'm testing Mesos 1.4 + marathon 1.5 update but I cannot understand how
> marathon 1.5 deb package works.
>
> Marathon binary seems to completely ignore my /etc/marathon/* config files
> used by previous version and when looking at the systemd file, I do not
> understand how to pass startup command line switches in this version.
>
>
> Is there any documentation I missed ?
>
>
> Thanks.
>
>


-- 
Best Regards,
Haosdent Huang

Re: Data locality in Mesos

2017-06-28 Thread haosdent

Hi @tobias, I think a lot of people run into this problem. According to the CSI
design document
<https://docs.google.com/document/d/125YWqg_5BB5OY9a6M7LZcby5RSqBwo2PZzpVLuxYXh4/edit>
(from @jieyu), Mesos is adding a new component, the resource provider, which
I think may help to solve the data locality problem.

Dynamic attributes are also doable, I think: we could expose them via
HTTP APIs, just like dynamic reservation.
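
To make the attribute idea from the question below concrete, here is a minimal
sketch of how a framework scheduler might prefer the agent that already holds
the wanted datasets. It is plain Python, not the Mesos API: the `offers`
mapping and the `pick_agent` helper are hypothetical and only illustrate the
selection logic.

```python
from typing import Dict, Optional, Set

# Hypothetical, simplified view of offers: agent name -> datasets cached there.
Offers = Dict[str, Set[str]]


def pick_agent(offers: Offers, wanted: Set[str]) -> Optional[str]:
    """Return the agent holding the most wanted datasets, or None if none match."""
    best_agent = None
    best_hits = 0
    for agent in sorted(offers):  # sorted for a deterministic tie-break
        hits = len(offers[agent] & wanted)
        if hits > best_hits:
            best_agent, best_hits = agent, hits
    return best_agent


offers = {
    "nodeA": {"dataset1", "dataset17", "dataset3"},
    "nodeB": {"dataset17", "dataset5"},
}
print(pick_agent(offers, {"dataset5"}))   # nodeB
print(pick_agent(offers, {"dataset42"}))  # None
```

If the attribute could be updated dynamically, only the `offers` mapping would
change between scheduling rounds; tasks already running are not touched by
this selection step.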

On Wed, Jun 28, 2017 at 8:22 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> one of the major selling points of HDFS is (was?) that it is possible to
> schedule a Hadoop job close to where the data that it operates on is.  I am
> not using HDFS, but I was wondering if/how Mesos supports an approach to
> schedule a job to a machine that has a certain file/dataset already locally
> as opposed to scheduling it to a machine that would have to access it via
> the network or download to the local disk first.
>
> I was wondering if Mesos attributes could be used:  I could have an
> attribute `datasets` of type `set` and then node A could have {dataset1,
> dataset17, dataset3} and node B could have {dataset17, dataset5} and during
> scheduling I could decide based on this attribute where to run a task.
> However, I was wondering if there are dynamic changes of such attributes
> possible.  Imagine that node A deletes dataset17 from the local cache and
> downloads dataset5 instead, then I would like to update the `datasets`
> attribute dynamically, but without affecting the jobs that are running on
> node A.  Is such a thing possible?
>
> Is there an approach other than attributes to describe the data that
> resides on a node in order to achieve data locality?
>
> Thanks
> Tobias
>
>


-- 
Best Regards,
Haosdent Huang

Re: Test framework stalled

2017-06-20 Thread haosdent

It seems the mailing list dropped your image. Could you share your image via
http://imgur.com/ or another website?

On Tue, Jun 20, 2017 at 9:49 PM, Joao Costa  wrote:

> Hi guys,
>
> Can anyone help me with this problem:
>
> Every time I try to run the test framework examples (python, java, c++) on
> mesos-1.2.0, I get the following messages on the console:
> [image: Inline image 1]
> and then the system just freezes.
>
> The master is working, I can access the dashboard, the agents are
> registered in the master and appearing on the dashboard. I have enough
> resources available.
>
> Any idea what is happening?
>
> Thanks
>



-- 
Best Regards,
Haosdent Huang

Re: [WebUI] Sandboxes proxy

2017-06-20 Thread haosdent

Hi @Jean, yes, for now all requests still go to the Mesos agents directly
until MESOS-2131 is resolved.

Our workaround is an Nginx proxy, which looks like this:

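```
server {
    # Accept traffic on the Mesos master (5050) and agent (5051) ports.
    listen 5050;
    listen 5051;

    location / {
        # Resolve the requested host name through the internal DNS server.
        resolver   internal-dns ipv6=off;

        # Authenticate each request; send the user to sign-in on a 401.
        auth_request @auth;
        error_page 401 = @sign_in;

        # Forward to the host (and port) named in the request's Host header.
        proxy_pass http://$http_host;
    }
}
```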

So all the domains are bound to the IP of this proxy, but the internal DNS
returns each host's internal IP, and the proxy handles HTTPS and
authentication and forwards your requests. I hope this gives you some ideas
for resolving your problem.
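
The routing idea can be sketched in a few lines of Python. This is only an
illustration of the lookup the proxy performs, not our actual setup: the
`INTERNAL_DNS` table, the host names, and the `upstream_url` helper are all
hypothetical.

```python
from typing import Dict

# Hypothetical internal DNS table: agent host name -> internal IP.
INTERNAL_DNS: Dict[str, str] = {
    "agent1.example.internal": "172.16.0.11",
    "agent2.example.internal": "172.16.0.12",
}


def upstream_url(host_header: str, internal_dns: Dict[str, str]) -> str:
    """Map a request's Host header to the internal upstream the proxy contacts."""
    host, _, port = host_header.partition(":")
    ip = internal_dns[host]  # raises KeyError for hosts the internal DNS does not know
    return f"http://{ip}:{port}" if port else f"http://{ip}"


print(upstream_url("agent1.example.internal:5051", INTERNAL_DNS))
# http://172.16.0.11:5051
```

Users only ever talk to the proxy's address; the mapping to the agents subnet
happens on the proxy side, after authentication.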

On Tue, Jun 20, 2017 at 7:56 PM, Tomek Janiszewski 
wrote:

> There was an attempt to build a reverse proxy into the master:
> https://issues.apache.org/jira/browse/MESOS-2131
> Agents are called directly by the browser, so you need to create a proxy that
> captures requests to the agents and passes them on.
>
> On Tue, 20 Jun 2017 at 10:14, Jean-Baptiste wrote:
>
>> Hi there!
>>
>> Does anyone know if it’s possible to access "*sandboxes*" (from the *Mesos
>> web UI*) by proxying agent requests in some way?
>>
>> *Context: *Our network topology doesn’t allow users to access the agents
>> subnet so when they try to reach the Mesos agent port (eg:
>> `172.x.x.x:5051`), it doesn’t (obviously) work.
>>
>> *Web UI error:*
>>
>> 
>> Thanks!
>>
>> --
>>
>> Jean-Baptiste FAREZ
>>
>


-- 
Best Regards,
Haosdent Huang

Re: Failed to create the cgroup at ...

2017-06-11 Thread haosdent

Hi, have you mounted the devices subsystem at /sys/fs/cgroup/devices/? What is
the content of `/proc/self/mountinfo`?
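
If it helps, here is a small Python sketch for checking this. It parses
`/proc/self/mountinfo`-style text and lists the cgroup (v1) mount points that
carry a given subsystem. The `cgroup_mount_points` helper and the `SAMPLE`
lines are made up for illustration; on a real host you would pass in the
contents of `/proc/self/mountinfo`.

```python
from typing import List


def cgroup_mount_points(mountinfo: str, subsystem: str) -> List[str]:
    """Return mount points of cgroup (v1) mounts whose options include `subsystem`."""
    points = []
    for line in mountinfo.splitlines():
        fields = line.split()
        try:
            # Fields 0-5 are fixed; optional fields follow until a lone "-".
            sep = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if len(fields) < sep + 4:
            continue
        fstype = fields[sep + 1]
        super_options = fields[sep + 3].split(",")
        if fstype == "cgroup" and subsystem in super_options:
            points.append(fields[4])  # field 4 is the mount point
    return points


# Illustrative sample, not taken from the reporter's machine.
SAMPLE = """\
25 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs ro,mode=755
30 25 0:26 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,devices
31 25 0:27 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,cpuacct,cpu
"""

print(cgroup_mount_points(SAMPLE, "devices"))  # ['/sys/fs/cgroup/devices']
```

An empty result for `devices` would suggest the subsystem is not mounted where
the agent expects it.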

On Sun, Jun 11, 2017 at 9:36 PM, Happy每一天 <527779...@qq.com> wrote:

> Hi,
> I have encountered a problem when using mesos 1.2.0, details as below:
>
> "Task XX-X is in unexpected state TASK_FAILED with reason
> 'REASON_CONTAINER_LAUNCH_FAILED' from source 'SOURCE_AGENT' with message
> 'Failed to launch container: Failed to create the cgroup at
> '/sys/fs/cgroup/devices/mesos/f70bbbda-bcd0-42dc-bb47-740c99c5b2a9':
> Failed to create directory '/sys/fs/cgroup/devices/mesos/
> f70bbbda-bcd0-42dc-bb47-740c99c5b2a9': No such file or directory'"
>
> I found that the directory /sys/fs/cgroup/devices/mesos was missing, but
> it existed when the Mesos agent started. I don't know why it goes missing, and
> it has occurred very frequently on my server.
> How can I resolve this problem?
>
>  Best Regards,
>  Weijia.Liu
>
>



-- 
Best Regards,
Haosdent Huang

Re: mesos container cluster came across health check coredump log

2017-04-04 Thread haosdent

My apologies, I am in China and have not been able to connect to the VPN in
recent days. I will check as soon as possible once I am back.

On Fri, Mar 31, 2017 at 4:20 PM, Alex Rukletsov  wrote:

> Cool, looking forward to it!
>
> On Fri, Mar 31, 2017 at 4:30 AM, tommy xiao  wrote:
>
>> Alex，Yes, let me have a try.
>>
>> 2017-03-31 3:16 GMT+08:00 Alex Rukletsov :
>>
>>> This is https://issues.apache.org/jira/browse/MESOS-7210. Deshi, do you
>>> want to send the patch? I or Haosdent can shepherd.
>>>
>>> A.
>>>
>>> On Thu, Mar 30, 2017 at 12:27 PM, tommy xiao  wrote:
>>>
>>>> interesting for the specified case.
>>>>
>>>> 2017-03-30 7:52 GMT+08:00 Jie Yu :
>>>>
>>>>> + AlexR, haosdent
>>>>>
>>>>> For posterity, the root cause of this problem is that when agent is
>>>>> running inside a docker container and `--docker_mesos_image` flag is
>>>>> specified, the pid namespace of the executor container (which initiate the
>>>>> health check) is different than the root pid namespace. Therefore, getting
>>>>> the network namespace handle using `/proc/<pid>/ns/net` does not work
>>>>> because the 'pid' here is in the root pid namespace (reported by docker
>>>>> daemon).
>>>>>
>>>>> Alex and haosdent, I think we should fix this issue. As suggested
>>>>> above, we can launch the executor container with --pid=host if
>>>>> `--docker_mesos_image` is specified.
>>>>>
>>>>> - Jie
>>>>>
>>>>> On Wed, Mar 29, 2017 at 3:56 AM, tommy xiao  wrote:
>>>>>
>>>>>> It was resolved by adding --pid=host. Thanks to the community for the
>>>>>> support, thanks a lot.
>>>>>>
>>>>>> 2017-03-29 9:52 GMT+08:00 tommy xiao :
>>>>>>
>>>>>>> My environment is as follows:
>>>>>>>
>>>>>>> Mesos 1.2, running containerized in Docker.
>>>>>>>
>>>>>>> I launch a sample nginx Docker container with a Mesos native health check,
>>>>>>>
>>>>>>> and then get a core dump in the sandbox.
>>>>>>>
>>>>>>> I have dug up some more information for your reference:
>>>>>>>
>>>>>>> In the Mesos slave container, I can only see the task container's pid, but I
>>>>>>> can't find the nginx process's pid.
>>>>>>>
>>>>>>> In the host console, however, I can find the nginx pid. So how can I get the
>>>>>>> pid inside the container?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-28 13:49 GMT+08:00 tommy xiao :
>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/MESOS-6184
>>>>>>>>
>>>>>>>> anyone give some hint?
>>>>>>>>
>>>>>>>> ```
>>>>>>>>
>>>>>>>> I0328 11:48:12.922181 48 exec.cpp:162] Version: 1.2.0
>>>>>>>> I0328 11:48:12.929252 54 exec.cpp:237] Executor registered on agent
>>>>>>>> a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4
>>>>>>>> I0328 11:48:12.931640 54 docker.cpp:850] Running docker -H
>>>>>>>> unix:///var/run/docker.sock run --cpu-shares 10 --memory 33554432
>>>>>>>> --env-file /tmp/gvqGyb -v /data/mesos/slaves/a29dc3a5-3e
>>>>>>>> 3f-4058-8ab4-dd7de2ae58d1-S4/frameworks/d7ef5d2b-f924-42d9-a
>>>>>>>> 274-c020afba6bce-/executors/0-hc-xychu-datamanmesos-2f3b
>>>>>>>> 47f9ffc048539c7b22baa6c32d8f/runs/458189b8-2ff4-4337-ad3a-67321e96f5cb:/mnt/mesos/sandbox
>>>>>>>> --net bridge --label=USER_NAME=xychu --label=GROUP_NAME=groupautotest
>>>>>>>> --label=APP_ID=hc --label=VCLUSTER=clusterautotest
>>>>>>>> --label=USER=xychu --label=CLUSTER=datamanmesos --label=SLOT=0
>>>>>>>> --label=APP=hc -p 31000:80/tcp --name mesos-a29dc3a5-3e3f-4058-8ab4-
>>>>>>>> dd7de2ae58d1-S4.458189b8-2ff4-4337-ad3a-67321e96f5cb nginx
>>>>>>>> I0328 11:48:16.145714 53 health_checker.cpp:196] Ignoring failure
>>>>>>>> as health check still in grace period
>>>>>>>> W0328 11:48:26.289958 49 heal

Re: How to write acls for mesos

2017-02-05 Thread haosdent

Could you share the Mesos master log from after you launch the task in Marathon?

On Sun, Feb 5, 2017 at 12:02 PM, 梦开始的地方 <382607...@qq.com> wrote:

> Hi, I'd like to write MESOS_acls to control resources, but I don't know
> how to write them.
> I use mesos + marathon
> marathon configuration:
> export MARATHON_mesos_user=user_00
> export MARATHON_mesos_authentication_principal=tjop_marathon
> export MARATHON_mesos_authentication_secret_file=$CUR_DIR/../conf/marathon.secret
> export MARATHON_mesos_leader_ui_url=http://mesos.webdev.com
> export MARATHON_hostname=$hostipaddr
> export MARATHON_framework_name=tjop_marathon
> export MARATHON_webui_url=http://marathon.webdev.com
>
> mesos configuration:
>
> export MESOS_acls=file://$prefix/etc/mesos/acls
> export MESOS_authenticate_agents=true
> export MESOS_authenticate_frameworks=true
> export MESOS_authorizers=local
> export MESOS_authenticate_http_frameworks=true
> export MESOS_http_framework_authenticators=basic
> export MESOS_authenticators=crammd5
> export MESOS_credentials=file://$prefix/etc/mesos/credentials
>
> acls:
> {
>   "register_frameworks": [
>     {
>       "principals": { "values": ["tjop_marathon"] },
>       "roles": { "values": ["tjop_streammonitors"] }
>     }
>   ],
>   "run_tasks": [
>     {
>       "principals": { "values": ["tjop_marathon"] },
>       "users": { "values": ["user_00","root"] }
>     }
>   ]
> }
> credentials:
> {
>   "credentials": [
>     {
>       "principal": "tjop_marathon",
>       "secret": "tjop_marathon"
>     },
>     {
>       "principal": "tjop_mesos_slave",
>       "secret": "tjop_mesos_slave"
>     }
>   ]
> }
> Marathon is an active framework in Mesos, but when I create an app in
> Marathon, the app stays in the waiting status.
>
> Please help me, thanks.
>



-- 
Best Regards,
Haosdent Huang

Re: compile mesos failed

2017-01-22 Thread haosdent

Hi, could you provide error logs?

On Mon, Jan 23, 2017 at 8:24 AM, 梦开始的地方 <382607...@qq.com> wrote:

> mesos-1.1.0
> ./configure --with-protobuf=/usr/local/services/tjop_protobuf --with-ssl
> --with-sasl  --with-zlib --with-zookeeper=/usr/local/services/tjop_zookeeper
> --with-curl --enable-static --prefix=/usr/local/services/tjop_mesos
>  --disable-shared --with-leveldb --disable-python
>
> in config.status:
> S["CPPFLAGS"]="-Iyes/include -I/usr/include/subversion-1 -Iyes/include
> -Iyes/include -Iyes/include -I/usr/include/apr-1 -I/usr/inclu
> de/apr-1.0  -Iyes/include -I/us"\
> "r/local/services/tjop_protobuf/include 
> -I/usr/local/services/tjop_protobuf/include
> -I/usr/local/services/tjop_zookeeper/include/zoo
> keeper"
> S["LDFLAGS"]="-Lyes/lib -Lyes/lib -Lyes/lib -Lyes/lib  -Lyes/lib
> -L/usr/local/services/tjop_protobuf/lib -L/usr/local/services/tjop_
> protobuf/lib -L/usr/local/serv"\
> "ices/tjop_zookeeper/lib"
>
> but the yes/include and yes/lib paths do not exist
>
>
> -- Original Message --
> *From:* "Benjamin Mahler";
> *Sent:* Monday, January 23, 2017, 6:16 AM
> *To:* "user";
> *Cc:* "dev";
> *Subject:* Re: Welcome Neil Conway as Mesos Committer and PMC member!
>
>
-- 
Best Regards,
Haosdent Huang

Re: Welcome Neil Conway as Mesos Committer and PMC member!

2017-01-22 Thread haosdent

Congrats Neil !!

On Sun, Jan 22, 2017 at 3:29 AM, Gabriel Hartmann 
wrote:

> Congrats Neil.
>
> On Sat, Jan 21, 2017 at 7:08 AM Deepak Vij (A) 
> wrote:
>
>> Congrats Neil.
>>
>> Deepak Vij
>>
>> Sent from HUAWEI AnyOffice
>> From:Vinod Kone
>> To:dev,user
>> Date:2017-01-20 23:04:30
>> Subject:Welcome Neil Conway as Mesos Committer and PMC member!
>>
>> Hi folks,
>>
>> Please welcome Neil Conway as the newest committer and PMC member of the
>> Apache Mesos project.
>>
>> Neil has been an active contributor to Mesos for more than a year now. As
>> part of his work, he has contributed some major features (Partition aware
>> frameworks, floating point operations for resources). Neil also took the
>> initiative to improve the documentation of our project and shepherded
>> several improvements over time. Doing that even without being a committer,
>> shows that he takes ownership of the project seriously.
>>
>> Here is his more formal checklist for your perusal.
>>
>> https://docs.google.com/document/d/137MYwxEw9QCZRH09CXfn1544p1LuMuoj9LxS-sk2_F4/edit
>>
>> Thanks,
>> Vinod
>>
>


-- 
Best Regards,
Haosdent Huang

Re: Default executor grace period

2017-01-16 Thread haosdent

It looks like the default executor does not yet handle
`--executor_shutdown_grace_period`.
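
For reference, the behaviour one would expect from a grace period can be
sketched as follows. This is a generic Python illustration of "SIGTERM, wait,
then SIGKILL", not the Mesos executor code; `stop_with_grace` is a
hypothetical helper, and it signals only the direct child rather than a whole
process tree.

```python
import signal
import subprocess
import sys


def stop_with_grace(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Send SIGTERM, wait up to grace_seconds, then escalate to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # A child that exits on SIGTERM; a command that ignores SIGTERM would
    # instead be killed once the grace period has elapsed.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(stop_with_grace(child, grace_seconds=2.0))
```

The escalation step is the part that appears to be missing in the scenario
described below: once `sh` has exited, nobody is left to send the final
SIGKILL.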

On Mon, Jan 16, 2017 at 7:41 PM, Tomek Janiszewski 
wrote:

> Hi
>
> I tried to use the grace period with the default Mesos executor. I assumed it
> works as follows:
>
>    1. Start command: sh -c "command ..."
>    2. Send SIGSTOP to process tree: sh, command
>    3. Send SIGTERM to process tree: sh, command
>    4. Wait for processes to finish or grace period to elapse
>    5. sh finishes while command could still be running and attached to init
>    6. Send SIGKILL to process tree: command
>
> I noticed that SIGKILL is not sent and the executor finishes when sh returns.
> When Mesos is running with the POSIX containerizer, this leaves the command
> alive forever (if it ignores SIGTERM). When a containerizer is used, the command
> is killed when its container is destroyed.
>
> Is this desired behavior? How to use grace period with default executor?
>
> Thanks
> Tomek
>



-- 
Best Regards,
Haosdent Huang

Re: Question on dynamic reservations

2017-01-16 Thread haosdent

Hi @Povilas, it is possible to dynamically reserve the unreserved resources on
those agents.
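
As a rough illustration of the label-based matching described in the question
below, a framework could look for the offer whose reservation carries the
failed task's ID. The sketch is plain Python with a made-up, simplified offer
shape (`agent`, `reservation_labels`); it does not use the real Mesos offer
protobufs, and `offer_for_task` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional

# Hypothetical, simplified offer: only the fields this sketch needs.
Offer = Dict[str, Any]


def offer_for_task(offers: List[Offer], task_id: str) -> Optional[Offer]:
    """Return the first offer whose reservation label names task_id, if any."""
    for offer in offers:
        labels = offer.get("reservation_labels", {})
        if labels.get("task_id") == task_id:
            return offer
    return None


offers = [
    {"agent": "agent-1", "reservation_labels": {"task_id": "db-0"}},
    {"agent": "agent-2", "reservation_labels": {}},  # nothing reserved for a task
]
match = offer_for_task(offers, "db-0")
print(match["agent"] if match else None)  # agent-1
```

New tasks would then be placed only on offers without a `task_id` label, so
they cannot take the slot a failed task is about to reuse.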

On Fri, Jan 13, 2017 at 2:47 PM, Povilas Versockas 
wrote:

> Hi,
>
> Maybe someone can help me with a problem I'm having. Short version of the
> question is:
> Is it possible to use dynamic reservation on statically reserved Mesos
> agents?
>
> The current situation is that we have Mesos cluster which runs many
> frameworks (aurora, spark, cassandra) and we are developing a custom
> framework for stateful tasks. Our framework manages stateful tasks for many
> users. Currently we statically reserved our hardware which has good disks
> only to be used by our framework (via --resources flag on Mesos Agents).
>
> The problem we are facing is that if one stateful task fails we would like
> to relaunch it on the same host with the same port, cpu, disk and memory.
> With dynamic reservations we would put a label with task id on a
> reservation and on failure would just simply reuse the reserved offer.
> On the other hand with statically reserved Mesos agents we cannot put any
> labels and so we cannot distinguish offers which should have been reserved
> for a task and a new offer.
> This leaves us in the situation that if one stateful task fails and there
> are new stateful tasks, the new tasks can be scheduled on failed task's
> Mesos agent, filling it up and taking it's port, cpu and memory.
>
>
> --
> Regards
> Povilas Versockas
>



-- 
Best Regards,
Haosdent Huang

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-16 Thread haosdent

As the log shows, it failed when running the command below to look up the
container status.
```
docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
```

Have you mounted the Docker socket file from the host into your agent container?
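
A quick way to check this from inside the agent container is to test whether
the path exists and really is a Unix socket. The snippet below is a small
sketch using only the Python standard library; `is_unix_socket` is a helper
name I made up, and the path is the one from the log above.

```python
import os
import stat


def is_unix_socket(path: str) -> bool:
    """Return True if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False  # missing path, permission problem, etc.


if __name__ == "__main__":
    print(is_unix_socket("/var/run/docker.sock"))
```

A `False` here would mean the docker CLI inside the container has no daemon to
talk to at that path.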

On Fri, Jan 13, 2017 at 8:20 PM, Giulio Eulisse 
wrote:

>
> Actually, no. The docker containers seem to be running just fine. Looks
> like mesos is not able to notice that. Did anything change in the way mesos
> looks up for them? Notice I've both renamed my container to "agent" and
> added MESOS_DOCKER_KILL_ORPHANS=false.
>
>
>
> On 13 Jan 2017, 02:14 +0100, haosdent , wrote:
>
> Is it caused by your riemann-elasticsearch container failing to start
> successfully?
>
> On Fri, Jan 13, 2017 at 9:10 AM, Giulio Eulisse 
> wrote:
>
>> MMm... it improved things, but now I get a bunch of:
>>
>> ```
>> W0113 01:06:24.757287 17811 slave.cpp:5220] Failed to get resource statistics for executor 'riemann-elasticsearch.7fc1bc0b-d92c-11e6-9367-02426821a225' of framework 20150626-112246-2475462272-5050-5-: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c': exited with status 1; stderr='Error: No such image, container or task: mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
>> ```
>>
>> and then leaves out a bunch of running containers.
>>
>> On 13 Jan 2017, 01:51 +0100, Joseph Wu , wrote:
>>
>> If Apache JIRA were up, I'd point you to a JIRA noting the problem with
>> naming docker containers `mesos-*`, as Mesos reserves that prefix (and
>> kills everything it considers "unknown").
>>
>> As a quick workaround, try setting this flag to false:
>> https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
>>
>> On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse wrote:
>>
>>> MMm... it seems to die after a long sequence of forks, and mesos itself
>>> seems to be issuing the sigkill. I wonder if it's trying to do some cleanup
>>> and it does not realise one of the containers is the agent itself??? Notice
>>> I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
>>>
>>> On 13 Jan 2017, 01:23 +0100, Giulio Eulisse ,
>>> wrote:
>>>
>>> Ciao,
>>>
>>> the only thing I could find is by running a parallel `docker events`
>>>
>>> ```
>>> 2017-01-13T01:18:20.766593692+01:00 network connect
>>> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
>>> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
>>> name=host, type=host)
>>> 2017-01-13T01:18:20.846137793+01:00 container start
>>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>>> name=mesos-slave, vendor=CentOS)
>>> 2017-01-13T01:18:20.847965921+01:00 container resize
>>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>>> (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1,
>>> license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
>>> 2017-01-13T01:18:21.610141857+01:00 container kill
>>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>>> name=mesos-slave, signal=15, vendor=CentOS)
>>> 2017-01-13T01:18:21.610491564+01:00 container kill
>>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>>> name=mesos-slave, signal=9, vendor=CentOS)
>>> 2017-01-13T01:18:21.646229213+01:00 container die
>>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>>> (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1,
>>> license=GPLv2, name=mesos-slave, vendor=CentOS)
>>> 2017-01-13T01:18:21.652894124+01:00 network disconnect
>>> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
>>> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
>>> name=host, type=host)
>>> 2017-01-13T01:18:21.705874041+01:00 container stop
>>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>>> name=mesos-slave

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread haosdent

Is it caused by your riemann-elasticsearch container failing to start
successfully?

On Fri, Jan 13, 2017 at 9:10 AM, Giulio Eulisse 
wrote:

> MMm... it improved things, but now I get a bunch of:
>
> ```
> W0113 01:06:24.757287 17811 slave.cpp:5220] Failed to get resource statistics for executor 'riemann-elasticsearch.7fc1bc0b-d92c-11e6-9367-02426821a225' of framework 20150626-112246-2475462272-5050-5-: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c': exited with status 1; stderr='Error: No such image, container or task: mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
> ```
>
> and then leaves out a bunch of running containers.
>
> On 13 Jan 2017, 01:51 +0100, Joseph Wu , wrote:
>
> If Apache JIRA were up, I'd point you to a JIRA noting the problem with
> naming docker containers `mesos-*`, as Mesos reserves that prefix (and
> kills everything it considers "unknown").
>
> As a quick workaround, try setting this flag to false:
> https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
>
> On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse 
> wrote:
>
>> MMm... it seems to die after a long sequence of forks, and mesos itself
>> seems to be issuing the sigkill. I wonder if it's trying to do some cleanup
>> and it does not realise one of the containers is the agent itself??? Notice
>> I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
>>
>> On 13 Jan 2017, 01:23 +0100, Giulio Eulisse ,
>> wrote:
>>
>> Ciao,
>>
>> the only thing I could find is by running a parallel `docker events`
>>
>> ```
>> 2017-01-13T01:18:20.766593692+01:00 network connect
>> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
>> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
>> name=host, type=host)
>> 2017-01-13T01:18:20.846137793+01:00 container start
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, vendor=CentOS)
>> 2017-01-13T01:18:20.847965921+01:00 container resize
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1,
>> license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
>> 2017-01-13T01:18:21.610141857+01:00 container kill
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, signal=15, vendor=CentOS)
>> 2017-01-13T01:18:21.610491564+01:00 container kill
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, signal=9, vendor=CentOS)
>> 2017-01-13T01:18:21.646229213+01:00 container die
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1,
>> license=GPLv2, name=mesos-slave, vendor=CentOS)
>> 2017-01-13T01:18:21.652894124+01:00 network disconnect
>> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
>> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
>> name=host, type=host)
>> 2017-01-13T01:18:21.705874041+01:00 container stop
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, vendor=CentOS)
>> ```
>>
>> Ciao,
>> Giulio
>>
>> On 13 Jan 2017, 01:06 +0100, haosdent , wrote:
>>
>> Hi @Giulio, according to your log, it looks normal. Do you have any logs
>> related to "SIGKILL"?
>>
>> On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse wrote:
>>
>>> Hi,
>>>
>>> I’ve a setup where I run mesos in docker which works perfectly when I
>>> use 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and
>>> 1.0.0) and it seems to receive a sigkill right after saying:
>>>
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by 
>>> centos
>>> I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
>>> I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
>>> I0112 23:22:09.889188  4934 main.cpp:251] Git SHA:

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread haosdent

Yep, it was fixed in 1.1.0:
https://www.mail-archive.com/issues@mesos.apache.org/msg33959.html

On Fri, Jan 13, 2017 at 8:51 AM, Joseph Wu  wrote:

> If Apache JIRA were up, I'd point you to a JIRA noting the problem with
> naming docker containers `mesos-*`, as Mesos reserves that prefix (and
> kills everything it considers "unknown").
>
> As a quick workaround, try setting this flag to false:
> https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
>
> On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse 
> wrote:
>
>> MMm... it seems to die after a long sequence of forks, and mesos itself
>> seems to be issuing the sigkill. I wonder if it's trying to do some cleanup
>> and it does not realise one of the containers is the agent itself??? Notice
>> I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
>>
>> On 13 Jan 2017, 01:23 +0100, Giulio Eulisse ,
>> wrote:
>>
>> Ciao,
>>
>> the only thing I could find is by running a parallel `docker events`
>>
>> ```
>> 2017-01-13T01:18:20.766593692+01:00 network connect
>> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
>> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
>> name=host, type=host)
>> 2017-01-13T01:18:20.846137793+01:00 container start
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, vendor=CentOS)
>> 2017-01-13T01:18:20.847965921+01:00 container resize
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1,
>> license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
>> 2017-01-13T01:18:21.610141857+01:00 container kill
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, signal=15, vendor=CentOS)
>> 2017-01-13T01:18:21.610491564+01:00 container kill
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, signal=9, vendor=CentOS)
>> 2017-01-13T01:18:21.646229213+01:00 container die
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1,
>> license=GPLv2, name=mesos-slave, vendor=CentOS)
>> 2017-01-13T01:18:21.652894124+01:00 network disconnect
>> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
>> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
>> name=host, type=host)
>> 2017-01-13T01:18:21.705874041+01:00 container stop
>> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
>> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
>> name=mesos-slave, vendor=CentOS)
>> ```
>>
>> Ciao,
>> Giulio
>>
>> On 13 Jan 2017, 01:06 +0100, haosdent , wrote:
>>
>> Hi @Giulio, according to your log, it looks normal. Do you have any logs
>> related to "SIGKILL"?
>>
>> On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse wrote:
>>
>>> Hi,
>>>
>>> I’ve a setup where I run mesos in docker which works perfectly when I
>>> use 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and
>>> 1.0.0) and it seems to receive a sigkill right after saying:
>>>
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by 
>>> centos
>>> I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
>>> I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
>>> I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 
>>> 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
>>> W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will be 
>>> downgraded to a non-SSL socket
>>> W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will be 
>>> downgraded to a non-SSL socket
>>> E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' 
>>> failed; this is the output:
>>> sh: hadoop: command not found
>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client 
>>> environment:zookeeper.version=zookeeper C client 3.4.8
>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client 
>>> environment

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread haosdent

Hi, what is the docker command you use to start the agents? As I remember,
Mesos tries to recover containers whose names start with mesos-slave and kills
them if it cannot recover them successfully.
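
To show why the agent's own container can end up on the kill list, here is a
toy Python sketch of a prefix-based cleanup. It is not the Mesos
implementation: `unknown_mesos_containers` is a hypothetical helper, the
container names are invented, and the `mesos-` prefix is the one Joseph Wu
mentions elsewhere in this thread.

```python
from typing import Iterable, List, Set

# Prefix described elsewhere in this thread as reserved by Mesos (assumption).
MESOS_PREFIX = "mesos-"


def unknown_mesos_containers(names: Iterable[str], known: Set[str]) -> List[str]:
    """Names that carry the reserved prefix but are not known to the agent."""
    return [n for n in names if n.startswith(MESOS_PREFIX) and n not in known]


running = ["mesos-slave", "mesos-498ff8de.a242fc24", "riemann"]
print(unknown_mesos_containers(running, known={"mesos-498ff8de.a242fc24"}))
# ['mesos-slave']
```

An agent container named `mesos-slave` matches the prefix but is not a task
container the agent launched, so a cleanup like this would treat it as
unknown.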

On Jan 13, 2017 8:43 AM, "Giulio Eulisse"  wrote:

MMm... it seems to die after a long sequence of forks, and mesos itself
seems to be issuing the sigkill. I wonder if it's trying to do some cleanup
and it does not realise one of the containers is the agent itself??? Notice
I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.

On 13 Jan 2017, 01:23 +0100, Giulio Eulisse ,
wrote:

Ciao,

the only thing I could find is by running a parallel `docker events`

```
2017-01-13T01:18:20.766593692+01:00 network connect
32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
name=host, type=host)
2017-01-13T01:18:20.846137793+01:00 container start
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
name=mesos-slave, vendor=CentOS)
2017-01-13T01:18:20.847965921+01:00 container resize
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
(build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1,
license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
2017-01-13T01:18:21.610141857+01:00 container kill
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
name=mesos-slave, signal=15, vendor=CentOS)
2017-01-13T01:18:21.610491564+01:00 container kill
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
name=mesos-slave, signal=9, vendor=CentOS)
2017-01-13T01:18:21.646229213+01:00 container die
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
(build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1,
license=GPLv2, name=mesos-slave, vendor=CentOS)
2017-01-13T01:18:21.652894124+01:00 network disconnect
32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
name=host, type=host)
2017-01-13T01:18:21.705874041+01:00 container stop
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
name=mesos-slave, vendor=CentOS)
```

Ciao,
Giulio

On 13 Jan 2017, 01:06 +0100, haosdent , wrote:

Hi @Giulio, according to your log, everything looks normal. Do you have any
logs related to "SIGKILL"?

On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse 
wrote:

> Hi,
>
> I’ve a setup where I run mesos in docker which works perfectly when I use
> 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0)
> and it seems to receive a sigkill right after saying:
>
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos
> I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
> I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
> I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 
> 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will be 
> downgraded to a non-SSL socket
> W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will be 
> downgraded to a non-SSL socket
> E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: hadoop: command not found
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client 
> environment:host.name=.XXX.ch
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client 
> environment:os.name=Linux
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client 
> environment:os.arch=3.10.0-229.14.1.el7.x86_64
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client 
> environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client 
> environment:user.name=(null)
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client 
> environment:user.home=/root
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client 
> environment:user.dir=/
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: 
> Initiating client connection, 
> host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=1 
> watcher=0x7f950ee20300 sessionId=0 sessionPasswd= context=0x
> 7f94fc60 flag

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread haosdent

Hi @Giulio, according to your log, everything looks normal. Do you have any
logs related to "SIGKILL"?

On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse 
wrote:

> Hi,
>
> I’ve a setup where I run mesos in docker which works perfectly when I use
> 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0)
> and it seems to receive a sigkill right after saying:
>
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos
> I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
> I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
> I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 
> 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will be 
> downgraded to a non-SSL socket
> W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will be 
> downgraded to a non-SSL socket
> E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: hadoop: command not found
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client 
> environment:host.name=.XXX.ch
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client 
> environment:os.name=Linux
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client 
> environment:os.arch=3.10.0-229.14.1.el7.x86_64
> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client 
> environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client 
> environment:user.name=(null)
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client 
> environment:user.home=/root
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client 
> environment:user.dir=/
> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: 
> Initiating client connection, 
> host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=1 
> watcher=0x7f950ee20300 sessionId=0 sessionPasswd= context=0x
> 7f94fc60 flags=0
> 2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: 
> initiated connection to server [XX.YY.ZZ.WW:2181]
> 2017-01-12 23:22:10,146:4934(0x7f9501fd7700):ZOO_INFO@check_events@1775: 
> session establishment complete on server [XX.YY.ZZ.WW:2181
> ], sessionId=0x35828ae70fb2065, negotiated timeout=1
>
> Any idea of what might be going on? Looks like an OOM, but I do not see it
> in /var/log/messages and it also happens with --oom-kill-disable.
>
> --
> Ciao,
> Giulio
>



-- 
Best Regards,
Haosdent Huang

Re: Error running Application in Marathon

2017-01-11 Thread haosdent

Have you tried using `/usr/local/sbin/mesos-agent` instead of
`./bin/mesos-agent.sh`?

On Wed, Jan 11, 2017 at 6:20 PM, Joaquin Alzola 
wrote:

> Hi
>
>
>
> I compiled mesos from source and then did ‘make install’
>
> So my libmesos-1.1.0.so is under /usr/local/lib
>
>
>
> I launch the mesos agents with the root user with the following command
> (it is a standalone node).
>
>
>
> # ./bin/mesos-master.sh --ip=192.168.1.69 --work_dir=/var/lib/mesos
> --log_dir=/opt/mesos/logs --log_level=INFO --quiet
>
> # ./bin/mesos-agent.sh --master=192.168.1.69:5050
> --work_dir=/var/lib/mesos --log_dir=/opt/mesos/logs --log_level=INFO --quiet
>
>
>
> Marathon I launch it via:
>
> #export MESOS_NATIVE_JAVA_LIBRARY=/usr/loca/lib/libmesos-1.1.0.so
>
> #export MESOS_WORK_DIR=/tmp/mesos/local
>
> # ./bin/start --master local
>
>
>
> BR
>
>
>
> Joaquin
>
>
>
> *From:* haosdent [mailto:haosd...@gmail.com]
> *Sent:* 11 January 2017 10:13
> *To:* user 
> *Subject:* Re: Error running Application in Marathon
>
>
>
> Hi @joaquin, how do you launch the Mesos agents? It looks like the library
> search path of your agents is incorrect.
>
>
>
> On Wed, Jan 11, 2017 at 5:44 PM, Joaquin Alzola 
> wrote:
>
> Hi Guys
>
>
>
> I have the following error running an application on Marathon.
>
>
>
> I0110 22:16:29.048617 31888 containerizer.cpp:1489] Checkpointing
> container's forked pid 32028 to '/tmp/mesos/local/0/meta/
> slaves/2060e189-a0c7-42b6-aa07-95828d2065a4-S0/
> frameworks/2f3fb1d0-d990-4013-8f23-326de597e4e2-/
> executors/basic-joaquin.6f0026c2-d782-11e6-923d-
> ea09eed2770d/runs/dcd58aed-0717-4078-9919-b3993d088ef5/pids/forked.pid'
>
> mesos-containerizer: error while loading shared libraries:
> libmesos-1.1.0.so: cannot open shared object file: No such file or
> directory
>
> I0110 22:16:29.133458 31890 containerizer.cpp:2313] Container
> dcd58aed-0717-4078-9919-b3993d088ef5 has exited
>
>
>
> Running the test framework for Java and Python works perfectly, but it
> fails for C++.
>
>
>
> BR
>
>
>
> Joaquin
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>
>
>
>
>
> --
>
> Best Regards,
>
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Error running Application in Marathon

2017-01-11 Thread haosdent

Hi @joaquin, how do you launch the Mesos agents? It looks like the library
search path of your agents is incorrect.
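
One way to check the search path is to look for the library in the directories the dynamic loader is likely to consult. The following is a minimal sketch, assuming the library name from the error message (`libmesos-1.1.0.so`) and the `/usr/local/lib` install location mentioned in this thread. The helper `find_shared_library` is hypothetical, and it only approximates the loader: it ignores the `ldconfig` cache and any rpath.

```python
import os


def find_shared_library(name, extra_dirs=()):
    """Return the directories (in search order) that contain `name`.

    LD_LIBRARY_PATH entries are checked first, then `extra_dirs`.
    This is an approximation of the loader, not a replacement for it.
    """
    env_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    hits = []
    for directory in list(env_dirs) + list(extra_dirs):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and directory not in hits:
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Directories below are assumptions; adjust them to your install prefix.
    found = find_shared_library("libmesos-1.1.0.so", ["/usr/local/lib", "/usr/lib"])
    if found:
        print("found in:", ", ".join(found))
    else:
        print("not found in LD_LIBRARY_PATH or the given directories")
```

If the library only shows up in a directory that is not part of `LD_LIBRARY_PATH`, that would be consistent with the "cannot open shared object file" error from `mesos-containerizer`.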

On Wed, Jan 11, 2017 at 5:44 PM, Joaquin Alzola 
wrote:

> Hi Guys
>
>
>
> I have the following error running an application on Marathon.
>
>
>
> I0110 22:16:29.048617 31888 containerizer.cpp:1489] Checkpointing
> container's forked pid 32028 to '/tmp/mesos/local/0/meta/
> slaves/2060e189-a0c7-42b6-aa07-95828d2065a4-S0/
> frameworks/2f3fb1d0-d990-4013-8f23-326de597e4e2-/
> executors/basic-joaquin.6f0026c2-d782-11e6-923d-
> ea09eed2770d/runs/dcd58aed-0717-4078-9919-b3993d088ef5/pids/forked.pid'
>
> mesos-containerizer: error while loading shared libraries:
> libmesos-1.1.0.so: cannot open shared object file: No such file or
> directory
>
> I0110 22:16:29.133458 31890 containerizer.cpp:2313] Container
> dcd58aed-0717-4078-9919-b3993d088ef5 has exited
>
>
>
> Running the test framework for Java and Python works perfectly, but it
> fails for C++.
>
>
>
> BR
>
>
>
> Joaquin
>



-- 
Best Regards,
Haosdent Huang

Re: cron-like scheduling in mesos framework?

2017-01-06 Thread haosdent

You may try https://github.com/mesos/chronos or
https://github.com/farmapromlab/rundeck-mesos-plugin
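
For Chronos, schedules are written as ISO 8601 repeating intervals (`R<count>/<start>/<period>`). Below is a small sketch of building such a string; the helper name `repeating_schedule` and the job values are made up for illustration, and the job fields shown (`name`, `command`, `schedule`) are only a subset, so please check the Chronos documentation for the full job format.

```python
from datetime import datetime, timezone


def repeating_schedule(start, period, repetitions=None):
    """Build an ISO 8601 repeating interval such as R/2017-01-07T00:40:00Z/P1D.

    `start` should be a timezone-aware datetime; `period` is an ISO 8601
    duration string (for example "P1D" or "PT2H"). Leaving `repetitions`
    as None means "repeat forever".
    """
    count = "" if repetitions is None else str(repetitions)
    stamp = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "R%s/%s/%s" % (count, stamp, period)


if __name__ == "__main__":
    job = {
        "name": "nightly-report",      # hypothetical job name
        "command": "echo hello",       # hypothetical command
        "schedule": repeating_schedule(
            datetime(2017, 1, 7, 0, 40, tzinfo=timezone.utc), "P1D"
        ),
    }
    print(job)
```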

On Sat, Jan 7, 2017 at 12:40 AM, l vic  wrote:

> Hi,
> Is there a way to schedule a Mesos framework task for execution at a certain
> day/time?
> Thank you,
> -V
>



-- 
Best Regards,
Haosdent Huang

Re: compile failed on mesos upstream master branch

2016-12-30 Thread haosdent

Usually trying again resolves the problem. It is a gcc bug.

On Fri, Dec 30, 2016 at 6:25 PM, tommy xiao  wrote:

> ```
>
> `test -f 'tests/container_logger_tests.cpp' || echo
> '/mesos/src/'`tests/container_logger_tests.cpp
>
> mv -f examples/.deps/disk_full_framework-disk_full_framework.Tpo
> examples/.deps/disk_full_framework-disk_full_framework.Po
>
> g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\"
> -DPACKAGE_VERSION=\"1.2.0\" -DPACKAGE_STRING=\"mesos\ 1.2.0\"
> -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\"
> -DVERSION=\"1.2.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1
> -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1
> -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1
> -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1
> -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -DHAVE_FTS_H=1
> -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_LIBCURL=1 -DHAVE_LIBSASL2=1
> -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_SVN_DELTA_H=1
> -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_LIBZ=1 -I. -I/mesos/src   -Werror
> -DLIBDIR=\"/usr/local/lib\" -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\"
> -DPKGDATADIR=\"/usr/local/share/mesos\" 
> -DPKGMODULEDIR=\"/usr/local/lib/mesos/modules\"
> -I/mesos/include -I../include -I../include/mesos -DPICOJSON_USE_INT64
> -D__STDC_FORMAT_MACROS -isystem ../3rdparty/boost-1.53.0
> -I../3rdparty/elfio-3.2 -I../3rdparty/glog-0.3.3/src
>
> -I../3rdparty/leveldb-1.4/include -I/mesos/3rdparty/libprocess/include
> -I../3rdparty/nvml-352.79 -I../3rdparty/picojson-1.3.0
> -I../3rdparty/protobuf-2.6.1/src -I/mesos/3rdparty/stout/include
> -I../3rdparty/zookeeper-3.4.8/src/c/include 
> -I../3rdparty/zookeeper-3.4.8/src/c/generated
> -DHAS_AUTHENTICATION=1 -DSOURCE_DIR=\"/mesos\" 
> -DBUILD_DIR=\"/home/vagrant/build\"
> -I../3rdparty/gmock-1.7.0/gtest/include -isystem 
> ../3rdparty/gmock-1.7.0/include
> -DTESTLIBEXECDIR=\"/usr/local/libexec/mesos/tests\"
> -DSBINDIR=\"/usr/local/sbin\"  -I/usr/include/subversion-1
> -I/usr/include/apr-1 -I/usr/include/apr-1.0  -pthread -Wall -Wsign-compare
> -Wformat-security -fstack-protector-strong -fPIC -fPIE -g1 -O0
> -Wno-unused-local-typedefs -std=c++11 -MT tests/mesos_tests-containerizer.o
> -MD -MP -MF tests/.deps/mesos_tests-containerizer.Tpo -c -o
> tests/mesos_tests-containerizer.o `test -f 'tests/containerizer.cpp' ||
> echo '/mesos/src/'`tests/containerizer.cpp
>
> g++: internal compiler error: Killed (program cc1plus)
>
> Please submit a full bug report,
>
> with preprocessed source if appropriate.
>
> See <http://bugzilla.redhat.com/bugzilla> for instructions.
>
> make[3]: *** [tests/mesos_tests-api_tests.o] Error 4
>
> make[3]: *** Waiting for unfinished jobs....
>
> g++: internal compiler error: Killed (program cc1plus)
>
> Please submit a full bug report,
>
> with preprocessed source if appropriate.
>
> See <http://bugzilla.redhat.com/bugzilla> for instructions.
>
> make[3]: *** [tests/mesos_tests-container_logger_tests.o] Error 4
>
> mv -f tests/.deps/mesos_tests-containerizer.Tpo tests/.deps/mesos_tests-
> containerizer.Po
>
> mv -f tests/.deps/mesos_tests-command_executor_tests.Tpo
> tests/.deps/mesos_tests-command_executor_tests.Po
>
> mv -f tests/.deps/mesos_tests-cluster.Tpo tests/.deps/mesos_tests-
> cluster.Po
>
> mv -f tests/.deps/mesos_tests-authorization_tests.Tpo
> tests/.deps/mesos_tests-authorization_tests.Po
>
> mv -f tests/.deps/mesos_tests-anonymous_tests.Tpo tests/.deps/mesos_tests-
> anonymous_tests.Po
>
> mv -f tests/.deps/mesos_tests-authentication_tests.Tpo
> tests/.deps/mesos_tests-authentication_tests.Po
>
> make[3]: Leaving directory '/home/vagrant/build/src'
>
> make[2]: *** [check-am] Error 2
>
> make[2]: Leaving directory '/home/vagrant/build/src'
>
> make[1]: *** [check] Error 2
>
> make[1]: Leaving directory '/home/vagrant/build/src'
>
> make: *** [check-recursive] Error 1
>
> ```
>
> 2016-12-30 14:22 GMT+08:00 tommy xiao :
>
>> hi haosdent,
>>
>> I removed the mesos checkout and ran git clone again; it works like a charm.
>>
>> 2016-12-30 13:03 GMT+08:00 tommy xiao :
>>
>>> The build still always reports a missing file:
>>>
>>> gcc: error: ../../3rdparty/http-parser-2.6.2/http_parser.c: No such
>>> file or directory
>>>
>>> gcc: fatal error: no input files
>>>
>>>
>>> I am using the environment below:
>>>
>>> https://github.com/xiaods/mesos-vagrant-env
>>>
>>>
>>>
>>> 2016-12-30 10:33 GMT+08:00 haosdent :
>>>
>>

Re: compile failed on mesos upstream master branch

2016-12-29 Thread haosdent

Have you tried the following?

$ ./bootstrap
$ mkdir build
$ cd build
$ ../configure
$ make



On Fri, Dec 30, 2016 at 6:19 AM, tommy xiao  wrote:

> Today I refreshed the code base on my Vagrant development branch and ran the
> build, but make failed. Can anyone give me a hint about the failing file?
>
> ```shell
>
> make[3]: Entering directory '/home/vagrant/build/3rdparty'
>
> /bin/sh ../libtool  --tag=CC   --mode=compile gcc -DPACKAGE_NAME=\"mesos\"
> -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"1.2.0\"
> -DPACKAGE_STRING=\"mesos\ 1.2.0\" -DPACKAGE_BUGREPORT=\"\"
> -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"1.2.0\" -DSTDC_HEADERS=1
> -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1
> -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1
> -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\"
> -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1
> -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1
> -DHAVE_LIBCURL=1 -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1
> -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1
> -DHAVE_LIBZ=1 -I. -I/mesos/3rdparty  -Ihttp-parser-2.6.2
> -DHTTP_PARSER_STRICT=0 -I/usr/include/subversion-1 -I/usr/include/apr-1
> -I/usr/include/apr-1.0   -g1 -O0 -Wno-unused-local-typedefs -MT
> libry_http_parser_la-http_parser.lo -MD -MP -MF
> .deps/libry_http_parser_la-http_parser.Tpo -c -o
> libry_http_parser_la-http_parser.lo `test -f 'http-parser-2.6.2/http_parser.c'
> || echo '/mesos/3rdparty/'`http-parser-2.6.2/http_parser.c
>
> libtool: compile:  gcc -DPACKAGE_NAME=\"mesos\"
> -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"1.2.0\"
> "-DPACKAGE_STRING=\"mesos 1.2.0\"" -DPACKAGE_BUGREPORT=\"\"
> -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"1.2.0\" -DSTDC_HEADERS=1
> -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1
> -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1
> -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\"
> -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1
> -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_LIBCURL=1
> -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1
> -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_LIBZ=1 -I.
> -I/mesos/3rdparty -Ihttp-parser-2.6.2 -DHTTP_PARSER_STRICT=0
> -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 -g1
> -O0 -Wno-unused-local-typedefs -MT libry_http_parser_la-http_parser.lo
> -MD -MP -MF .deps/libry_http_parser_la-http_parser.Tpo -c
> /mesos/3rdparty/http-parser-2.6.2/http_parser.c  -fPIC -DPIC -o
> .libs/libry_http_parser_la-http_parser.o
>
> gcc: error: /mesos/3rdparty/http-parser-2.6.2/http_parser.c: No such file
> or directory
>
> gcc: fatal error: no input files
>
> compilation terminated.
>
> ```
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>



-- 
Best Regards,
Haosdent Huang

Re: Too many files

2016-12-29 Thread haosdent

No. It means the executor/task process itself can open at most 4000 files,
not counting its child processes.
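
To illustrate the point, the open-file limit (`RLIMIT_NOFILE`) is a per-process limit: a child inherits the same limit value, but its open descriptors are counted separately from the parent's. Here is a rough sketch using Python's standard `resource` module; the helper names are made up for illustration.

```python
import resource
import subprocess
import sys


def nofile_limit():
    """Return the (soft, hard) RLIMIT_NOFILE of the calling process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def child_nofile_soft_limit():
    """Spawn a child interpreter and return the soft limit it reports."""
    code = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    out = subprocess.check_output([sys.executable, "-c", code])
    return int(out.decode().strip())


if __name__ == "__main__":
    soft, hard = nofile_limit()
    print("parent soft/hard:", soft, hard)
    # The child starts with the same limit value, applied to its own descriptors.
    print("child soft:", child_nofile_soft_limit())
```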

On Fri, Dec 30, 2016 at 2:32 AM, Kiril Menshikov 
wrote:

> It’s 4000. Does this mean that sum of all child open files should not
> exceed this value? Because I see that task limit is also 4000.
>
>
> On Dec 29, 2016, at 19:48, haosdent  wrote:
>
> Hi @Kiril, it sounds like your executor is reaching its maximum open file
> limit. Could you show the output of
>
> $ cat /proc/${your_executor_pid}/limits
>
>
>
> On Fri, Dec 30, 2016 at 1:07 AM, Kiril Menshikov 
> wrote:
>
>> Hi,
>>
>> I have an executor which runs Java programs. One executor executes around
>> 800 tasks. The last ~100 failed with "Too many open files". I increased the
>> nofile and nproc limits. While debugging, I could not confirm that the
>> problem is in the tasks, but sometimes Linux reaches the limits. Some boxes
>> run without 'Too many open files' errors, but others have them.
>>
>> I run executor through mesos-containerizer and isolation is posix/cpu,
>> posix/mem.
>>
>> Can someone explain why this happens? Is it better to create a separate
>> executor for each task? The tasks share common code but have different
>> commands.
>>
>> Any help is welcome.
>>
>> mesos-containerizer launch --command={"shell":true,"value":"java -cp
>> executor-all-1.0.jar com.stone.mesos.MesosTestExecutor"} --environment={"
>> LIBPROCESS_IP":"10.10.10.10","LIBPROCESS_PORT":"0","MESOS
>> _AGENT_ENDPOINT":"10.10.10.10:5051","MESOS_CHECKPOINT":"1","MESOS_DIRECTO
>> RY":"\/var\/lib\/mesos\/slaves\/7e30a916-1296-4f47-813
>> a-0972030b6907-S14\/frameworks\/7e30a916-1296-4f47-813a-
>> 0972030b6907-0020\/executors\/client_b12.tar-0f9e8f80-a217-
>> 4b28-bb5e-4dd7cc587381\/runs\/ee8857e2-ac19-4f07-810d-c2e71fbf522e","
>> MESOS_EXECUTOR_ID":"client_b12.tar-0f9e8f80-
>> a217-4b28-bb5e-4dd7cc587381","MESOS_EXECUTOR_SHUTDOWN_GRACE_
>> PERIOD":"5secs","MESOS_FRAMEWORK_ID":"7e30a916-1296-4f47-
>> 813a-0972030b6907-0020","MESOS_HTTP_COMMAND_EXECUTOR":"0","MESOS
>> _NATIVE_JAVA_LIBRARY":"\/usr\/local\/lib\/libmesos-1.1.0.so","MESOS
>> _NATIVE_LIBRARY":"\/usr\/local\/lib\/libmesos-1.1.0.so","MESOS_RECOVERY_
>> TIMEOUT":"15mins","MESOS_SANDBOX":"\/var\/lib\/mesos\/
>> slaves\/7e30a916-1296-4f47-813a-0972030b6907-S14\/framewo
>> rks\/7e30a916-1296-4f47-813a-0972030b6907-0020\/executors\/
>> client_b12.tar-0f9e8f80-a217-4b28-bb5e-4dd7cc587381\/runs\/
>> ee8857e2-ac19-4f07-810d-c2e71fbf522e","MESOS_SLAVE_ID"
>> :"7e30a916-1296-4f47-813a-0972030b6907-S14","MESOS_SLAVE_PID":"slave(1)@
>> 10.10.10.10:5051","MESOS_SUBSCRIPTION_BACKOFF_MAX":"2secs","PATH":"\/
>> usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin"}
>> --help=false --pipe_read=12 --pipe_write=13 --pre_exec_commands=[] --
>> runtime_directory=/var/run/mesos/containers/ee8857e2-ac19-4f07-810d-c2e71fbf522e
>> --unshare_namespace_mnt=false --user=ec2-user
>> --working_directory=/var/lib/mesos/slaves/7e30a916-1296-4f47
>> -813a-0972030b6907-S14/frameworks/7e30a916-1296-4f47-813a-
>> 0972030b6907-0020/executors/client_b12.tar-0f9e8f80-a217-
>> 4b28-bb5e-4dd7cc587381/runs/ee8857e2-ac19-4f07-810d-c2e71fbf522e
>>
>> Thanks,
>> -Kiril
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Too many files

2016-12-29 Thread haosdent

Hi @Kiril, it sounds like your executor is reaching its maximum open file
limit. Could you show the output of

$ cat /proc/${your_executor_pid}/limits
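
If it helps, the same information can be collected programmatically. The sketch below assumes the usual Linux layout of `/proc/<pid>/limits` (a "Max open files" row with soft and hard columns) and counts the entries under `/proc/<pid>/fd`; both helper names are hypothetical.

```python
import os


def parse_max_open_files(limits_text):
    """Extract (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because they may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Row layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    return None


def open_fd_count(pid):
    """Count the descriptors currently open in process `pid` (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


if __name__ == "__main__":
    pid = os.getpid()  # replace with the executor pid
    limits_path = "/proc/%d/limits" % pid
    if os.path.exists(limits_path):
        with open(limits_path) as handle:
            print("limits:", parse_max_open_files(handle.read()))
        print("open fds:", open_fd_count(pid))
```

Comparing the open descriptor count against the soft limit over time may show whether the executor is leaking descriptors across tasks.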



On Fri, Dec 30, 2016 at 1:07 AM, Kiril Menshikov 
wrote:

> Hi,
>
> I have an executor which runs Java programs. One executor executes around
> 800 tasks. The last ~100 failed with "Too many open files". I increased the
> nofile and nproc limits. While debugging, I could not confirm that the
> problem is in the tasks, but sometimes Linux reaches the limits. Some boxes
> run without 'Too many open files' errors, but others have them.
>
> I run executor through mesos-containerizer and isolation is posix/cpu,
> posix/mem.
>
> Can someone explain why this happens? Is it better to create a separate
> executor for each task? The tasks share common code but have different
> commands.
>
> Any help is welcome.
>
> mesos-containerizer launch --command={"shell":true,"value":"java -cp
> executor-all-1.0.jar com.stone.mesos.MesosTestExecutor"} --environment={"
> LIBPROCESS_IP":"10.10.10.10","LIBPROCESS_PORT":"0","MESOS_AGENT_
> ENDPOINT":"10.10.10.10:5051","MESOS_CHECKPOINT":"1","MESOS_
> DIRECTORY":"\/var\/lib\/mesos\/slaves\/7e30a916-1296-4f47-
> 813a-0972030b6907-S14\/frameworks\/7e30a916-1296-
> 4f47-813a-0972030b6907-0020\/executors\/client_b12.tar-
> 0f9e8f80-a217-4b28-bb5e-4dd7cc587381\/runs\/ee8857e2-
> ac19-4f07-810d-c2e71fbf522e","MESOS_EXECUTOR_ID":"client_
> b12.tar-0f9e8f80-a217-4b28-bb5e-4dd7cc587381","MESOS_
> EXECUTOR_SHUTDOWN_GRACE_PERIOD":"5secs","MESOS_
> FRAMEWORK_ID":"7e30a916-1296-4f47-813a-0972030b6907-0020","MESOS
> _HTTP_COMMAND_EXECUTOR":"0","MESOS_NATIVE_JAVA_LIBRARY"
> :"\/usr\/local\/lib\/libmesos-1.1.0.so","MESOS_NATIVE_
> LIBRARY":"\/usr\/local\/lib\/libmesos-1.1.0.so","MESOS_
> RECOVERY_TIMEOUT":"15mins","MESOS_SANDBOX":"\/var\/lib\/
> mesos\/slaves\/7e30a916-1296-4f47-813a-0972030b6907-S14\/
> frameworks\/7e30a916-1296-4f47-813a-0972030b6907-0020\/
> executors\/client_b12.tar-0f9e8f80-a217-4b28-bb5e-
> 4dd7cc587381\/runs\/ee8857e2-ac19-4f07-810d-c2e71fbf522e","MESOS
> _SLAVE_ID":"7e30a916-1296-4f47-813a-0972030b6907-S14","MESOS_SLAVE_PID
> ":"slave(1)@10.10.10.10:5051","MESOS_SUBSCRIPTION_BACKOFF_MAX":"
> 2secs","PATH":"\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin
> :\/usr\/bin:\/sbin:\/bin"} --help=false --pipe_read=12 --pipe_write=13 --
> pre_exec_commands=[] --runtime_directory=/var/run/mesos
> /containers/ee8857e2-ac19-4f07-810d-c2e71fbf522e --unshare_namespace_mnt=false
> --user=ec2-user --working_directory=/var/lib/mesos/slaves/7e30a916-1296-
> 4f47-813a-0972030b6907-S14/frameworks/7e30a916-1296-4f47-
> 813a-0972030b6907-0020/executors/client_b12.tar-0f9e8f80-a217-4b28-bb5e-
> 4dd7cc587381/runs/ee8857e2-ac19-4f07-810d-c2e71fbf522e
>
> Thanks,
> -Kiril
>



-- 
Best Regards,
Haosdent Huang

Re: mesos-execute stuck in subscribe loop

2016-12-27 Thread haosdent

Hi @Frank, could you share your master log? We usually check the master log
first.

On Tue, Dec 27, 2016 at 9:02 PM, Frank Scholten 
wrote:

> Hi,
>
> I am running Mesos 0.26.2 and the following command is stuck in a
> subscribe loop
>
> # GLOG_v=1 LIBPROCESS_IP=10.2.0.219  mesos-execute
> --master=10.2.1.116:5050 --name="cluster-test" --command="sleep 5"
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I1227 04:24:32.049841 41523 process.cpp:965] libprocess is initialized
> on 10.2.0.219:39749 for 48 cpus
> I1227 04:24:32.04 41523 logging.cpp:198] Logging to STDERR
> I1227 04:24:32.051537 41523 sched.cpp:166] Version: 0.26.2
> I1227 04:24:32.055392 41543 sched.cpp:264] New master detected at
> master@10.2.1.116:5050
> I1227 04:24:32.01 41543 sched.cpp:274] No credentials provided.
> Attempting to register without authentication
> I1227 04:24:32.055565 41543 sched.cpp:716] Sending SUBSCRIBE call to
> master@10.2.1.116:5050
> I1227 04:24:32.055624 41543 sched.cpp:749] Will retry registration in
> 587.356137ms if necessary
> I1227 04:24:32.644569 41543 sched.cpp:716] Sending SUBSCRIBE call to
> master@10.2.1.116:5050
> I1227 04:24:32.644649 41543 sched.cpp:749] Will retry registration in
> 2.884023157secs if necessary
>
> I ran tcpdump on the master
>
> # tcpdump -i bond0.7 dst 10.2.0.219
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on bond0.7, link-type EN10MB (Ethernet), capture size 65535 bytes
> 05:01:35.207079 IP cnlvr02r04c31.mmcc > agent1.cncf.io.52860: Flags
> [F.], seq 2751495045, ack 539538316, win 336, options [nop,nop,TS val
> 868111506 ecr 867845120], length 0
> 0x:  4500 0034 0b6f 4000 4006 1903 0a02 0174  E..4.o@.@..t
> 0x0010:  0a02 00db 13ba ce7c a400 7b85 2028 b38c  ...|..{..(..
> 0x0020:  8011 0150 899d  0101 080a 33be 5492  ...P3.T.
> 0x0030:  33ba 44003.D.
> 05:01:35.549544 IP cnlvr02r04c31.mmcc > agent1.cncf.io.52862: Flags
> [S.], seq 3372848870, ack 3473989761, win 26844, options [mss
> 8960,sackOK,TS val 868111591 ecr 867845205,nop,wscale 7], length 0
> 0x:  4500 003c  4000 4006 246a 0a02 0174  E..<..@.@.$j...t
> 0x0010:  0a02 00db 13ba ce7e c909 96e6 cf10 e081  ...~
> 0x0020:  a012 68dc b904  0204 2300 0402 080a  ..h...#.
> 0x0030:  33be 54e7 33ba 4455 0103 03073.T.3.DU
> 05:01:35.549746 IP cnlvr02r04c31.mmcc > agent1.cncf.io.52862: Flags
> [.], ack 318, win 219, options [nop,nop,TS val 868111591 ecr
> 867845205], length 0
> 0x:  4500 0034 c200 4000 4006 6271 0a02 0174  E..4..@.@.bq...t
> 0x0010:  0a02 00db 13ba ce7e c909 96e7 cf10 e1be  ...~
> 0x0020:  8010 00db 6be1  0101 080a 33be 54e7  k...3.T.
> 0x0030:  33ba 44553.DU
> 05:01:35.780313 IP cnlvr02r04c31.mmcc > agent1.cncf.io.52862: Flags
> [.], ack 635, win 227, options [nop,nop,TS val 868111649 ecr
> 867845263], length 0
> 0x:  4500 0034 c201 4000 4006 6270 0a02 0174  E..4..@.@.bp...t
> 0x0010:  0a02 00db 13ba ce7e c909 96e7 cf10 e2fb  ...~
> 0x0020:  8010 00e3 6a28  0101 080a 33be 5521  j(..3.U!
> 0x0030:  33ba 448f    3.D.
>
> but I can't see what's in the packet since it is an encoded protobuf
> message.
>
> How do you debug this kind of issue? What kind of tools do you use?
>
> Cheers,
>
> Frank
>



-- 
Best Regards,
Haosdent Huang

Re: Let MesosContainerizer support ramdisk.

2016-12-26 Thread haosdent

@bingqiang This patch looks like it may take some time to review. Could you
create an associated ticket at https://issues.apache.org/jira/browse/MESOS ?
Thank you!

On Tue, Dec 27, 2016 at 10:51 AM, pangbingqiang 
wrote:

> Hi All:
>
>   Since MesosContainerizer does not currently support ramdisk, we have
> implemented this feature. Please review it, and let me know if you have any
> questions. Thanks.
>
> https://reviews.apache.org/r/55042/
>
>
>
>
> Bingqiang Pang(庞兵强)
>
>
>
> Distributed and Parallel Software Lab
>
> Huawei Technologies Co., Ltd.
>
> Email:pangbingqi...@huawei.com 
>
>
>
>
>

-- 
Best Regards,
Haosdent Huang

Re: how to debug when a task is killed

2016-12-19 Thread haosdent

Have you configured a health check? If a health check is configured and it
does not pass, the task will be killed.
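
For reference, a Marathon app definition carries its health checks in a `healthChecks` list, so that is the place to look. The sketch below builds such a definition; the app id, command, path, and all numeric values are placeholders chosen for illustration, not recommendations, and the helper name is hypothetical.

```python
import json


def app_with_health_check(app_id, cmd, path="/health"):
    """Return a Marathon-style app definition with one HTTP health check."""
    return {
        "id": app_id,
        "cmd": cmd,
        "healthChecks": [
            {
                "protocol": "HTTP",
                "path": path,
                "portIndex": 0,
                # Time allowed after task start before failures count.
                "gracePeriodSeconds": 300,
                "intervalSeconds": 60,
                "timeoutSeconds": 20,
                # The task is killed once this many checks fail in a row.
                "maxConsecutiveFailures": 3,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(app_with_health_check("/proc-reader", "./run.sh"), indent=2))
```

If the deployed app has such a block and the endpoint stops answering, periodic kills followed by a Marathon restart would be the expected behaviour.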

On Tue, Dec 20, 2016 at 2:23 PM, Luke Adolph  wrote:

> Hi all:
>
> I have set up a Mesos cluster with one Mesos master and five Mesos agents.
> I use Marathon to deploy an app across the Mesos agents, which reads process
> info from /proc.
> About every 40 minutes, my apps are killed and Marathon restarts them.
> The stderr info in sandbox is:
> 
>
> I1220 05:05:12.014192 28736 exec.cpp:143] Version: 0.28.1
> I1220 05:05:12.017397 28740 exec.cpp:217] Executor registered on slave 
> 83e33a06-5794-4baa-a654-dd2ecfcd426d-S5
> 2016/12/20 05:05:12 status read fail.
> 2016/12/20 05:05:12 process id is: 8208
> 2016/12/20 05:05:12 open /proc/8208/status: no such file or directory
> 2016/12/20 05:06:16 status read fail.
> 2016/12/20 05:06:16 process id is: 8742
> 2016/12/20 05:06:16 open /proc/8742/status: no such file or directory
> 2016/12/20 05:07:16 status read fail.
> 2016/12/20 05:07:16 process id is: 9005
> 2016/12/20 05:07:16 open /proc/9005/status: no such file or directory
> 2016/12/20 05:25:50 status read fail.
> 2016/12/20 05:25:50 open /proc/17284/stat: no such file or directory
> Killed
>
> 
>
> Apart from the stderr output above, I have no meaningful information to
> provide or debug with.
> Could you share your experience with solving similar situations?
>
> Thanks very much！
>
> --
> Thanks & Best Regards
> 卢文泉 | Adolph Lu
> TEL: +86 15651006559
> Linker Networks(http://www.linkernetworks.com/)
>



-- 
Best Regards,
Haosdent Huang

Re: Mesos 1.1 web ui issues

2016-12-19 Thread haosdent

Hi @haripriya, ping me on the Mesos Slack (https://mesos.slack.com/) when you
are available; I think it would speed up solving your problem. My id is
@haosdent. If you have not joined the Mesos Slack before, you can join via
https://mesos-slackin.herokuapp.com .

On Tue, Dec 20, 2016 at 2:22 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Hi @Haosdent,
>
> We have multiple networks- that could be one of the problems. I tried with
> all 3 of them and it still shows the same error. Can you help me understand
> what hostname exactly expects in such scenario?
>
> On Thu, Dec 15, 2016 at 6:08 PM, haosdent  wrote:
>
>> Hi @haripriya, what is the hostname flag that you use to start the master?
>> According to the screenshot you posted before, I think you need to set it
>> to something like `socrates-nid000xxx.us.cray.com`.
>> However, according to the error log you posted above, you set the hostname
>> flag to nid00016, which could not be resolved.
>>
>> On Fri, Dec 16, 2016 at 6:51 AM, Haripriya Ayyalasomayajula <
>> aharipriy...@gmail.com> wrote:
>>
>>> Hello @Haosdent,
>>>
>>> After I tried to use hostname, I still see the error. This is the output
>>> I see in developer tools for chrome:
>>>
>>> Failed to load resource: the server responded with a status of 404 (Not
>>> Found)
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._2 Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._3 Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._4 Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._5 Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._6 Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._7 Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._8 Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._9 Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._a Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._b Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._c Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._d Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._e Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._f Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._g Failed to
>>> load resource: net::ERR_NAME_NOT_RESOLVED
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._h Failed
>>> to load resource: net::ERR_NAME_NOT_RESOLVED
>>> angular-1.2.3.min.js:70 GET
>>> http://nid00016:5050/master/state?jsonp=angular.callbacks._i
>>> net::ERR_NAME_NOT_RESOLVED
>>>   g @ angular-1.2.3.min.js:70
>>>   (anonymous function) @ angular-1.2.3.min.js:71
>>>   D @ angular-1.2.3.min.js:68
>>>   h @ angular-1.2.3.min.js:66
>>>   D @ angular-1.2.3.min.js:91
>>>   D @ angular-1.2.3.min.js:91
>>>   (anonymous function) @ angular-1.2.3.min.js:93
>>>   $eval @ angular-1.2.3.min.js:101
>>>   $digest @ angular-1.2.3.min.js:98
>>>   $apply @ angular-1.2.3.min.js:101
>>>   (anonymous function) @ angular-1.2.3.min.js:111
>>>   e @ angular-1.2.3.min.js:33
>>>   (anonymous function) @ angular-1.2.3.min.js:37
>>> angular-1.2.3.min.js:70 GET
>>> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._j
>>> net::ERR_NAME_NOT_RESOLVED
>>>   g @ angular-1.2.3.min.js:70
>>>   (anonymous function) @ angular-1.2.3.min.js:71
>>>   D @ angular-1.2.3.min.js:68
>>>   h @ angular-1.2.3.min.js:66
>>>   D @ angular-1.2.3.min.js:91
>>>   D @ angular-1.2.3.min.js:91
>>>   (anonymous function) @ angular-1.2.3.min.js:93
>>>   $eval @ angular-1.2.3.min.js:101
>>>   $digest @ angular-1.2.3.min.js:98
>>>   $apply @ angular-1.2.3.min.js:101
>>>   (anonymous function) @ angular-1.2.3.min.js:111
>>>   e

Re: Libraries to access Mesos HTTP endpoints

2016-12-19 Thread haosdent

As far as I know, there are no libraries for the v1 operator APIs so far.

> DC/OS CLI [2] seems including those features, but it's too much and not
programmable for me (unless I try parsing its output).
The DC/OS CLI is open source as well. You may refer to its implementation:
https://github.com/dcos/dcos-cli/blob/master/dcos/mesos.py
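
If a full library is more than you need, the endpoints return plain JSON, so a few lines of Python may be enough for listing agents and tasks. The following is only a rough sketch, not an official client: the function names are made up for illustration, and it assumes the `/master/state` document has a `slaves` list (each with a `hostname`) and a `frameworks` list (each with `tasks` that have a `name`).

```python
import json
from urllib.request import urlopen


def fetch_state(master_url):
    """GET <master_url>/master/state and return the decoded JSON document."""
    with urlopen(master_url.rstrip("/") + "/master/state", timeout=10) as resp:
        return json.load(resp)


def summarize_state(state):
    """Return (agent hostnames, task names) found in a /master/state document."""
    agents = [agent["hostname"] for agent in state.get("slaves", [])]
    tasks = [
        task["name"]
        for framework in state.get("frameworks", [])
        for task in framework.get("tasks", [])
    ]
    return agents, tasks


# Example usage (the master address below is hypothetical):
#   agents, tasks = summarize_state(fetch_state("http://localhost:5050"))
```

Keeping the parsing separate from the HTTP call makes it easy to try the helper against a saved copy of the state output first.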

On Mon, Dec 19, 2016 at 1:22 PM, Kota UENISHI <
ueni...@nautilus-technologies.com> wrote:

> Hi all,
>
> I've just started setting up and operating a small Mesos cluster. To
> watch Mesos cluster status, opening the web console with a browser is
> not just enough and I want programmable libraries that is easy to take
> values from all HTTP endpoints in the document [1]. Does anybody know
> a good library (in Java, Python or Go) to fetch arbitrary data from
> endpoints?
>
> My use cases are fetch list of agents, tasks and fetch various stats
> needed for operation.
>
> As long as I googled and walked through the document, there is no such
> library that supports v1 operator APIs in a good manner such as
> generating code from mesos.proto - even in C++ or Java.
>
> DC/OS CLI [2] seems including those features, but it's too much and
> not programmable for me (unless I try parsing its output).
>
> mesos.interface [3] seems maintained recently, but list of features
> not sufficient (or I just couldn't find documentation).
>
> [1] http://mesos.apache.org/documentation/latest/endpoints/
> [2] https://dcos.io/docs/1.8/usage/cli/command-reference/
> [3] https://pypi.python.org/pypi/mesos.interface/
>
> Kota UENISHI
>



-- 
Best Regards,
Haosdent Huang

Re: [MESOS-6240] Allow executor/agent communication over non-TCP/IP stream socket.

2016-12-19 Thread haosdent

> what reason for executors need communication with agent
Executors need to report task statuses to the agent, and the agent needs to
send launch-task commands to the executors.
If executors and agents are located in different network namespaces, they
cannot communicate with each other unless we support communication via a
domain socket.
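
To illustrate why a domain socket helps here: it is addressed by a filesystem path rather than an IP address and port, so two processes can talk as long as they can both reach that path. Below is a minimal sketch in Python. It is not Mesos or libprocess code; the function name, the socket file name and the "ACK" reply are invented for the example.

```python
import os
import socket
import tempfile
import threading


def report_status_over_unix_socket(message):
    """Send `message` from a fake executor to a fake agent over AF_UNIX.

    Returns the reply the fake agent sends back.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent.sock")  # hypothetical socket path
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)
        server.listen(1)

        def agent():
            # The "agent" accepts one connection and acknowledges the update.
            conn, _ = server.accept()
            with conn:
                data = conn.recv(1024)
                conn.sendall(b"ACK " + data)

        worker = threading.Thread(target=agent)
        worker.start()

        # The "executor" connects by path; no IP address or port is involved.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(path)
            client.sendall(message)
            reply = client.recv(1024)

        worker.join()
        server.close()
        return reply
```

For example, `report_status_over_unix_socket(b"TASK_RUNNING")` exchanges one message and one acknowledgement without touching the network stack.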

On Tue, Dec 20, 2016 at 7:22 AM, tommy xiao  wrote:

> don't understand what reason for executors need communication with agent?
>
>
> 2016-12-19 19:54 GMT+08:00 pangbingqiang :
>
>> Hi all:
>>
>> What’s the latest information about MESOS-6240
>> https://issues.apache.org/jira/browse/MESOS-6240 ? Is there any demo or
>> design available?
>>
>> I see that libprocess supports domain socket communication. Do the agent and
>> executor support communication over a domain socket too?
>>
>> If you have any related information, please let me know. Thanks.
>>
>>
>>
>>
>>
>>
>> Bingqiang Pang(庞兵强)
>>
>>
>>
>> Distributed and Parallel Software Lab
>>
>> Huawei Technologies Co., Ltd.
>>
>> Email:pangbingqi...@huawei.com 
>>
>>
>>
>>
>>
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>



-- 
Best Regards,
Haosdent Huang

Re: Welcome Haosdent Huang as Mesos Committer and PMC member!

2016-12-18 Thread haosdent

Thank you all! I have learned a lot from Mesos users and developers in the
community. Looking forward to contributing more to the community!

On Sun, Dec 18, 2016 at 10:02 PM, Klaus Ma  wrote:

> Congratulations!!
>
> On Sun, Dec 18, 2016 at 3:51 AM Dick Davies 
> wrote:
>
>> Well earned! Haosdent always seems to be around to help; I'm honestly
>> surprised he wasn't part of core already.
>>
>> On 16 December 2016 at 18:59, Vinod Kone  wrote:
>> > Hi folks,
>> >
>> > Please join me in formally welcoming Haosdent Huang as Mesos Committer
>> and
>> > PMC member.
>> >
>> > Haosdent has been an active contributor to the project for more than a
>> year
>> > now. He has contributed a number of patches and features to the Mesos
>> code
>> > base, most notably the unified cgroups isolator and health check
>> > improvements. The most impressive thing about him is that he always
>> > volunteers to help out people in the community, be it on slack/IRC or
>> > mailing lists. The fact that he does all this even though working on
>> Mesos
>> > is not part of his day job is even more impressive.
>> >
>> > Here is his more formal checklist for your perusal.
>> >
>> > Thanks,
>> > Vinod
>> >
>> > P.S: Sorry for the delay in sending the welcome email.
>>
> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Software Architect
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com |
> http://k82.me
>



-- 
Best Regards,
Haosdent Huang

Re: Welcome Guangya Liu as Mesos Committer and PMC member!

2016-12-18 Thread haosdent

Congrats Guangya!

On Sun, Dec 18, 2016 at 10:02 PM, Klaus Ma  wrote:

> Congratulations!!
>
> On Sat, Dec 17, 2016 at 1:23 PM Dharmesh Kakadia 
> wrote:
>
>> Congrats Guangya !
>>
>> Thanks,
>> Dharmesh
>>
>> On Fri, Dec 16, 2016 at 5:03 PM, Dario Rexin  wrote:
>>
>> Congrats!
>>
>> > On Dec 16, 2016, at 4:27 PM, Vinod Kone  wrote:
>> >
>> > Congrats Guangya! Welcome to the PMC!
>> >
>> >> On Fri, Dec 16, 2016 at 7:03 PM, Sam  wrote:
>> >> congratulations Guangya
>> >>
>> >> Sent from my iPhone
>> >>
>> >>> On 17 Dec 2016, at 3:23 AM, Avinash Sridharan 
>> wrote:
>> >>>
>> >>> Congrats Guangya !!
>> >>>
>> >>>> On Fri, Dec 16, 2016 at 11:20 AM, Greg Mann 
>> wrote:
>> >>>> Congratulations Guangya!!! :D
>> >>>>
>> >>>>> On Fri, Dec 16, 2016 at 11:10 AM, Jie Yu 
>> wrote:
>> >>>>> Hi folks,
>> >>>>>
>> >>>>> Please join me in formally welcoming Guangya Liu as Mesos Committer
>> and PMC
>> >>>>> member.
>> >>>>>
>> >>>>> Guangya has worked on the project for more than a year now and has
>> been a
>> >>>>> very active contributor to the project. I think one of the most
>> important
>> >>>>> contribution he has for the community is that he helped grow the
>> Mesos
>> >>>>> community in China. He initiated the Xian-Mesos-User-Group and
>> successfully
>> >>>>> organized two meetups which attracted more than 100 people from
>> Xi’an
>> >>>>> China. He wrote a handful of blogs and articles in Chinese tech
>> media which
>> >>>>> attracted a lot of interests in Mesos. He had given several talks
>> about
>> >>>>> Mesos at conferences in China.
>> >>>>>
>> >>>>> His major coding contribution to the project was the docker volume
>> driver
>> >>>>> isolator. He has also been involved in allocator performance
>> improvement,
>> >>>>> gpu support for docker containerizer, Mesos Tiers/Optimistic Offer
>> design,
>> >>>>> scarce resources discussion, and many others.
>> >>>>>
>> >>>>> His formal checklist is here:
>> >>>>> https://docs.google.com/document/d/1tot79kyJCTTgJHBhzStFKrVkDK4pX
>> >>>>> qfl-LHCLOovNtI/edit?usp=sharing
>> >>>>>
>> >>>>> Thanks,
>> >>>>> - Jie
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Avinash Sridharan, Mesosphere
>> >>> +1 (323) 702 5245
>> >
>>
>>
>> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Software Architect
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com |
> http://k82.me
>



-- 
Best Regards,
Haosdent Huang

Re: Mesos on AWS

2016-12-16 Thread haosdent

>  sometimes Mesos agent is launched but master doesn’t show them.
It sounds like the Mesos master could not connect to your agents. Would you
mind pasting your Mesos master log? Does anything in it show that the agents
were disconnected?

On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov 
wrote:

> I have my own framework. Sometimes I get TASK_LOST status with message
> slave lost during health check.
>
> Also I found sometimes Mesos agent is launched but master doesn’t show
> them. From agent I see that it found master and connected. After agent
> restart it start working.
>
> -Kiril
>
>
> On Dec 16, 2016, at 21:58, Zameer Manji  wrote:
>
> Hey,
>
> Could you detail on what you mean by "delays and health check problems"?
> Are you using your own framework or an existing one? How are you launching
> the tasks?
>
> Could you share logs from Mesos that show timeouts to ZK?
>
> For reference, I operate a large Mesos cluster and I have never
> encountered problems when running 1k tasks concurrently so I think sharing
> data would help everyone debug this problem.
>
> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov 
> wrote:
>
>> Hi,
>>
>> Has anybody tried running Mesos on AWS instances? Can you give me
>> recommendations?
>>
>> I am developing an elastic Mesos cluster (scaling AWS instances on demand).
>> Currently I have 3 master instances and run about 1000 tasks simultaneously.
>> I see delays and health check problems.
>>
>> About 400 tasks fit in one m4.10xlarge instance (160 GB RAM, 40 CPUs).
>>
>> At the moment I have increased the timeout in the ZooKeeper cluster. What can
>> I do to decrease timeouts?
>>
>> Also, how can I increase performance? The main bottleneck is that I have a
>> large number of tasks running simultaneously for an hour, after which I shut
>> them down or restart them (depending on how well they perform).
>>
>> -Kiril
>>
>> --
>> Zameer Manji
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Mesos 1.1 web ui issues

2016-12-15 Thread haosdent

Hi, @haripriya What's the hostname flag that you use to start the master?
According to the screenshot you posted before, I think you need to set it
to something like `socrates-nid000xxx.us.cray.com`.
However, according to the error log you posted above, you set the hostname
flag to nid00016, which could not be resolved.
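
A quick way to check this is to try resolving the advertised hostname from the machine where the browser runs, since that is where `net::ERR_NAME_NOT_RESOLVED` is raised. The snippet below is a small sketch using Python's standard `socket` module; the helper name `can_resolve` is just for illustration.

```python
import socket


def can_resolve(hostname):
    """Return True if this machine can resolve `hostname` to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


# Example usage, run on the workstation that opens the WebUI:
#   can_resolve("nid00016")
```

If this returns False on your workstation but True on the master itself, the WebUI will keep failing until `--hostname` is set to a name that the workstation can resolve.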

On Fri, Dec 16, 2016 at 6:51 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Hello @Haosdent,
>
> After I tried to use hostname, I still see the error. This is the output I
> see in developer tools for chrome:
>
> Failed to load resource: the server responded with a status of 404 (Not
> Found)
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._2 Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._3 Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._4 Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._5 Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._6 Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._7 Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._8 Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._9 Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._a Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._b Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._c Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._d Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._e Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._f Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/master/state?jsonp=angular.callbacks._g Failed to
> load resource: net::ERR_NAME_NOT_RESOLVED
> http://nid00016:5050/metrics/snapshot?jsonp=angular.callbacks._h Failed
> to load resource: net::ERR_NAME_NOT_RESOLVED
> angular-1.2.3.min.js:70 GET http://nid00016:5050/master/
> state?jsonp=angular.callbacks._i net::ERR_NAME_NOT_RESOLVEDg @
> angular-1.2.3.min.js:70(anonymous function) @ angular-1.2.3.min.js:71D @
> angular-1.2.3.min.js:68h @ angular-1.2.3.min.js:66D @
> angular-1.2.3.min.js:91D @ angular-1.2.3.min.js:91(anonymous function) @
> angular-1.2.3.min.js:93$eval @ angular-1.2.3.min.js:101$digest @
> angular-1.2.3.min.js:98$apply @ angular-1.2.3.min.js:101(anonymous
> function) @ angular-1.2.3.min.js:111e @ angular-1.2.3.min.js:33(anonymous
> function) @ angular-1.2.3.min.js:37
> angular-1.2.3.min.js:70 GET http://nid00016:5050/metrics/
> snapshot?jsonp=angular.callbacks._j net::ERR_NAME_NOT_RESOLVEDg @
> angular-1.2.3.min.js:70(anonymous function) @ angular-1.2.3.min.js:71D @
> angular-1.2.3.min.js:68h @ angular-1.2.3.min.js:66D @
> angular-1.2.3.min.js:91D @ angular-1.2.3.min.js:91(anonymous function) @
> angular-1.2.3.min.js:93$eval @ angular-1.2.3.min.js:101$digest @
> angular-1.2.3.min.js:98$apply @ angular-1.2.3.min.js:101(anonymous
> function) @ angular-1.2.3.min.js:111e @ angular-1.2.3.min.js:33(anonymous
> function) @ angular-1.2.3.min.js:37
>
>
> Also, regarding the "cluster flag", here is my output:
>
> nid00016: root 14940  2.5  0.0 2080192 85012 ?   Ssl  16:44   0:08
> /usr/sbin/mesos-master --zk=zk://192.168.0.1:2181,192.168.0.17:2181,
> 192.168.0.33:2181/mesos --port=5050 --log_dir=/var/log/mesos
> --acls=/etc/mesos_acls.json --authenticate_frameworks=true
> --cluster="socrates" --credentials=/etc/marathon-auth/credentials
> --hostname=nid00016 --quorum=2 --work_dir=/var/lib/mesos
>
> nid00016: root 14965  0.0  0.0 107892   612 ?S16:44   0:00
> logger -p user.info -t mesos-master[14940]
>
> nid00016: root 14966  0.0  0.0 107892   692 ?S16:44   0:00
> logger -p user.err -t mesos-master[14940]
>
> nid00016: root 15892  0.0  0.0 113116  1604 ?Ss   16:50   0:00
> bash -c ps -aux | grep mesos-master
>
> nid00016: root 15959  0.0  0.0 112644   948 ?S16:50   0:00
> grep mesos-master
>
> nid00032: root 30018  2.5  0.0 2670032 26480 ?   Ssl  16:44   0:08
> /usr/sbin/mesos-master --zk

Re: Proposal: mesosadm, the command to bootstrap the mesos cluster.

2016-12-13 Thread haosdent

We had a discussion about this in the China User Group before.
Jay Guo mentioned that a better way may be to remove ZooKeeper entirely
and use the replicated log to do leader election.
New users would then only need to start masters and agents in production,
without ZooKeeper or etcd.
The only necessary configuration item would be the master address list, which
would greatly reduce the overhead of getting started with Mesos.

On Tue, Dec 13, 2016 at 4:20 PM, Stephen Gran 
wrote:

> Hi,
>
> I'm quite happy with the current approach of bootstrapping a new agent
> with the location of zookeeper and a set of credentials.  This allows
> our automation code to make new agents join the cluster automatically.
>
> Not that I'm opposed to the two step process you propose, I'm sure we
> can make that happen automatically as well, but aside from making mesos
> look more like other solutions, does it bring semantics that would be
> useful?  ie, are there actions that 'mesosadm init' would initiate?  Or
> would this be purely an interactive way to do the same things you can do
> now by seeding out config files?
>
> Cheers,
>
> On 13/12/16 05:14, tommy xiao wrote:
> > Hi team,
> >
> >
> > I come from the China Mesos community. In today's group discussion, we came
> > across a topic: how to enhance the user's cluster experience?
> >
> > Because newcomers are the top resource for a community, improving the
> > current Mesos cluster installation steps would help us quickly bootstrap
> > the user community.
> >
> > why mesosadm?
> >
> > For example, the Swarm cluster setup steps are:
> >
> > 1. docker swarm init
> > 2. docker swarm join
> >
> > Similarly, the Kubernetes 1.5 cluster setup steps are:
> >
> > 1. kubeadm init
> > 2. kubeadm join --token  
> >
> > So I think the init/join style is a good experience for normal users. What
> > do you think?
> >
> >
> >
> > --
> > Deshi Xiao
> > Twitter: xds2000
> > E-mail: xiaods(AT)gmail.com
>
> --
> Stephen Gran
> Senior Technical Architect
>
> picture the possibilities | piksel.com
>



-- 
Best Regards,
Haosdent Huang

Re: Can I consider other framework tasks as a resource? Does it make sense?

2016-12-13 Thread haosdent

Hi, @Petr.

> Like if I want to run my task collocated with some other tasks on the
same node I have to make this decision somewhere.
Do you mean "POD" here?

In my case, if there are dependencies between my tasks, I use a database,
a message queue, or ZooKeeper to implement my requirements.

On Wed, Dec 14, 2016 at 3:09 AM, Petr Novak  wrote:

> Hello,
>
> I want to execute tasks which requires some other tasks from other
> framework(s) already running. I’m thinking where such logic/strategy/policy
> belongs in principle. I understand scheduling as a process to decide where
> to execute task according to some resources availability, typically CPU,
> mem, net, hdd etc.
>
>
>
> If my task require other tasks running could I generalize and consider
> that those tasks from other frameworks are kind of required resources and
> put this logic/strategy decisions into scheduler? Like if I want to run my
> task collocated with some other tasks on the same node I have to make this
> decision somewhere.
>
>
>
> Does it make any sense? I’m asking because I have never thought about
> other frameworks/tasks as “resources” so that I could put them into
> scheduler to satisfy my understanding of a scheduler. Or it rather belongs
> higher like to a framework, or lower to an executor? Should scheduler be
> dedicated to decisions about resources which are offered and am I mixing
> concepts?
>
>
>
> Or I just should keep distinction between resources and
> requirements/policies or whatever but anyway does this kind of logic still
> belongs to scheduler or it should be somewhere else? I’m trying to
> understand which logic should be in scheduler and what should go somewhere
> else.
>
>
>
> Many thanks,
>
> Petr
>
>
>



-- 
Best Regards,
Haosdent Huang

Re: Mesos 1.1 web ui issues

2016-12-06 Thread haosdent

Hi, @Haripriya It looks like there are some problems in your master flags.

> I'm attaching a snapshot of the error I've seen in Chrome with this
email. It'll be great if you can suggest if I'm missing any configuration
or if its some bug.
According to the screenshot you attached, the hostnames are incorrect on
your servers. The Mesos WebUI depends on them to find the leading master.
A workaround is to specify the `--hostname` flag when starting your
masters. For example, launch your masters with

```
$ mesos-master --hostname=socrates-nid000xxx.us.cray.com xxx
```

> Is it something to do with a stale state of mesos anywhere or the way I'm
passing cluster? I have a config file named cluster in /etc/mesos-master/
and when I restart the cluster it picks up the config files.

You need to ensure the flags of every master contain
`--cluster=your_cluster_name`.

Could you run `ps aux | grep mesos-master` on every master and paste
the output here?
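
If you have many masters, checking the `ps` output by eye gets tedious. Here is a rough Python sketch that reads a master command line and reports which of the two flags are missing. The helper names are hypothetical, and it only understands the `--name=value` form used in your command lines.

```python
import shlex


def parse_flags(command_line):
    """Map each `--name=value` token of a command line to its value."""
    flags = {}
    for token in shlex.split(command_line):
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            flags[name] = value
    return flags


def missing_flags(command_line, required=("cluster", "hostname")):
    """Return the required flag names that the command line does not set."""
    flags = parse_flags(command_line)
    return [name for name in required if name not in flags]
```

Because `shlex.split` handles shell quoting, `--cluster="cluster1"` and `--cluster=cluster1` both parse to the same value.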


On Wed, Dec 7, 2016 at 4:39 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Hello, @Haosdent,
>
> Thanks for suggesting these.
> I'm attaching a snapshot of the error I've seen in Chrome with this email.
> It'll be great if you can suggest if I'm missing any configuration or if
> its some bug.
>
> And for the second part, my `/master/state` end point does not return
> "cluster" anywhere. It returned 75k lines of json so I'm not pasting all of
> it.
> {
> "activated_slaves": 37.0,
> "build_date": "2016-11-16 01:31:49",
> "build_time": 1479259909.0,
> "build_user": "centos",
> "completed_frameworks": [
> {
> "active": true,
>   ..
>
>
>
> "start_time": 1480967418.42687,
> "unregistered_frameworks": [],
> "version": "1.1.0"
> }
>
> Is it something to do with a stale state of mesos anywhere or the way I'm
> passing cluster? I have a config file named cluster in /etc/mesos-master/
> and when I restart the cluster it picks up the config files.
>
> On Mon, Dec 5, 2016 at 6:24 PM, haosdent  wrote:
>
>> Hi, @Haripriya
>>
>> > (less than 1 min though the  jobs are running just fine).
>> > Is there any new configuration that has to be added?
>>
>> We change to use JSONP to send requests in WebUI since 1.0 May I have
>> your error log in Safari, Chrome and Firefox?
>> You could open it via https://developers.google.
>> com/web/tools/chrome-devtools/console/
>>
>> > The UI does not display the name of the cluster despite using the
>> --cluster flag.
>> --cluster flag works fine for me. May you paste your `/master/state`
>> endpoint at the email, I would like to check the value of `cluster` field
>> in it.
>>
>> On Tue, Dec 6, 2016 at 5:34 AM, Haripriya Ayyalasomayajula <
>> aharipriy...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have two issues with the web UI in Mesos 1.1
>>>
>>> 1.
>>>
>>> Earlier when I was using Mesos 0.28, mesos web UI would try to reconnect
>>> only when there are network issues or when there is a newly elected leader.
>>> After upgrade to 1.1, we see that it won't work (shows no leader is elected
>>> even when there is a leader elected and jobs are running happily ) on
>>> safari, works on chrome and firefox but tries to re-connect very often
>>> (less than 1 min though the  jobs are running just fine).
>>>
>>> Is there any new configuration that has to be added?
>>>
>>>
>>> 2. The UI does not display the name of the cluster despite using the
>>> --cluster flag.
>>>
>>> /usr/sbin/mesos-master --zk=zk://mesos1:2181,mesos2:2181,mesos3:2181/
>>> mesos --port=5050 --log_dir=/var/log/mesos --acls=/etc/mesos_acls.json
>>> --authenticate_frameworks=true --cluster="cluster1"
>>> --credentials=/etc/auth/credentials --quorum=2 --work_dir=/var/lib/mesos
>>>
>>>
>>> I also tried adding the name of the cluster without quotes: cluster1
>>> instead of "cluster1", but that doesn't work either.
>>>
>>> /usr/sbin/mesos-master --zk=zk://mesos1:2181,mesos2:2181,mesos3:2181/
>>> mesos --port=5050 --log_dir=/var/log/mesos --acls=/etc/mesos_acls.json
>>> --authenticate_frameworks=true --cluster=cluster1
>>> --credentials=/etc/auth/credentials --quorum=2 --work_dir=/var/lib/mesos
>>> I greatly appreciate any help!
>>>
>>> --
>>> Thanks,
>>> Haripriya
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Thanks,
> Haripriya
>
>


-- 
Best Regards,
Haosdent Huang

Re: Mesos 1.1 web ui issues

2016-12-05 Thread haosdent

Hi, @Haripriya

> (less than 1 min though the  jobs are running just fine).
> Is there any new configuration that has to be added?

We switched to JSONP for sending requests in the WebUI since 1.0. May I have
your error log from Safari, Chrome and Firefox?
In Chrome, you can open the console as described at
https://developers.google.com/web/tools/chrome-devtools/console/

> The UI does not display the name of the cluster despite using the
--cluster flag.
The --cluster flag works fine for me. Could you paste the output of your
`/master/state` endpoint in the email? I would like to check the value of the
`cluster` field in it.
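
If the full output is too large to paste, you could extract just that field yourself. The sketch below is an illustration only: it assumes a JSONP response has the usual `callback(<json>);` shape and that the cluster name, when set, appears under a top-level `cluster` key. The function names are made up.

```python
import json
import re

# Matches "some.callback.name( ... );" and captures what is inside the parentheses.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(body):
    """Decode a response body that is either plain JSON or JSONP-wrapped JSON."""
    match = _JSONP.match(body)
    payload = match.group(1) if match else body
    return json.loads(payload)


def cluster_name(body):
    """Return the `cluster` field of a state response, or None if it is absent."""
    return unwrap_jsonp(body).get("cluster")
```

A result of None means the master is not reporting a cluster name at all, which would point at the flags rather than at the WebUI.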

On Tue, Dec 6, 2016 at 5:34 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Hi all,
>
> I have two issues with the web UI in Mesos 1.1
>
> 1.
>
> Earlier when I was using Mesos 0.28, mesos web UI would try to reconnect
> only when there are network issues or when there is a newly elected leader.
> After upgrade to 1.1, we see that it won't work (shows no leader is elected
> even when there is a leader elected and jobs are running happily ) on
> safari, works on chrome and firefox but tries to re-connect very often
> (less than 1 min though the  jobs are running just fine).
>
> Is there any new configuration that has to be added?
>
>
> 2. The UI does not display the name of the cluster despite using the
> --cluster flag.
>
> /usr/sbin/mesos-master --zk=zk://mesos1:2181,mesos2:2181,mesos3:2181/mesos
> --port=5050 --log_dir=/var/log/mesos --acls=/etc/mesos_acls.json
> --authenticate_frameworks=true --cluster="cluster1" 
> --credentials=/etc/auth/credentials
> --quorum=2 --work_dir=/var/lib/mesos
>
>
> I also tried adding the name of the cluster without quotes: cluster1
> instead of "cluster1", but that doesn't work either.
>
> /usr/sbin/mesos-master --zk=zk://mesos1:2181,mesos2:2181,mesos3:2181/mesos
>  --port=5050 --log_dir=/var/log/mesos --acls=/etc/mesos_acls.json
> --authenticate_frameworks=true --cluster=cluster1 
> --credentials=/etc/auth/credentials
> --quorum=2 --work_dir=/var/lib/mesos
> I greatly appreciate any help!
>
> --
> Thanks,
> Haripriya
>



-- 
Best Regards,
Haosdent Huang

Re: Failure reason documentation

2016-12-04 Thread haosdent

Oh, sorry for misunderstanding the question. As far as I know, there is
no documentation for that. We should add some comments to the reason enums.
I created a ticket at https://issues.apache.org/jira/browse/MESOS-6686 to
track it.

On Mon, Dec 5, 2016 at 2:27 AM, Erik Weathers  wrote:

> I think he's looking for documentation about what precisely each reason
> *means*. A la how there are comments beside the TaskState list in
> mesos.proto.
>
> - Erik
>
> On Sun, Dec 4, 2016 at 10:07 AM haosdent  wrote:
>
> Hi @Wil You could find them here https://github.com/
> apache/mesos/blob/1.1.0/include/mesos/mesos.proto#L1577-L1609
>
> On Sat, Dec 3, 2016 at 6:09 AM, Wil Yegelwel  wrote:
>
> No I'm referring to the values of the enum Reason.
>
> On Fri, Dec 2, 2016, 4:52 PM Tomek Janiszewski  wrote:
>
> Hi
>
> Are you referring to task state? If yes then take a look at comments in
> proto https://github.com/apache/mesos/blob/master/include/
> mesos/mesos.proto#L1552  http://mesos.apache.org/api/
> latest/java/org/apache/mesos/Protos.TaskState.html
>
> Best
>
> Tomek
>
> pt., 2.12.2016, 21:31 użytkownik Wil Yegelwel 
> napisał:
>
> Hey mesos users!
>
> I can't seem to find any documentation about the various reasons mesos
> includes when a job fails. Is there a place that describes what the reasons
> mean?
>
> Thanks,
> Wil
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>


-- 
Best Regards,
Haosdent Huang

Re: Failure reason documentation

2016-12-04 Thread haosdent

Hi @Wil You could find them here
https://github.com/apache/mesos/blob/1.1.0/include/mesos/mesos.proto#L1577-L1609

On Sat, Dec 3, 2016 at 6:09 AM, Wil Yegelwel  wrote:

> No I'm referring to the values of the enum Reason.
>
> On Fri, Dec 2, 2016, 4:52 PM Tomek Janiszewski  wrote:
>
>> Hi
>>
>> Are you referring to task state? If yes then take a look at comments in
>> proto https://github.com/apache/mesos/blob/master/include/
>> mesos/mesos.proto#L1552  http://mesos.apache.org/api/
>> latest/java/org/apache/mesos/Protos.TaskState.html
>>
>> Best
>>
>> Tomek
>>
>> pt., 2.12.2016, 21:31 użytkownik Wil Yegelwel 
>> napisał:
>>
>> Hey mesos users!
>>
>> I can't seem to find any documentation about the various reasons mesos
>> includes when a job fails. Is there a place that describes what the reasons
>> mean?
>>
>> Thanks,
>> Wil
>>
>>


-- 
Best Regards,
Haosdent Huang

Re: MESOS-6233 Allow agents to re-register post a host reboot

2016-12-04 Thread haosdent

> we can have the agent remove `rm -f /meta/slaves/latest`
automatically upon recovery failure but only after the host has rebooted.
This sounds dangerous. When the difference in AgentInfo is caused by an
operator's typo, I think the operator would prefer to correct it and try
to start the agent again, rather than have it removed automatically.

But if we decide to do that, please make sure to announce this behavior
change to the mailing lists in a separate email. Thank you!

On Wed, Nov 30, 2016 at 6:24 AM, tommy xiao  wrote:

> agree with james's options.
>
> 2016-11-30 0:48 GMT+08:00 James Peach :
>
> >
> > > On Nov 28, 2016, at 6:09 PM, Yan Xu  wrote:
> > >
> > > So one thing that was brought up during offline conversations was that
> > if the host reboot is associated with hardware change (e.g., a new memory
> > stick):
> > >
> > >   • Currently: the agent would skip the recovery (and the chance of
> > running into incompatible agent info) and register as a new agent.
> > >   • With the change: the agent could run into incompatible agent
> > info due to resource change and flap indefinitely until the operator
> > intervenes.
> > >
> > > To mitigate this and maintain the current behavior, we can have the
> > agent remove `rm -f /meta/slaves/latest` automatically upon
> > recovery failure but only after the host has rebooted. This way the agent
> > can restart as a new agent without operator intervention.
> > >
> > > Any thoughts?
> >
> > I still think you need a mechanism for the master/agent to tell you
> > whether it will honor the restart policy. Without this, you have to lock
> > the framework to a Mesos version.
> >
> > An empty RestartPolicy is also problematic since it precludes using
> > RestartPolicy in pods. If you later want to restart a task inside a pod
> but
> > not across agent restarts you would have no way to express that.
> >
> > J
>
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>



-- 
Best Regards,
Haosdent Huang

Re: mesos leader switch may lead orphan tasks

2016-11-29 Thread haosdent

I checked with Chengwei privately. The orphan tasks should no longer exist
after the frameworks subscribed successfully, because we could not find them
in the `orphan_tasks` field of the master/state endpoint.
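
For anyone who wants to repeat that check, a tiny Python sketch is below. It assumes the state document has already been decoded from JSON into a dict and that each entry under `orphan_tasks` carries an `id` key; the helper name is hypothetical.

```python
def orphan_task_ids(state):
    """Return the ids listed under `orphan_tasks` in a master state document."""
    return [task["id"] for task in state.get("orphan_tasks", [])]
```

An empty list after all frameworks have re-subscribed matches what we saw.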

On Tue, Nov 29, 2016 at 9:53 PM, Chengwei Yang 
wrote:

> On Tue, Nov 29, 2016 at 09:31:08PM +0800, haosdent wrote:
> > Do your jobs scheduled by marathon or your framework?
>
> We started 3 frameworks(marathon, storm, chronos) before upgrading.
>
> Here is the relative logs from the leading master
>
> -8<
> ...
> I1129 14:11:44.009774  6862 master.cpp:7460] Adding task
> ct:TEST_JOB0_1480396486890:4 with resources cpus(*):4.9; mem(*):64;
> disk(*):256 on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009842  6862 master.cpp:7460] Adding task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1;
> mem(*):1024; ports(*):[31000-31000] on agent 
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009891  6862 master.cpp:7460] Adding task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1;
> mem(*):1024; ports(*):[31000-31000] on agent 
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009953  6862 master.cpp:7460] Adding task
> test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 with resources cpus(*):1;
> mem(*):128; ports(*):[31417-31418] on agent 
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010197  6860 leveldb.cpp:341] Persisting action (18 bytes)
> to leveldb took 455974ns
> W1129 14:11:44.010202  6862 master.cpp:6569] Possibly orphaned task
> test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 of framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0002 running on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010213  6860 replica.cpp:712] Persisted action at 102
> W1129 14:11:44.010249  6862 master.cpp:6569] Possibly orphaned task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 8e87ed68-434d-4267-b83d-c6a509266a03- running on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010406  6862 master.cpp:6569] Possibly orphaned task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0004 running on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010429  6862 master.cpp:6569] Possibly orphaned task
> ct:TEST_JOB0_1480396486890:4 of framework 
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0003
> running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@
> 10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010447  6862 master.cpp:6596] Possibly orphaned completed
> task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942- that ran on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010645  6860 hierarchical.cpp:476] Added agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 
> (mesos-master-dev051-cqdx.qiyi.virtual)
> with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000]
> (allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418];
> disk(*):256)
> I1129 14:11:44.010646  6862 master.cpp:4885] Re-registered agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604;
> disk(*):297130; ports(*):[31000-32000]
> I1129 14:11:44.010764  6862 master.cpp:4953] Sending updated checkpointed
> resources  to agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@
> 10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.011076  6860 replica.cpp:691] Replica received learned
> notice for position 102 from @0.0.0.0:0
> I1129 14:11:44.011338  6861 master.cpp:5015] Received update of agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual) with total oversubscribed
> resources
> I1129 14:11:44.011404  6861 hierarchical.cpp:540] Agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 
> (mesos-master-dev051-cqdx.qiyi.virtual)
> updated with oversubscribed resources  (total: cpus(*):8; mem(*):14604;
> disk(*):297130; ports(*):[31000-32000], allocated: cpus(*):8; mem(*):2280;
> ports(*):[31000-31000, 31417-31418]; disk(*):256)
> I1129 14:11

Re: mesos leader switch may lead orphan tasks

2016-11-29 Thread haosdent

Are your jobs scheduled by Marathon or by your own framework?

On Tue, Nov 29, 2016 at 7:20 PM, Chengwei Yang 
wrote:

> Hi there,
>
> We're upgrading mesos from 0.28.2 to 1.0.2 and we found an interesting
> problem.
>
> We followed the official upgrade guide, so we first upgraded the 2
> following mesos-masters, and then the leading master.
>
> Once the leading master was upgraded, leadership switched to another 1.0.2
> mesos-master.
>
> Now, stop here.
>
> We found the following in the new leading master's log.
>
> ```
> ...
> Adding task ...
> Adding task ...
> ...
> SUBSCRIBE framework
> SUBSCRIBE framework
> ...
> ```
>
> So the problem is that when it adds the existing tasks, it cannot find the
> corresponding framework, so the tasks become **Orphan**.
>
> Is this a known preempt issue or am I missing anything?
>
> --
> Thanks,
> Chengwei
>



-- 
Best Regards,
Haosdent Huang

Re: Force offer from all of the slaves

2016-11-27 Thread haosdent

> I choose the right offer and decline the rest.
Hi, @krishnanvr. Do you use up all the available resources in that agent's
offer? If so, that agent cannot provide offers again until those resources
are released.

You may also consider starting the master with the `GLOG_v=1` environment
variable, which prints more detailed logs to help you debug this.
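
To illustrate the first point, here is a deliberately simplified model in
Python. The helper names are made up, and the real allocator also considers
roles, filters and other settings, so treat it only as a sketch of why a
fully used agent disappears from the next round of offers:

```python
def remaining_after_launch(offered, used):
    """Subtract the scalar resources a task uses from what an offer held."""
    return {name: offered[name] - used.get(name, 0) for name in offered}


def can_reoffer(remaining):
    """An agent with nothing left has nothing to put into a new offer."""
    return any(amount > 0 for amount in remaining.values())


if __name__ == "__main__":
    offer = {"cpus": 4, "mem": 1024}
    # The task takes everything that was offered, so nothing can be re-offered.
    left = remaining_after_launch(offer, {"cpus": 4, "mem": 1024})
    print(left, can_reoffer(left))
```

If the launched task uses only part of the offer, the remainder can be
offered again.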

On Sat, Nov 26, 2016 at 5:05 PM, Krishnanarayanan VR  wrote:

> Hello:
>
> Is there a way to force ResourceOffers to get offers from all available
> slaves ?
>
> Let me clarify:
>
> I have a single framework in my cluster. Each time ResourceOffers gets the
> list of offers, I choose the right offer and decline the rest. But I notice
> that next time a callback to ResourceOffers occurs, only a subset of slaves
> is present in the offer. The slave from offer that was chosen in the
> previous iteration is invariably absent.
>
> I also tried to set refuse_seconds to 0 in  both LaunchTasks and
> Decline(egs below):
>
> driver.DeclineOffer(offer.Id, &mesos.Filters{RefuseSeconds:
> proto.Float64(0)})
>
> ^^ but that didn't seem to help.
>
> Any pointers on how I can make sure I am presented with offers from all
> the slaves all the time ?
>
> Thanks
>
>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Question on Mesos 1.1.0 LaunchGroup

2016-11-21 Thread haosdent

Hi, @Qi Feng. Thank you for your reply. I'm afraid Mesos may not have plans
to support PODs for the Docker containerizer, since users can continue to use
PODs with Docker images via the Mesos containerizer with no big differences.

The Mesos containerizer has been stable and production ready for several
years. If your company has any concerns or requirements about it that I
could help with, please let me know.  :)

On Mon, Nov 21, 2016 at 5:39 PM, Qi Feng  wrote:

> Thanks for the reply.
>
>
> To be honest, we didn't start out with the Mesos containerizer. We used
> Docker first, and then tried Mesos as a scheduler.
>
> It's true that I could control more with a Mesos framework. But it's much
> more costly than k8s.
>
> Switching to another containerizer technology may be simple and easy for
> me, but it seems impossible for my company for some time.
> --
> *From:* haosdent 
> *Sent:* Monday, November 21, 2016 9:08:23 AM
> *To:* user
> *Subject:* Re: Question on Mesos 1.1.0 LaunchGroup
>
> Hi, @Qi Feng. Actually you could continue to use docker image via Mesos
> container. You could refer to
> https://github.com/apache/mesos/blob/master/docs/container-image.md for
> more details.
>
>
> On Mon, Nov 21, 2016 at 5:04 PM, Qi Feng  wrote:
>
>> I don't understand why we should leave Docker. Could we have launch_group
>> for Docker in the future?
>> Or can we only write an executor for that?
>>
>> Thanks.
>>
>> --
>> *From:* haosdent 
>> *Sent:* Monday, November 21, 2016 8:18:13 AM
>>
>> *To:* user
>> *Subject:* Re: Question on Mesos 1.1.0 LaunchGroup
>>
>> Yep, only mesos container is supported.
>>
>> On Mon, Nov 21, 2016 at 4:14 PM, Qi Feng  wrote:
>>
>>> Thanks haosdent.
>>>
>>>
>>> I tried to use a Docker ContainerInfo to launch a task group, but got
>>> "Docker ContainerInfo is not supported on the task".
>>> Does it support the Mesos containerizer only?
>>> --
>>> *From:* haosdent 
>>> *Sent:* Friday, November 18, 2016 4:54:07 PM
>>> *To:* user
>>> *Subject:* Re: Question on Mesos 1.1.0 LaunchGroup
>>>
>>> Hi, @Qi You may refer to `mesos-execute` for how to build `LaunchGroup`:
>>> https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L498-L524
>>>
>>> ```
>>>  operation->set_type(Offer::Operation::LAUNCH_GROUP);
>>>
>>>  ExecutorInfo* executorInfo =
>>>operation->mutable_launch_group()->mutable_executor();
>>>
>>>  executorInfo->set_type(ExecutorInfo::DEFAULT);
>>>  executorInfo->mutable_executor_id()->set_value(
>>>  "default-executor");
>>> ...
>>> ```
>>> As you see, executor-id is a string here and you could use any string to
>>> identify the executor.
>>>
>>> On Fri, Nov 18, 2016 at 3:47 PM, Qi Feng  wrote:
>>>
>>>> I'm trying the LaunchGroup feature.
>>>>
>>>> But I find that an executorInfo is required.
>>>>
>>>>
>>>> message LaunchGroup {
>>>>   required ExecutorInfo executor = 1;
>>>>   required TaskGroupInfo task_group = 2;
>>>> }
>>>>
>>>> What's more, an executor id is required in executorInfo. How would I
>>>> build the executorInfo if I use the default executor of mesos?
>>>> https://github.com/apache/mesos/blob/1.1.x/include/mesos/mesos.proto#L566
>>>>
>>>> Thanks for any reply.
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Question on Mesos 1.1.0 LaunchGroup

2016-11-21 Thread haosdent

Hi, @Qi Feng. Actually, you can continue to use Docker images via the Mesos
containerizer. Refer to
https://github.com/apache/mesos/blob/master/docs/container-image.md for
more details.

On Mon, Nov 21, 2016 at 5:04 PM, Qi Feng  wrote:

> I don't understand why we should leave Docker. Could we have launch_group
> for Docker in the future?
> Or can we only write an executor for that?
>
> Thanks.
>
> ------
> *From:* haosdent 
> *Sent:* Monday, November 21, 2016 8:18:13 AM
>
> *To:* user
> *Subject:* Re: Question on Mesos 1.1.0 LaunchGroup
>
> Yep, only mesos container is supported.
>
> On Mon, Nov 21, 2016 at 4:14 PM, Qi Feng  wrote:
>
>> Thanks haosdent.
>>
>>
>> I tried to use a Docker ContainerInfo to launch a task group, but got
>> "Docker ContainerInfo is not supported on the task".
>> Does it support the Mesos containerizer only?
>> --
>> *From:* haosdent 
>> *Sent:* Friday, November 18, 2016 4:54:07 PM
>> *To:* user
>> *Subject:* Re: Question on Mesos 1.1.0 LaunchGroup
>>
>> Hi, @Qi You may refer to `mesos-execute` for how to build `LaunchGroup`:
>> https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L498-L524
>>
>> ```
>>  operation->set_type(Offer::Operation::LAUNCH_GROUP);
>>
>>  ExecutorInfo* executorInfo =
>>operation->mutable_launch_group()->mutable_executor();
>>
>>  executorInfo->set_type(ExecutorInfo::DEFAULT);
>>  executorInfo->mutable_executor_id()->set_value(
>>  "default-executor");
>> ...
>> ```
>> As you see, executor-id is a string here and you could use any string to
>> identify the executor.
>>
>> On Fri, Nov 18, 2016 at 3:47 PM, Qi Feng  wrote:
>>
>>> I'm trying the LaunchGroup feature.
>>>
>>> But I find that an executorInfo is required.
>>>
>>>
>>> message LaunchGroup {
>>>   required ExecutorInfo executor = 1;
>>>   required TaskGroupInfo task_group = 2;
>>> }
>>>
>>> What's more, an executor id is required in executorInfo. How would I
>>> build the executorInfo if I use the default executor of mesos?
>>> https://github.com/apache/mesos/blob/1.1.x/include/mesos/mesos.proto#L566
>>>
>>> Thanks for any reply.
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Question on Mesos 1.1.0 LaunchGroup

2016-11-21 Thread haosdent

Yep, only the Mesos containerizer is supported.

On Mon, Nov 21, 2016 at 4:14 PM, Qi Feng  wrote:

> Thanks haosdent.
>
>
> I tried to use a Docker ContainerInfo to launch a task group, but got
> "Docker ContainerInfo is not supported on the task".
> Does it support the Mesos containerizer only?
> ------
> *From:* haosdent 
> *Sent:* Friday, November 18, 2016 4:54:07 PM
> *To:* user
> *Subject:* Re: Question on Mesos 1.1.0 LaunchGroup
>
> Hi, @Qi You may refer to `mesos-execute` for how to build `LaunchGroup`:
> https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L498-L524
>
> ```
>  operation->set_type(Offer::Operation::LAUNCH_GROUP);
>
>  ExecutorInfo* executorInfo =
>operation->mutable_launch_group()->mutable_executor();
>
>  executorInfo->set_type(ExecutorInfo::DEFAULT);
>  executorInfo->mutable_executor_id()->set_value(
>  "default-executor");
> ...
> ```
> As you see, executor-id is a string here and you could use any string to
> identify the executor.
>
> On Fri, Nov 18, 2016 at 3:47 PM, Qi Feng  wrote:
>
>> I'm trying the LaunchGroup feature.
>>
>> But I find that an executorInfo is required.
>>
>>
>> message LaunchGroup {
>>   required ExecutorInfo executor = 1;
>>   required TaskGroupInfo task_group = 2;
>> }
>>
>> What's more, an executor id is required in executorInfo. How would I
>> build the executorInfo if I use the default executor of mesos?
>> https://github.com/apache/mesos/blob/1.1.x/include/mesos/mesos.proto#L566
>>
>> Thanks for any reply.
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Running mesos-slave in the docker that leave many zombie process

2016-11-20 Thread haosdent

Passing the `--pid=host` flag when starting the Docker container may resolve
this:

> start the mesos_slave container with "--pid=host" so that it uses the
> process namespace of the host.

On Mon, Nov 21, 2016 at 2:30 PM, haosdent  wrote:

> Not sure if it is related to this issue:
> https://github.com/mesosphere/docker-containers/issues/9
>
> On Mon, Nov 21, 2016 at 12:27 PM, X Brick  wrote:
>
>> Hi,
>>
>> I hit a problem when running mesos-slave in Docker: it leaves some zombie
>> processes behind.
>>
>> ```
>> root 10547 19464  0 Oct25 ?        00:00:00 [docker]
>> root 14505 19464  0 Oct25 ?        00:00:00 [docker]
>> root 16069 19464  0 Oct25 ?        00:00:00 [docker]
>> root 19962 19464  0 Oct25 ?        00:00:00 [docker]
>> root 23346 19464  0 Oct25 ?        00:00:00 [docker]
>> root 24544 19464  0 Oct25 ?        00:00:00 [docker]
>> ```
>>
>> And I find the zombies come from mesos-slave process:
>>
>> ```
>> pstree -p -s 10547
>> systemd(1)───docker-containe(19448)───mesos-slave(19464)───docker(10547)
>> ```
>>
>> The logs were deleted by a cron job a few weeks ago, but I remember
>> seeing many `Failed to shutdown socket with fd xx: Transport endpoint is
>> not connected` messages in the log.
>>
>> I reported this in JIRA: https://issues.apache.org/jira/browse/MESOS-6615
>>
>> Has anyone seen this issue before?
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Running mesos-slave in the docker that leave many zombie process

2016-11-20 Thread haosdent

Not sure if it is related to this issue:
https://github.com/mesosphere/docker-containers/issues/9

On Mon, Nov 21, 2016 at 12:27 PM, X Brick  wrote:

> Hi,
>
> I hit a problem when running mesos-slave in Docker: it leaves some zombie
> processes behind.
>
> ```
> root 10547 19464  0 Oct25 ?        00:00:00 [docker]
> root 14505 19464  0 Oct25 ?        00:00:00 [docker]
> root 16069 19464  0 Oct25 ?        00:00:00 [docker]
> root 19962 19464  0 Oct25 ?        00:00:00 [docker]
> root 23346 19464  0 Oct25 ?        00:00:00 [docker]
> root 24544 19464  0 Oct25 ?        00:00:00 [docker]
> ```
>
> And I find the zombies come from mesos-slave process:
>
> ```
> pstree -p -s 10547
> systemd(1)───docker-containe(19448)───mesos-slave(19464)───docker(10547)
> ```
>
> The logs were deleted by a cron job a few weeks ago, but I remember
> seeing many `Failed to shutdown socket with fd xx: Transport endpoint is
> not connected` messages in the log.
>
> I reported this in JIRA: https://issues.apache.org/jira/browse/MESOS-6615
>
> Has anyone seen this issue before?
>



-- 
Best Regards,
Haosdent Huang

Re: Question on Mesos 1.1.0 LaunchGroup

2016-11-18 Thread haosdent

Hi, @Qi. You may refer to `mesos-execute` for how to build `LaunchGroup`:
https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L498-L524

```
operation->set_type(Offer::Operation::LAUNCH_GROUP);

ExecutorInfo* executorInfo =
  operation->mutable_launch_group()->mutable_executor();

executorInfo->set_type(ExecutorInfo::DEFAULT);
executorInfo->mutable_executor_id()->set_value("default-executor");
...
```
As you can see, the executor id is a string here, and you can use any string
to identify the executor.
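
For a scheduler that talks to the v1 HTTP API directly, the same operation
can be written as JSON. Below is a minimal sketch in Python, assuming the
JSON form of the v1 protobuf messages. The helper name
`launch_group_operation` and the example IDs are made up for illustration,
and a real request would typically also carry executor resources and
complete TaskInfo entries:

```python
import json


def launch_group_operation(framework_id, executor_id, tasks):
    """Build a LAUNCH_GROUP offer operation as a plain dict (JSON form)."""
    return {
        "type": "LAUNCH_GROUP",
        "launch_group": {
            "executor": {
                # DEFAULT selects the built-in default executor.
                "type": "DEFAULT",
                # Any string the framework chooses to identify the executor.
                "executor_id": {"value": executor_id},
                "framework_id": {"value": framework_id},
            },
            # The tasks that should be launched together as one group.
            "task_group": {"tasks": tasks},
        },
    }


if __name__ == "__main__":
    op = launch_group_operation("example-framework-id", "default-executor", [])
    print(json.dumps(op, indent=2))
```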

On Fri, Nov 18, 2016 at 3:47 PM, Qi Feng  wrote:

> I'm trying the LaunchGroup feature.
>
> But I find that an executorInfo is required.
>
>
> message LaunchGroup {
>   required ExecutorInfo executor = 1;
>   required TaskGroupInfo task_group = 2;
> }
>
> What's more, an executor id is required in executorInfo. How would I build
> the executorInfo if I use the default executor of mesos?
> https://github.com/apache/mesos/blob/1.1.x/include/mesos/mesos.proto#L566
>
> Thanks for any reply.
>



-- 
Best Regards,
Haosdent Huang

Re: Implementation examples for framework using V1 APIs for Scala/Java

2016-11-14 Thread haosdent

You may refer to the example framework in mesos-rxjava as well:
https://github.com/mesosphere/mesos-rxjava/tree/master/mesos-rxjava-example/mesos-rxjava-example-framework

On Sun, Nov 13, 2016 at 10:36 PM, David Greenberg 
wrote:

> There's also a book (disclaimer: I am the author) about framework
> development. The examples are in Java. Here's a link:
> http://shop.oreilly.com/product/mobile/0636920039952.do
>
> On Sun, Nov 13, 2016 at 2:38 AM Tomek Janiszewski 
> wrote:
>
>> Hi
>>
>> Here are slides from Dario Rexin's "Writing a Mesos HTTP API Client"
>> presentation at MesosCon EU:
>> http://schd.ws/hosted_files/mesosconeu2016/e6/mesoscon_eu_2016.pdf
>> Unfortunately the event was not recorded.
>>
>> —
>>
>>
>> Tomek
>>
>> On Sun, Nov 13, 2016 at 11:26, Petr Novak wrote:
>>
>> Hello,
>>
>> Are there any examples/guides I can take a look?
>>
>>
>>
>> Many thanks,
>>
>> Petr
>>
>>


-- 
Best Regards,
Haosdent Huang

Re: Mesos containerizer & isolation

2016-11-02 Thread haosdent

> - Is it possible to hide host processes from the container?

You may consider using the namespaces/pid isolator: add `namespaces/pid`
to the `--isolation` flag when launching the Mesos agent.

> - Is it possible to run processes that open network ports (possibly
> already open on the host system) and have them mapped to different ports on
> the host system, just as with Docker's `-p`?

You need to use CNI port mapping. Refer to its documentation:
https://reviews.apache.org/r/53015/

> Is there any method (except `sudo`/`setuser`) to achieve running as a
> user present in the image's /etc/fstab?

Mesos doesn't support user namespaces yet, so you need to use `su` to
switch users.
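
For the first answer, the `--isolation` flag takes a comma-separated list of
isolators. Here is a small sketch in Python of appending one to an existing
value. The helper `add_isolator` and the example starting value are only
illustrations, not something Mesos ships:

```python
def add_isolator(isolation, isolator):
    """Append an isolator to a comma-separated --isolation value once."""
    parts = [part for part in isolation.split(",") if part]
    if isolator not in parts:
        parts.append(isolator)
    return ",".join(parts)


if __name__ == "__main__":
    value = add_isolator("cgroups/cpu,cgroups/mem", "namespaces/pid")
    # The result is what would be passed to the agent as --isolation=<value>.
    print("--isolation=" + value)
```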

On Thu, Nov 3, 2016 at 9:56 AM, Tobias Pfeiffer  wrote:

> Actually, say I was in a fancy mood, could I actually *not* use the Docker
> image provider and instead run `nvidia-docker run [more hand-crafted
> parameters] myimage ` as an ordinary command within the Mesos
> container, or would I have to dig very deep into Mesos to find the right
> parameters to pass to nvidia-docker?
>
> Thanks
> Tobias
>
> On Thu, Nov 3, 2016 at 10:18 AM, Tobias Pfeiffer  wrote:
>
>> Hi,
>>
>> I asked this question also yesterday in the #mesos channel on IRC, but I
>> guess due to timezone differences there were not many people awake and/or
>> working, sorry for reposting. (Maybe someone answered after I left, but it
>> seems that the IRC bot is only archiving channel joins/leaves? ->
>> http://wilderness.apache.org/channels/?f=apache-syncope/2016-11-02)
>>
>> My question is about the Mesos containerizer. I want to run code using
>> the Mesos GPU support and the docs state that this is currently only
>> supported by the Mesos containerizer. So my understanding of using the
>> Mesos containerizer with Docker images is that
>> - the content of the Docker images is unpacked to the filesystem (using
>> one of the provisioner backends, such as "copy" or "overlay")
>> - the user's command is executed in a chroot in that directory.
>> Is that correct?
>>
>> The first thing I noticed is (besides a much higher latency due to the
>> image provisioning process) that `ps aux` and `hostname` expose details of
>> the host system, so I was wondering about the level of isolation that I can
>> achieve with the Mesos containerizer, as opposed to running in a Docker
>> container. In particular:
>> - Is it possible to hide host processes from the container?
>> - Is it possible to run processes that open network ports (possibly
>> already open on the host system) and have them mapped to different ports on
>> the host system, just as with Docker's `-p`?
>> - I have a USER directive in my Dockerfile in order for the CMD to be
>> executed as that user, but that does not seem to be supported (yet?) by the
>> Docker image provider. Is there any method (except `sudo`/`setuser`) to
>> achieve running as a user present in the image's /etc/fstab?
>> - I may have to run untrusted code, so can I make sure that users cannot
>> break out of the chroot? What about UID namespacing, so that root in the
>> chroot does not become root on the host system when breaking out?
>>
>> Thanks for your help
>> Tobias
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: On Mesos versioning and deprecation policy

2016-10-29 Thread haosdent

+1 for the summary. Now it is clear to me.

On Sat, Oct 29, 2016 at 6:45 AM, Vinod Kone  wrote:

> We had an extended discussion around this in the last community sync.
> Thanks for those who participated!
>
> To sum up the discussion:
>
> --> As mesos devs, we should strive to not make incompatible changes in
> APIs, flags, environment variables.
>
> --> In the rare case where an incompatible change is preferred (e.g., code
> complexity), we should give users a clear 6 months heads up that a
> breaking change is going to take place.
>
> --> Breaking changes do not necessitate a major version bump. This is
> because we want to allow live upgrades between major versions (e.g., 1.10
> to 2.0).
>
> --> Compatibility guarantees do not apply to experimental features (incl.
> APIs).
>
> --> We need to have clear documentation about procedure that devs could
> follow when deprecating/removing stable features and adding experimental
> features.
>
> --> We need to improve upgrades.md to make it easy for operators to know
> what features are deprecated/removed between versions X and Y.
>
> --> We should decouple internal protos used by Mesos from the unversioned
> protos used by driver based frameworks.
>
> I will spend some time in the next few weeks to create/update the
> documentation reflecting these points.
>
> Anything else I missed?
>
> Thanks,
>
> On Sat, Oct 15, 2016 at 11:47 AM, haosdent  wrote:
>
> > Thanks for @yan's great input! I agree with almost all of it.
> >
> > > Also the API is not just what the machine reads but all the
> > > documentation associated with it, right? It depends on what the
> > > documentation says; what the user _should_ expect.
> >
> > I think different users may have different expectations, and the person
> > who developed an API may understand it differently from some users as
> > well. Our documentation should cover most cases.
> >
> > But where we didn't, or forgot to, write something down explicitly in
> > the documentation, should we give up on updating the API? For example,
> > user Alice says this is a bug while user Bob says it is a feature. I
> > think we still need to raise such changes case by case to ensure most
> > users are not affected by breaking API changes.
> >
> > On Sat, Oct 15, 2016 at 6:55 AM, Vinod Kone 
> wrote:
> >
> > > We will chat about this in the upcoming community sync (thursday 3 PM).
> > > So, please make sure to attend if you are interested.
> > >
> > > On Fri, Oct 14, 2016 at 3:44 PM, Yan Xu  wrote:
> > >
> > >>
> > >> On Fri, Oct 14, 2016 at 3:37 PM, Yan Xu  wrote:
> > >>
> > >>> Thanks Alex for starting this!
> > >>>
> > >>> In addition to comments below, I think it'll be helpful to keep the
> > >>> existing versioning doc concise and user-friendly while having a
> > >>> dedicated doc for the "implementation details" where precise
> > >>> requirements and procedures go. Maybe some
> > >>> duplication/cross-referencing is needed but Mesos developers will
> > >>> find the latter much more helpful while the users/framework
> > >>> developers will find the former easy to read.
> > >>>
> > >>> e.g., a similar split:
> > >>> https://github.com/kubernetes/kubernetes/blob/master/docs/api.md
> > >>> https://github.com/kubernetes/kubernetes/blob/master/docs/devel/api_changes.md
> > >>> (which has a lot of details on how the kubernetes community is
> > >>> thinking about similar issues, which we can learn from)
> > >>>
> > >>> Jiang Yan Xu 
> > >>>
> > >>> On Wed, Oct 12, 2016 at 9:34 AM, Alex Rukletsov wrote:
> > >>>
> > >>>> Folks,
> > >>>>
> > >>>> There have been a bunch of online [1, 2] and offline discussions
> > >>>> about our deprecation and versioning policy. I found that people,
> > >>>> including myself, read the versioning doc [3] differently; moreover,
> > >>>> some aspects are not captured there. I would like to start a
> > >>>> discussion around this topic by sharing my confusions and
> > >>>> suggestions. This will hopefully help us stay on the same page and
> > >>>> have similar expectations

Re: Question about mesos cgroups on centos 7.1

2016-10-28 Thread haosdent

Hi, @Jeff. I'll try to write down my understanding of cgroups in Mesos.
Some parts may be incorrect, so feel free to correct them. :-)

Actually, the hierarchy you mentioned is not the v2 hierarchy (the v2
hierarchy is a totally different structure). It is the hierarchy layout
proposed by systemd, and it only exists on Ubuntu 16.04 and CentOS 7
(RHEL 7), which use systemd as the default init system. Mesos also supports
Ubuntu 14.04 and CentOS 6, where xx.slice does not exist. I believe this is
why Docker uses

```
/sys/fs/cgroup/{resource}/docker
```

(Yes, this cgroup is created by the Docker daemon, not by Mesos.)

Even supposing Mesos only supported CentOS 7.1, where systemd is installed
by default, I think we still should not put the cgroups root hierarchy under
`/sys/fs/cgroup/{resource}/user.slice/`, because `user.slice` is only used
for user sessions. According to
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Resource_Management_Guide/sec-Default_Cgroup_Hierarchies.html,
services should maintain their root hierarchy under
`/sys/fs/cgroup/{resource}` rather than inside the cgroup hierarchy for
user sessions.

But if you want to put it under `/sys/fs/cgroup/{resource}/user.slice`,
the Mesos agent provides a `--cgroups_root` flag to specify the cgroup root.
You could pass `--cgroups_root="user.slice/mesos"` when launching the Mesos
agent to achieve that.
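
To show the effect of that flag, here is a small sketch in Python. It
assumes container cgroups are laid out as
`<cgroups_hierarchy>/<subsystem>/<cgroups_root>/<container id>`; the helper
`container_cgroup` and the container id are made up for illustration:

```python
import posixpath


def container_cgroup(subsystem, container_id,
                     cgroups_hierarchy="/sys/fs/cgroup",
                     cgroups_root="mesos"):
    """Return the cgroup directory a container would get (assumed layout)."""
    return posixpath.join(cgroups_hierarchy, subsystem, cgroups_root,
                          container_id)


if __name__ == "__main__":
    # With the default root.
    print(container_cgroup("cpu", "example-container-id"))
    # With --cgroups_root="user.slice/mesos".
    print(container_cgroup("cpu", "example-container-id",
                           cgroups_root="user.slice/mesos"))
```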

On Fri, Oct 28, 2016 at 2:30 AM, Jeff Kubina  wrote:

> I am running Mesos 0.28.1 on Centos 7.1 using --isolation=cgroups and
> noticed that mesos creates the cgroups:
>
> /sys/fs/cgroup/{resource}/docker
> /sys/fs/cgroup/{resource}/mesos
> /sys/fs/cgroup/{resource}/mesos_executors.slice
>
> Shouldn't these cgroups be in /sys/fs/cgroup/{resource}/
> user.slice/{docker,mesos,mesos_executors.slice} to ensure a more fair
> allocation of the cgroup resources possible? Or is there a way to configure
> that to happen?
>
> The v2-cgroup hierarchy is such that:
>
> 1) /sys/fs/cgroup, is for kernel processes
>
> 2) /sys/fs/cgroup/system.slice is for systemd processes, which contains
> docker.service and mesos-slave.service, and
>
> 3) /sys/fs/cgroup/user.slice is for all other processes.
>
> Having {docker,mesos,mesos_executors.slice} in user.slice would enable
> finer grain control of the resources across the mesos and user processes.
>
>
>
>

-- 
Best Regards,
Haosdent Huang

Re: default docker stop timeout

2016-10-26 Thread haosdent

The default is kept for compatibility with the old behavior. Before
https://issues.apache.org/jira/browse/MESOS-1925, Mesos only supported using
`docker kill` to stop Docker containers. You can specify
`--docker_stop_timeout` if you want Docker containers to exit gracefully.
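
The configuration page quoted below suggests the task's kill policy as the
replacement for this flag. Here is a minimal sketch in Python of attaching
one to a task in JSON form, assuming the `kill_policy.grace_period.nanoseconds`
field layout from mesos.proto. The helper `with_kill_policy` and the task
name are made up:

```python
def with_kill_policy(task_info, grace_seconds):
    """Return a copy of a TaskInfo dict with a kill grace period set."""
    task = dict(task_info)
    task["kill_policy"] = {
        # The grace period is expressed in nanoseconds.
        "grace_period": {"nanoseconds": int(grace_seconds * 1_000_000_000)}
    }
    return task


if __name__ == "__main__":
    # Give the task 10 seconds to shut down before it is killed.
    print(with_kill_policy({"name": "web"}, 10))
```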

On Wed, Oct 26, 2016 at 4:33 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> what's the reason the docker stop timeout is set to 0?
>
> http://mesos.apache.org/documentation/latest/configuration/
> --docker_stop_timeout=VALUE
>   The time docker daemon waits after stopping a container before killing
>   that container. This flag is deprecated; use task's kill policy
>   instead. (default: 0ns)
>
> Without changing this containers can not be gracefully stopped. Wouldn't a
> value of say 10 seconds be better?
>
> thanks,
> Hendrik
>

-- 
Best Regards,
Haosdent Huang

Re: Getting files from a container after a task?

2016-10-20 Thread haosdent

Oh, sorry for the misleading answer. As far as I know, Mesos doesn't provide
this, and you need to upload the data from within your tasks. I don't think
you need to implement a custom executor for this.

Just do something like this in your command:

```
#!/usr/bin/env bash
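# 1. Run the task; 2. upload the data only if the task succeeded.
# Both script names are placeholders for your own commands.
./run_task.sh && ./upload_results.sh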
```
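
The same idea as a minimal Python sketch, assuming the task writes its
results into a directory. `run_then_archive` and its arguments are
hypothetical names, and the actual upload (HDFS, S3, an HTTP server, ...) is
left to the caller:

```python
import shutil
import subprocess


def run_then_archive(command, output_dir, archive_base):
    """Run command; on success, zip output_dir and return the archive path."""
    result = subprocess.run(command)
    if result.returncode != 0:
        return None  # The task failed, so there is nothing to upload.
    # make_archive appends ".zip" to archive_base and returns the full path.
    return shutil.make_archive(archive_base, "zip", root_dir=output_dir)
```

The returned path can then be handed to whatever upload tool you already
use.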

On Fri, Oct 21, 2016 at 9:55 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> On Fri, Oct 21, 2016 at 10:37 AM, haosdent  wrote:
>
>> Hi, @Mark You may try set `CommandInfo.URI`. Then Mesos would download
>> files from given URL before launch your tasks.
>>
>
> I think Mark asked for the opposite, uploading stuff after task
> completion. I would in fact also be very much interested in that
> functionality, something like "zip that directory and upload it to
> somewhere".
>
> Tobias
>
>
>
>

-- 
Best Regards,
Haosdent Huang

Re: Getting files from a container after a task?

2016-10-20 Thread haosdent

Hi, @Mark. You may try setting `CommandInfo.URI`. Mesos will then download
the files from the given URLs before launching your tasks.
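
A minimal sketch in Python of what such a CommandInfo looks like in JSON
form, assuming the `uris`, `value` and `extract` field names from
mesos.proto. The helper `command_with_uris` and the example URL are made up:

```python
def command_with_uris(shell_command, urls, extract=True):
    """Build a CommandInfo dict whose URIs are fetched before the task runs."""
    return {
        "value": shell_command,
        "uris": [{"value": url, "extract": extract} for url in urls],
    }


if __name__ == "__main__":
    print(command_with_uris("./process input.dat",
                            ["http://example.com/input.tar.gz"]))
```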

On Fri, Oct 21, 2016 at 12:42 AM, Mark Hammons <
mark.hamm...@inaf.cnrs-gif.fr> wrote:

> Hi all,
>
> Mesos provides the functionality to send data at the end of a task, but is
> there any way to send large files? Something like downloading files with
> CommandInfo, but for the results of a task. I'm currently getting this
> behavior by having a custom executor that downloads and uploads the data,
> but I'd rather not have to have this.
> 
> Mark Edgar Hammons II | +33 06 03 69 56 56




-- 
Best Regards,
Haosdent Huang

Re: Performance regression in v1 api vs v0

2016-10-16 Thread haosdent

Hmm, this is an interesting topic. @anandmazumdar created a benchmark test
case earlier to compare the v1 and v0 APIs. You can run it via
```
./bin/mesos-tests.sh --benchmark \
  --gtest_filter="*SchedulerReconcileTasks_BENCHMARK_Test*"
```

Here is the result of running it on my machine.

```
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0
Reconciling 1000 tasks took 386.451108ms using the scheduler library
[   OK ]
Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (479 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1
Reconciling 1 tasks took 3.389258444secs using the scheduler library
[   OK ]
Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (3435 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2
Reconciling 5 tasks took 16.624603964secs using the scheduler library
[   OK ]
Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (16737 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3
Reconciling 10 tasks took 33.134018718secs using the scheduler library
[   OK ]
Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/3 (3 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0
Reconciling 1000 tasks took 24.212092ms using the scheduler driver
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/0
(89 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1
Reconciling 1 tasks took 316.115078ms using the scheduler driver
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/1
(385 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2
Reconciling 5 tasks took 1.239050154secs using the scheduler driver
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/2
(1379 ms)
[ RUN  ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3
Reconciling 10 tasks took 2.38445672secs using the scheduler driver
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerDriver/3
(2711 ms)
```

*SchedulerLibrary* is the HTTP API, *SchedulerDriver* is the old way based
on libmesos.so.
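
As a quick worked comparison, the two 1000-task timings above can be turned
into a ratio. This is plain arithmetic on the numbers from my run, not a
general claim about the APIs, and the helper name `slowdown` is made up:

```python
def slowdown(library_ms, driver_ms):
    """How many times longer the scheduler library took than the driver."""
    return library_ms / driver_ms


if __name__ == "__main__":
    # 1000 tasks: 386.451108ms (scheduler library) vs 24.212092ms (driver).
    print(round(slowdown(386.451108, 24.212092), 1))
```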

On Sun, Oct 16, 2016 at 2:41 PM, Dario Rexin  wrote:

> Hi all,
>
> I recently did some performance testing on the v1 scheduler API and found
> that throughput is around 10x lower than for the v0 API. Using 1
> connection, I don’t get a lot more than 1,500 calls per second, where the
> v0 API can do ~15,000. If I use multiple connections, throughput maxes out
> at 3 connections and ~2,500 calls / s. If I add any more connections, the
> throughput per connection drops and the total throughput stays around
> ~2,500 calls / s. Has anyone done performance testing on the v1 API before?
> It seems a little strange to me, that it’s so much slower, given that the
> v0 API also uses HTTP (well, more or less). I would be thankful for any
> comments and experience reports of other users.
>
> Thanks,
> Dario
>
>


-- 
Best Regards,
Haosdent Huang

Re: On Mesos versioning and deprecation policy

2016-10-15 Thread haosdent

xample, what if we decide to send less task health
>>>> updates to
>>>> schedulers based on some health policy? It influences the flow of task
>>>> status updates, should such change be considered compatible? Taking it
>>>> to
>>>> an extreme, we may not even be able to fix some bugs because someone may
>>>> already rely on this behaviour!
>>>>
>>>
>>> API changes should warrant a major version bump. Also the API is not
>>> just what the machine reads but all the documentation associated with it,
>>> right? It depends on what the documentation says; what the user _should_
>>> expect.
>>>
>>> That said, I feel that these things are hard to be talked about in the
>>> abstract. Even with a guideline, we still need to make case-by-case
>>> decisions. (e.g., has the documentation precisely defined this precise
>>> behavior? If not, is it reasonable for the users to expect some behavior
>>> because it's common sense? How bad is it if some behavior just changes a
>>> tiny bit?) Therefore we need to make sure the process for API changes are
>>> more rigorously defined.
>>>
>>> Whether something is a bug depends on whether the API does what it says
>>> it'll do. The line may sometimes be blurry but in general I don't feel it's
>>> a problem. If someone is relying on the behavior that is a bug, we should
>>> still help them fix it but the bug shouldn't count as "our guarantee".
>>>
>>>
>>>>
>>>> Another tightly related thing we should explicitly call out is
>>>> upgradability and rollback capabilities inside a major release.
>>>> Committing
>>>> to this may significantly limit what we can change within a major
>>>> release;
>>>> on the other side it will give users more time and a better experience
>>>> about using and maintaining Mesos clusters.
>>>>
>>>
>>> According to the versioning doc upgradability depends on whether you
>>> depend on deprecated/removed features.
>>>
>>> That paragraph should be explained more precisely:
>>> - "deprecated" means your system won't break but warnings are shown
>>> (Maybe we should use some standard deprecation warning keywords so the
>>> operator can monitor the log for such warnings!)
>>> - "removed" means it may break.
>>>
>>> If you deprecate a flag/env that interface with operator tooling in the
>>> next minor release, the operator basically has 6 months from the next minor
>>> release to change her tooling. I feel this is pretty acceptable.
>>> If you deprecate a flag/env variable that interface with the framework
>>> (executor) in the next minor release, I feel it may not be enough and it
>>> probably warrants a major version bump. So perhaps the API shouldn't be
>>> just the protos.
>>>
>>>
>>>> 2. Versioned vs. unversioned protobufs.
>>>> Currently we have v1 and unnamed protobufs, which simultaneously mean
>>>> v0,
>>>> v2, and internal. I am sometimes confused about what is the right way to
>>>> update or introduce a field or message there, do people feel the same?
>>>> How
>>>> about splitting the unnamed version into explicit v0, v2, and internal?
>>>>
>>>
>>> As haosdent mentioned, we have captured this in MESOS-6268. The benefit
>>> is clear but I guess the people will be more motivated when we find some v2
>>> feature can't be made compatible with the v0 API. (Anand's point
>>> in MESOS-6016). On the other hand, if we cut v0 API access before that
>>> happens (is v0 API obsolete and should be removed 6 months after 1.0?) then
>>> we don't need to worry about v0 and can use unversioned protos as
>>> "internal"?
>>>
>>>
>>>> Food for thought. It would be great if we can only maintain "diffs" to
>>>> the
>>>> internal protobufs in the code, instead of duplicating them altogether.
>>>>
>>>> 3. API and feature labelling.
>>>> I suggest to introduce explicit labels for API and features, to ensure
>>>> users have the right assumptions about the their lifetime while
>>>> engineers
>>>> have the ability to change a wip feature in an non-compatible way. I
>>>> propose the following:
>>>> API: stable, non-stable, pure (not used by Mesos components)
>>>> Feature: experimental, normal.
>>>>
>>>
>>>  +1 on formalizing the terminologies.
>>>
>>> Historically the distinction is not clear for the following:
>>>
>>> 1. The API has no compatibility guarantee at all.
>>> 2. The feature provided by this API is experimental
>>>
>>
>> To add to this point: because 2) logically doesn't apply to the "pure
>> (not used by Mesos components)" fields in the API, it could be more
>> confusing and thus require more precise definition.
>>
>>
>>>
>>> IMO It's OK that we say that we don't distinguish the two (the API has
>>> no compatibility guarantee until the feature is fully released) but we have
>>> to make it clear.
>>> If we don't make such distinction, ALL API additions should be marked as
>>> unstable first and be changed stable later (as a formal process).
>>>
>>>
>>>>
>>>> Looking forward to your thoughts and suggestions.
>>>> AlexR
>>>>
>>>> [1] https://www.mail-archive.com/user@mesos.apache.org/msg08025.html
>>>> [2] https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html
>>>> [3]
>>>> https://github.com/apache/mesos/blob/b2beef37f6f85a8c75e968136caa7a1f292ba20e/docs/versioning.md
>>>>
>>>
>>>
>>
>


-- 
Best Regards,
Haosdent Huang

Re: On Mesos versioning and deprecation policy

2016-10-13 Thread haosdent

>How about splitting the unnamed version into explicit v0, v2, and internal?

Currently our internal protobufs and the v0 protobufs are the same unnamed
version of the protobufs, under the same namespace (`package mesos`).
Splitting v0 and internal would require copying all protobuf files under
`package mesos` into `package mesos.internal` and changing the whole code
base to use the protobufs in `package mesos.internal`.
But it would be beneficial to do this, because we could then avoid
[the hacks][1] that convert from the unversioned protobuf (v0) to the
unversioned protobuf (internal).

[1]
https://github.com/apache/mesos/blob/fa976c22ac66ff5c905157a5a36bda1d21525b32/src/master/master.cpp#L4077-L4108

On Thu, Oct 13, 2016 at 12:34 AM, Alex Rukletsov 
wrote:

> Folks,
>
> There have been a bunch of online [1, 2] and offline discussions about our
> deprecation and versioning policy. I found that people—including
> myself—read the versioning doc [3] differently; moreover some aspects are
> not captured there. I would like to start a discussion around this topic by
> sharing my confusions and suggestions. This will hopefully help us stay on
> the same page and have similar expectations. The second goal is to
> eliminate ambiguities from the versioning doc (thanks Vinod for
> volunteering to update it).
>
> 1. API vs. semantic changes.
> The current versioning guide treats features (e.g. flags, metrics, endpoints)
> and the API differently: incompatible changes for the former are allowed after
> a 6-month deprecation cycle, while for the latter they require bumping a
> major version. I suggest we consolidate these policies.
>
> We should also define and clearly explain what changes require bumping the
> major version. I have no strong opinion here and would love to hear what
> people think. The original motivation for maintaining backwards
> compatibility is to make sure vN schedulers can correctly work with vN API
> without being updated. But what about semantic changes that do not touch
> the API? For example, what if we decide to send less task health updates to
> schedulers based on some health policy? It influences the flow of task
> status updates, should such change be considered compatible? Taking it to
> an extreme, we may not even be able to fix some bugs because someone may
> already rely on this behaviour!
>
> Another tightly related thing we should explicitly call out is
> upgradability and rollback capabilities inside a major release. Committing
> to this may significantly limit what we can change within a major release;
> on the other side it will give users more time and a better experience
> about using and maintaining Mesos clusters.
>
> 2. Versioned vs. unversioned protobufs.
> Currently we have v1 and unnamed protobufs, which simultaneously mean v0,
> v2, and internal. I am sometimes confused about what is the right way to
> update or introduce a field or message there, do people feel the same? How
> about splitting the unnamed version into explicit v0, v2, and internal?
>
> Food for thought. It would be great if we can only maintain "diffs" to the
> internal protobufs in the code, instead of duplicating them altogether.
>
> 3. API and feature labelling.
> I suggest introducing explicit labels for APIs and features, to ensure
> users have the right assumptions about their lifetime while engineers
> have the ability to change a WIP feature in a non-compatible way. I
> propose the following:
> API: stable, non-stable, pure (not used by Mesos components)
> Feature: experimental, normal.
>
> Looking forward to your thoughts and suggestions.
> AlexR
>
> [1] https://www.mail-archive.com/user@mesos.apache.org/msg08025.html
> [2] https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html
> [3]
> https://github.com/apache/mesos/blob/b2beef37f6f85a8c75e968136caa7a1f292ba20e/docs/versioning.md
>



-- 
Best Regards,
Haosdent Huang

Re: How to shutdown mesos-agent gracefully?

2016-10-10 Thread haosdent

Does "gracefully" mean without affecting running tasks?

On Tue, Oct 11, 2016 at 2:36 PM, Klaus Ma  wrote:

> It seems there's not a way to shutdown mesos-agent gracefully.
> Maintenance feature expect the agents re-register back in the future.
>
> Thanks
> Klaus
> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Software Architect
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>



-- 
Best Regards,
Haosdent Huang

Re: Welcome Qian Zhang as a new committer!

2016-10-08 Thread haosdent

Congrats to Qian!!! Looking forward to working more with you in the community!

On Sun, Oct 9, 2016 at 2:20 AM, Jie Yu  wrote:

> Hi folks,
>
> I'm happy to announce that the PMC has voted Qian Zhang as a new committer and
> member of PMC for the Apache Mesos project. Please join me to congratulate
> him!
>
> A little more about Qian Zhang:
>
> Qian Zhang has been working on the Apache Mesos project for about a year
> now. He designed and implemented the CNI
> <https://github.com/containernetworking/cni> (Container Network
> Interface) support in Mesos with Avinash, which standardized the networking
> integration in Mesos. He also worked with haosdent on the unified cgroups
> isolator <https://issues.apache.org/jira/browse/MESOS-4697>, which
> greatly simplifies the original cgroups support in Mesos and makes
> extension to new subsystems so much easier. He was also involved in
> discussions on quotas and pods, and provided valuable feedback. He is
> currently working on OCI <https://github.com/opencontainers/image-spec> 
> support
> in Mesos, trying to enable Mesos to launch OCI containers.
>
> More details can be found in his committer candidate checklist
> <https://docs.google.com/document/d/1p5MyCoWhZC2sAsQwsbSQUaCiAi1-7mjWXei9D9SaYUU/edit?usp=sharing>
> .
>
> Qian, thank you for your great work to the project so far. Would love to
> see more!
>
> - Jie
>



-- 
Best Regards,
Haosdent Huang

Re: Resource Isolation in Mesos

2016-10-06 Thread haosdent

I checked with @Srikant via Hangouts. It looks like the Linux cgroups
`memory.stat` is incorrect after the cgroup is `chown`ed to a normal user.
I will continue to follow up and verify whether it is a bug in the Mesos
cgroups support once @Srikant has test results from a new machine.
Thanks a lot to @Srikant for the great help!

On Thu, Oct 6, 2016 at 8:17 PM, Srikant Kalani 
wrote:

> Thanks for the detail steps.
>
> We are also using same flags .
>
> Today we ran our task twice. First with the root I'd and it was working
> fine and we were able to implement cgroups .UI was working as expected.
>
> But second time when we ran same task with application I'd cgroup didn't
> work. Memory.stat file provided in your email dont have rss updated value.
>
> Do I need to use any other flags in agent so that non root I'd can also
> follow cgroups.
> On 5 Oct 2016 10:40 p.m., "haosdent"  wrote:
>
>> > These flags are used in agent - cgroups_limits_swap=true
>> --isolation=cgroups/cpu,cgroups/mem --cgroups_hierachy=/sys/fs/c group
>> In agent logs I can see updated memory limit to 33MB for container.
>>
>> Not sure if there are typos or not, some flags name may incorrect. Add
>> according to
>>
>> > "mem_limit_bytes": 1107296256,
>>
>> I think mesos allocated 1107296256 bytes memory (1GB) to your task
>> instead of 33 MB.
>>
>> For the status of `mem_rss_bytes` is zero, let me describe how I test it
>> on my machine, maybe helpful for you to troubleshoot the problem.
>>
>> ```
>> ## Start the master
>> sudo ./bin/mesos-master.sh --ip=111.223.45.25 --hostname=111.223.45.25
>> --work_dir=/tmp/mesos
>> ## Start the agent
>> sudo ./bin/mesos-agent.sh --ip=111.223.45.25 --hostname=111.223.45.25
>> --work_dir=/tmp/mesos --master=111.223.45.25:5050
>> --cgroups_hierarchy=/sys/fs/cgroup --isolation=cgroups/cpu,cgroups/mem
>> --cgroups_limit_swap=true
>> ## Start the task
>> ./src/mesos-execute --master=111.223.45.25:5050 --name="test-single-1"
>> --command="sleep 2000"
>> ```
>>
>> Then query the `/containers` endpoint to get the container id of the task
>>
>> ```
>> $ curl 'http://111.223.45.25:5051/containers' 2>/dev/null |jq .
>> [
>>   {
>> "container_id": "74fea157-100f-4bf8-b0d0-b65c6e17def1",
>> "executor_id": "test-single-1",
>> "executor_name": "Command Executor (Task: test-single-1) (Command: sh
>> -c 'sleep 2000')",
>> "framework_id": "db9f43ce-0361-4c65-b42f-4dbbefa75ff8-",
>> "source": "test-single-1",
>> "statistics": {
>>   "cpus_limit": 1.1,
>>   "cpus_system_time_secs": 3.69,
>>   "cpus_user_time_secs": 3.1,
>>   "mem_anon_bytes": 9940992,
>>   "mem_cache_bytes": 8192,
>>   "mem_critical_pressure_counter": 0,
>>   "mem_file_bytes": 8192,
>>   "mem_limit_bytes": 167772160,
>>   "mem_low_pressure_counter": 0,
>>   "mem_mapped_file_bytes": 0,
>>   "mem_medium_pressure_counter": 0,
>>   "mem_rss_bytes": 9940992,
>>   "mem_swap_bytes": 0,
>>   "mem_total_bytes": 10076160,
>>   "mem_total_memsw_bytes": 10076160,
>>   "mem_unevictable_bytes": 0,
>>   "timestamp": 1475686847.54635
>> },
>> "status": {
>>   "executor_pid": 2775
>> }
>>   }
>> ]
>> ```
>>
>> As you see above, the container id is `74fea157-100f-4bf8-b0d0-b65c6e17def1`,
>> so I
>>
>> ```
>> $ cat /sys/fs/cgroup/memory/mesos/74fea157-100f-4bf8-b0d0-b65c6e17
>> def1/memory.stat
>> ```
>>
>> Mesos get the memory statistics from this file for the task. `total_rss`
>> would be parsed as the `"mem_rss_bytes"` field.
>>
>> ```
>> ...
>> hierarchical_memory_limit 167772160
>> hierarchical_memsw_limit 167772160
>> total_rss 9940992
>> ...
>> ```
>>
>> You could check which step above is mismatch with your side and reply
>> this email for future discussion, the problem seems to be the
>> incorrect configuration or launch flags.
>>
>> On Wed, Oct 5, 2016 at 8:46 PM, Srikant Kalani <
>> srikant.blackr...@gmail.com> wrote:
>>
>>> What i can see in http output is mem_rss_bytes is not comi

Re: Resource Isolation in Mesos

2016-10-05 Thread haosdent

> These flags are used in agent - cgroups_limits_swap=true
--isolation=cgroups/cpu,cgroups/mem --cgroups_hierachy=/sys/fs/c group
In agent logs I can see updated memory limit to 33MB for container.

I'm not sure whether these are typos, but some of the flag names may be
incorrect. In addition, according to

> "mem_limit_bytes": 1107296256,

I think Mesos allocated 1107296256 bytes of memory (about 1 GB) to your task
instead of 33 MB.

As for `mem_rss_bytes` being zero, let me describe how I tested it on my
machine; it may help you troubleshoot the problem.

```
## Start the master
sudo ./bin/mesos-master.sh --ip=111.223.45.25 --hostname=111.223.45.25
--work_dir=/tmp/mesos
## Start the agent
sudo ./bin/mesos-agent.sh --ip=111.223.45.25 --hostname=111.223.45.25
--work_dir=/tmp/mesos --master=111.223.45.25:5050
--cgroups_hierarchy=/sys/fs/cgroup --isolation=cgroups/cpu,cgroups/mem
--cgroups_limit_swap=true
## Start the task
./src/mesos-execute --master=111.223.45.25:5050 --name="test-single-1"
--command="sleep 2000"
```

Then query the `/containers` endpoint to get the container id of the task

```
$ curl 'http://111.223.45.25:5051/containers' 2>/dev/null |jq .
[
  {
    "container_id": "74fea157-100f-4bf8-b0d0-b65c6e17def1",
    "executor_id": "test-single-1",
    "executor_name": "Command Executor (Task: test-single-1) (Command: sh -c 'sleep 2000')",
    "framework_id": "db9f43ce-0361-4c65-b42f-4dbbefa75ff8-",
    "source": "test-single-1",
    "statistics": {
      "cpus_limit": 1.1,
      "cpus_system_time_secs": 3.69,
      "cpus_user_time_secs": 3.1,
      "mem_anon_bytes": 9940992,
      "mem_cache_bytes": 8192,
      "mem_critical_pressure_counter": 0,
      "mem_file_bytes": 8192,
      "mem_limit_bytes": 167772160,
      "mem_low_pressure_counter": 0,
      "mem_mapped_file_bytes": 0,
      "mem_medium_pressure_counter": 0,
      "mem_rss_bytes": 9940992,
      "mem_swap_bytes": 0,
      "mem_total_bytes": 10076160,
      "mem_total_memsw_bytes": 10076160,
      "mem_unevictable_bytes": 0,
      "timestamp": 1475686847.54635
    },
    "status": {
      "executor_pid": 2775
    }
  }
]
```
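
If you prefer not to read the JSON by eye, the fields that matter here can be pulled out with a few lines of Python. This is only a sketch: `SAMPLE_RESPONSE` is a trimmed copy of the output above (in practice you would feed in the body returned by the `/containers` endpoint), and `summarize_containers` is a helper name I made up.

```python
import json

# Trimmed copy of the /containers response shown above; only the fields
# used by this sketch are kept.
SAMPLE_RESPONSE = """
[
  {
    "container_id": "74fea157-100f-4bf8-b0d0-b65c6e17def1",
    "executor_id": "test-single-1",
    "statistics": {
      "mem_limit_bytes": 167772160,
      "mem_rss_bytes": 9940992
    }
  }
]
"""

MIB = 1024 * 1024


def summarize_containers(response_text):
    """Return (container_id, rss_mib, limit_mib) for each container."""
    summary = []
    for container in json.loads(response_text):
        stats = container.get("statistics", {})
        summary.append((
            container["container_id"],
            stats.get("mem_rss_bytes", 0) / MIB,
            stats.get("mem_limit_bytes", 0) / MIB,
        ))
    return summary


for container_id, rss_mib, limit_mib in summarize_containers(SAMPLE_RESPONSE):
    print(f"{container_id}: rss={rss_mib:.1f} MiB, limit={limit_mib:.0f} MiB")
```

A `mem_rss_bytes` of zero for a task that is clearly using memory, as in the RHEL 7 output quoted below, is the symptom to look for.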

As you can see above, the container id is
`74fea157-100f-4bf8-b0d0-b65c6e17def1`, so I read its `memory.stat` file:

```
$ cat /sys/fs/cgroup/memory/mesos/74fea157-100f-4bf8-b0d0-b65c6e17def1/memory.stat
```

Mesos gets the memory statistics for the task from this file. `total_rss`
is parsed into the `"mem_rss_bytes"` field.

```
...
hierarchical_memory_limit 167772160
hierarchical_memsw_limit 167772160
total_rss 9940992
...
```
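
To compare the file with what the agent reports, the same lookup can be done by hand. The sketch below is not the Mesos implementation, just a minimal parser for the `<name> <value>` lines of a cgroup v1 `memory.stat`, run against the three lines shown above; `parse_memory_stat` is a name I made up.

```python
def parse_memory_stat(text):
    """Parse `memory.stat` content into a dict of counter name -> integer."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not "<name> <integer>",
        # such as the "..." elisions in the sample above.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


sample = """\
hierarchical_memory_limit 167772160
hierarchical_memsw_limit 167772160
total_rss 9940992
"""

counters = parse_memory_stat(sample)
print("total_rss:", counters.get("total_rss"))
```

If `total_rss` in the file is non-zero but `mem_rss_bytes` from the endpoint is zero (or the other way round), that narrows down where the mismatch is.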

You could check which of the steps above does not match what you see on your
side and reply to this email for further discussion. The problem seems to be
an incorrect configuration or incorrect launch flags.

On Wed, Oct 5, 2016 at 8:46 PM, Srikant Kalani 
wrote:

> What i can see in http output is mem_rss_bytes is not coming on rhel7.
>
> Here is the http output :
>
> Output for Agent running on rhel7
>
> [{"container_id":"8062e683-204c-40c2-87ae-fcc2c3f71b85",
> "executor_id":"*****",
> "executor_name":"Command Executor (Task: *****) (Command: sh -c '\******...')",
> "framework_id":"edbffd6d-b274-4cb1-b386-2362ed2af517-",
> "source":"*****",
> "statistics":{"cpus_limit":1.1,"cpus_system_time_secs":0.01,
> "cpus_user_time_secs":0.03,"mem_anon_bytes":0,"mem_cache_bytes":0,
> "mem_critical_pressure_counter":0,"mem_file_bytes":0,
> "mem_limit_bytes":1107296256,"mem_low_pressure_counter":0,
> "mem_mapped_file_bytes":0,"mem_medium_pressure_counter":0,
> "mem_rss_bytes":0,"mem_swap_bytes":0,"mem_total_bytes":0,
> "mem_unevictable_bytes":0,"timestamp":1475668277.62915},
> "status":{"executor_pid":14454}}]
>
> Output for Agent running on Rhel 6
>
>   [{"container_id":"359c0944-c089-4d43-983e-1f97134fe799",
> "executor_id":"*****",
> "executor_name":"Command Executor (Task: *****) (Command: sh -c '******...')",
> "framework_id":"edbffd6d-b274-4cb1-b386-2362ed2af517-0001",
> "source":"*****",
> "statistics":{"cpus_limit":8.1,"cpus_system_time_secs":1.92,
> "cpus_user_time_secs"

Re: Troubleshooting tasks that are stuck in the 'Staging' state

2016-10-05 Thread haosdent

> How do you typically monitor the messages between Master and Agents?
On my side, I don't monitor this; I only check the logs when
troubleshooting a problem.
I'm not sure whether other users or developers have tools that meet your
requirement here.

On Wed, Oct 5, 2016 at 8:16 PM, Frank Scholten 
wrote:

> Ok. How do you typically monitor the messages between Master and
> Agents? Do you have some tools for this on the cluster?
>
> On Tue, Oct 4, 2016 at 6:21 PM, haosdent  wrote:
> > Hi, @Frank Thanks for your information
> >
> >> I see messages 'Telling agent (...) to kill task (...)'. Why does this
> >> happen?
> > This should because your framework send a `KillTaskMessage` or
> > `scheduler::Call::KILL` request to the Mesos Master, then the Mesos is
> going
> > to kill your task.
> >
> >>Is this the exact text to search for or is this the name of the protobuf
> >> message? Are these logged on a higher log level?
> > it exists in the log of the agents. It looks like
> > ```
> > I1004 23:19:36.175673 45405 slave.cpp:1539] Got assigned task '1' for
> > framework e7287433-36f9-48dd-8633-8a6ac7083a43-
> > I1004 23:19:36.176206 45405 slave.cpp:1696] Launching task '1' for
> framework
> > e7287433-36f9-48dd-8633-8a6ac7083a43-
> > ```
> > Usually, you could grep your task id in the agent log to see how the task
> > failed.
> >
> >
> >
> > On Tue, Oct 4, 2016 at 8:50 PM, Frank Scholten 
> > wrote:
> >>
> >> Thanks Haosdent for your quick response.
> >>
> >> I added GLOG_v=1 to the master and agents.
> >>
> >> 1. The framework is registered. Marathon in this case.
> >> 2. I see messages 'Telling agent (...) to kill task (...)'. Why does
> >> this happen? I also see 'Sending explicit reconciliation state
> >> TASK_LOST for task fake-marathon-pacemaker-task-(...)'.
> >> 3. I searched for RunTaskMessage in the agent log but could not find
> >> it. Is this the exact text to search for or is this the name of the
> >> protobuf message? Are these logged on a higher log level?
> >>
> >> On Tue, Oct 4, 2016 at 11:22 AM, haosdent  wrote:
> >> > staging is the initialize status of the task. I think you may your
> logs
> >> > via
> >> > these steps:
> >> >
> >> > 1. If your framework registered successfully in the master?
> >> > 2. If the master send resources offers to your framework and your
> >> > framework
> >> > accept it?
> >> > 3. If your agents receive the RunTaskMessage from master to launch
> your
> >> > task?
> >> >
> >> > In additionally, use `export GLOG_v=1` before start masters and agents
> >> > may
> >> > helpful for your troubleshooting.
> >> >
> >> > On Tue, Oct 4, 2016 at 4:58 PM, Frank Scholten <
> fr...@frankscholten.nl>
> >> > wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I am looking for some ways to troubleshoot or debug tasks that are
> >> >> stuck in the 'staging' state. Typically they have no logs in the
> >> >> sandbox.
> >> >>
> >> >> Are there are any endpoints or things to look for in logs to identify
> >> >> a root cause?
> >> >>
> >> >> Is there a troubleshooting guide for Mesos to solve problems like
> this?
> >> >>
> >> >> Cheers,
> >> >>
> >> >> Frank
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards,
> >> > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Resource Isolation in Mesos

2016-10-05 Thread haosdent

Hi @Srikant, what is the result of http://${YOUR_AGENT_IP}:5051/containers?
It is weird that you saw

```
Updated 'memory.limit_in_bytes' to xxx
```

in the log, as you mentioned, while `limit_in_bytes` still has the initial
value that you show above.

On Wed, Oct 5, 2016 at 2:04 PM, Srikant Kalani 
wrote:

> Here are the values -
> Memory.limit_in_bytes = 1107296256
> Memory.soft_limit_in_bytes=1107296256
> Memory.memsw.limit_in_bytes=9223372036854775807
>
> I have run the same task on mesos 1.0.1 running on rhel6 and UI then shows
> task memory usage as 2.2G/1.0G where 2.2 is used and 1.0G is allocated but
> since we don't have cgroups their so task are not getting killed.
>
> On rhel7 UI is showing 0B/1.0G for task memory details.
>
> Any idea is this rhel7 fault or do I need to  adjust some configurations ?
> On 4 Oct 2016 21:33, "haosdent"  wrote:
>
>> Hi, @Srikant
>>
>> Hi, @Srikant
>>
>> Usually, your task should be killed when over cgroup limit. Would you
>> enter the `/sys/fs/cgroup/memory/mesos` folder in the agent?
>> Then check the values in `${YOUR_CONTAINER_ID}/memory.limit_in_bytes`,
>>  `${YOUR_CONTAINER_ID}/memory.soft_limit_in_bytes` and
>> `${YOUR_CONTAINER_ID}/memory.memsw.limit_in_bytes` and reply in this
>> email.
>>
>> ${YOUR_CONTAINER_ID} is the container id of your task here, you could
>> find it from the agent log. Or as you said, you only have this one task, so
>> it should only have one directory under `/sys/fs/cgroup/memory/mesos`.
>>
>> Furthermore, would you show the result of 
>> http://${YOUR_AGENT_IP}:5051/containers?
>> It contains some tasks statistics information as well.
>>
>> On Tue, Oct 4, 2016 at 9:00 PM, Srikant Kalani <
>> srikant.blackr...@gmail.com> wrote:
>>
>>> We have upgraded linux from rhel6 to rhel7 and mesos from 0.27 to 1.0.1.
>>> After upgrade we are not able to see memory used by task which was fine
>>> in previous version. Due to this cgroups are not effective.
>>>
>>> Answers to your questions below :
>>>
>>> There is only 1 task running as a appserver which is consuming approx
>>> 20G mem but this info is not coming in Mesos UI.
>>> Swaps are enabled in agent start command.
>>> These flags are used in agent - cgroups_limits_swap=true
>>> --isolation=cgroups/cpu,cgroups/mem --cgroups_hierachy=/sys/fs/c group
>>> In agent logs I can see updated memory limit to 33MB for container.
>>>
>>> Web UI shows the total memory allocated to framework but it is not
>>> showing memory used by task.It always shows 0B/33MB.
>>>
>>> Not sure if this is rhel7 issue or mesos 1.0.1.
>>>
>>> Any suggestions ?
>>> On 26 Sep 2016 21:55, "haosdent"  wrote:
>>>
>>>> Hi, @Srikant May you elaborate
>>>>
>>>> >We have verified using top command that framework was using 2gB
>>>> memory while allocated was just 50 mb.
>>>>
>>>> * How many running tasks in your framework?
>>>> * Do you enable or disable swap in the agents?
>>>> * What's the flags that you launch agents?
>>>> * Have you saw some thing like `Updated 'memory.limit_in_bytes' to ` in
>>>> the log of agent?
>>>>
>>>> On Tue, Sep 27, 2016 at 12:14 AM, Srikant Kalani <
>>>> srikant.blackr...@gmail.com> wrote:
>>>>
>>>>> Hi Greg ,
>>>>>
>>>>> Previously we were running Mesos 0.27 on Rhel6 and since we already
>>>>> have one c group hierarchy for cpu and memory for our production  
>>>>> processes
>>>>> I'd we were not able to merge two c groups hierarchy on rhel6. Slave
>>>>> process was not coming up.
>>>>> Now we have moved  to Rhel7 and both mesos master and slave are
>>>>> running on rhel7 with c group implemented.But we are seeing that mesos UI
>>>>> not showing the actual memory used by framework.
>>>>>
>>>>> Any idea why framework usage of cpu and memory is not coming in UI.
>>>>> Due to this OS is still not killing the task which are consuming more
>>>>> memory than the allocated one.
>>>>> We have verified using top command that framework was using 2gB memory
>>>>> while allocated was just 50 mb.
>>>>>
>>>>> Please suggest.
>>>>> On 8 Sep 2016 01:53, "Greg Mann"  wrote:
>>>>>
>>>>>> Hi Srikan

Re: Troubleshooting tasks that are stuck in the 'Staging' state

2016-10-04 Thread haosdent

Hi @Frank, thanks for the information.

> I see messages 'Telling agent (...) to kill task (...)'. Why does this
happen?
This is most likely because your framework sent a `KillTaskMessage` or a
`scheduler::Call::KILL` request to the Mesos master, and Mesos then goes on
to kill your task.

>Is this the exact text to search for or is this the name of the protobuf
message? Are these logged on a higher log level?
It shows up in the agent log. The entries look like:
```
I1004 23:19:36.175673 45405 slave.cpp:1539] Got assigned task '1' for framework e7287433-36f9-48dd-8633-8a6ac7083a43-
I1004 23:19:36.176206 45405 slave.cpp:1696] Launching task '1' for framework e7287433-36f9-48dd-8633-8a6ac7083a43-
```
Usually, you can grep for your task id in the agent log to see how the task
failed.
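
If plain grep is not convenient, the same filtering can be done in a few lines of Python. This is only a sketch: `lines_for_task` is a helper name I made up, the two sample lines are the ones quoted above, and it assumes the agent logs the task id in single quotes as in those lines.

```python
def lines_for_task(log_lines, task_id):
    """Return the log lines that mention the quoted task id, e.g. task '1'."""
    needle = f"task '{task_id}'"
    return [line for line in log_lines if needle in line]


# The two agent log lines quoted above.
sample_log = [
    "I1004 23:19:36.175673 45405 slave.cpp:1539] Got assigned task '1' for "
    "framework e7287433-36f9-48dd-8633-8a6ac7083a43-",
    "I1004 23:19:36.176206 45405 slave.cpp:1696] Launching task '1' for "
    "framework e7287433-36f9-48dd-8633-8a6ac7083a43-",
]

for line in lines_for_task(sample_log, "1"):
    print(line)
```

If nothing comes back for a task that is stuck in staging, the agent most likely never received the task.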



On Tue, Oct 4, 2016 at 8:50 PM, Frank Scholten 
wrote:

> Thanks Haosdent for your quick response.
>
> I added GLOG_v=1 to the master and agents.
>
> 1. The framework is registered. Marathon in this case.
> 2. I see messages 'Telling agent (...) to kill task (...)'. Why does
> this happen? I also see 'Sending explicit reconciliation state
> TASK_LOST for task fake-marathon-pacemaker-task-(...)'.
> 3. I searched for RunTaskMessage in the agent log but could not find
> it. Is this the exact text to search for or is this the name of the
> protobuf message? Are these logged on a higher log level?
>
> On Tue, Oct 4, 2016 at 11:22 AM, haosdent  wrote:
> > staging is the initialize status of the task. I think you may your logs
> via
> > these steps:
> >
> > 1. If your framework registered successfully in the master?
> > 2. If the master send resources offers to your framework and your
> framework
> > accept it?
> > 3. If your agents receive the RunTaskMessage from master to launch your
> > task?
> >
> > In additionally, use `export GLOG_v=1` before start masters and agents
> may
> > helpful for your troubleshooting.
> >
> > On Tue, Oct 4, 2016 at 4:58 PM, Frank Scholten 
> > wrote:
> >>
> >> Hi all,
> >>
> >> I am looking for some ways to troubleshoot or debug tasks that are
> >> stuck in the 'staging' state. Typically they have no logs in the
> >> sandbox.
> >>
> >> Are there are any endpoints or things to look for in logs to identify
> >> a root cause?
> >>
> >> Is there a troubleshooting guide for Mesos to solve problems like this?
> >>
> >> Cheers,
> >>
> >> Frank
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Resource Isolation in Mesos

2016-10-04 Thread haosdent

Hi, @Srikant

Usually, your task should be killed when it goes over the cgroup limit. Would
you enter the `/sys/fs/cgroup/memory/mesos` folder on the agent?
Then check the values in `${YOUR_CONTAINER_ID}/memory.limit_in_bytes`,
`${YOUR_CONTAINER_ID}/memory.soft_limit_in_bytes` and
`${YOUR_CONTAINER_ID}/memory.memsw.limit_in_bytes` and reply with them in
this email thread.

`${YOUR_CONTAINER_ID}` is the container id of your task; you can find it in
the agent log. Or, since you said you only have this one task, there should
be only one directory under `/sys/fs/cgroup/memory/mesos`.

Furthermore, would you show the result of
http://${YOUR_AGENT_IP}:5051/containers?
It contains some task statistics as well.
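
To collect the three values in one go, something like the sketch below could be used. It assumes a cgroup v1 memory hierarchy mounted at `/sys/fs/cgroup/memory`; `read_memory_limits` is a helper name I made up, and the demo writes illustrative values into a temporary directory that stands in for `/sys/fs/cgroup/memory/mesos/${YOUR_CONTAINER_ID}`.

```python
import os
import tempfile

LIMIT_FILES = (
    "memory.limit_in_bytes",
    "memory.soft_limit_in_bytes",
    "memory.memsw.limit_in_bytes",
)


def read_memory_limits(container_cgroup_dir):
    """Return {file name: integer value}, with None for a missing file."""
    limits = {}
    for name in LIMIT_FILES:
        path = os.path.join(container_cgroup_dir, name)
        try:
            with open(path) as handle:
                limits[name] = int(handle.read().strip())
        except FileNotFoundError:
            # The memsw file may be missing, e.g. without swap accounting.
            limits[name] = None
    return limits


# Demo against a temporary directory standing in for the container's cgroup
# directory; the values written here are illustrative only.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in LIMIT_FILES:
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("167772160\n")
    demo_limits = read_memory_limits(demo_dir)

for name, value in demo_limits.items():
    print(name, value)
```

On a real agent you would pass the container's cgroup directory to `read_memory_limits` instead of the temporary one.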

On Tue, Oct 4, 2016 at 9:00 PM, Srikant Kalani 
wrote:

> We have upgraded linux from rhel6 to rhel7 and mesos from 0.27 to 1.0.1.
> After upgrade we are not able to see memory used by task which was fine in
> previous version. Due to this cgroups are not effective.
>
> Answers to your questions below :
>
> There is only 1 task running as a appserver which is consuming approx 20G
> mem but this info is not coming in Mesos UI.
> Swaps are enabled in agent start command.
> These flags are used in agent - cgroups_limits_swap=true
> --isolation=cgroups/cpu,cgroups/mem --cgroups_hierachy=/sys/fs/c group
> In agent logs I can see updated memory limit to 33MB for container.
>
> Web UI shows the total memory allocated to framework but it is not showing
> memory used by task.It always shows 0B/33MB.
>
> Not sure if this is rhel7 issue or mesos 1.0.1.
>
> Any suggestions ?
> On 26 Sep 2016 21:55, "haosdent"  wrote:
>
>> Hi, @Srikant May you elaborate
>>
>> >We have verified using top command that framework was using 2gB memory
>> while allocated was just 50 mb.
>>
>> * How many running tasks in your framework?
>> * Do you enable or disable swap in the agents?
>> * What's the flags that you launch agents?
>> * Have you saw some thing like `Updated 'memory.limit_in_bytes' to ` in
>> the log of agent?
>>
>> On Tue, Sep 27, 2016 at 12:14 AM, Srikant Kalani <
>> srikant.blackr...@gmail.com> wrote:
>>
>>> Hi Greg ,
>>>
>>> Previously we were running Mesos 0.27 on Rhel6 and since we already have
>>> one c group hierarchy for cpu and memory for our production  processes I'd
>>> we were not able to merge two c groups hierarchy on rhel6. Slave process
>>> was not coming up.
>>> Now we have moved  to Rhel7 and both mesos master and slave are running
>>> on rhel7 with c group implemented.But we are seeing that mesos UI not
>>> showing the actual memory used by framework.
>>>
>>> Any idea why framework usage of cpu and memory is not coming in UI. Due
>>> to this OS is still not killing the task which are consuming more memory
>>> than the allocated one.
>>> We have verified using top command that framework was using 2gB memory
>>> while allocated was just 50 mb.
>>>
>>> Please suggest.
>>> On 8 Sep 2016 01:53, "Greg Mann"  wrote:
>>>
>>>> Hi Srikant,
>>>> Without using cgroups, it won't be possible to enforce isolation of
>>>> cpu/memory on a Linux agent. Could you elaborate a bit on why you aren't
>>>> able to use cgroups currently? Have you tested the existing Mesos cgroup
>>>> isolators in your system?
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>> On Tue, Sep 6, 2016 at 9:24 PM, Srikant Kalani <
>>>> srikant.blackr...@gmail.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> We are running Mesos cluster in our development environment. We are
>>>>> seeing the cases where framework uses more amount of resources like cpu 
>>>>> and
>>>>> memory then the initial requested resources. When any new framework is
>>>>> registered Mesos calculates the resources on the basis of already offered
>>>>> resources to first framework and it doesn't consider actual  resources
>>>>> utilised by previous framework.
>>>>> This is resulting in incorrect calculation of resources.
>>>>> Mesos website says that we should Implement  c groups but it is not
>>>>> possible in our case as we have already implemented c groups in other
>>>>> projects and due to Linux restrictions  we can't merge two c groups
>>>>> hierarchy.
>>>>>
>>>>> Any idea how we can implement resource Isolation in Mesos ?
>>>>>
>>>>> We are using Mesos 0.27.1
>>>>>
>>>>> Thanks
>>>>> Srikant Kalani
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>


-- 
Best Regards,
Haosdent Huang

Re: Troubleshooting tasks that are stuck in the 'Staging' state

2016-10-04 Thread haosdent

Staging is the initial state of a task. I think you could check your logs
with these steps:

1. Has your framework registered successfully with the master?
2. Does the master send resource offers to your framework, and does your
framework accept them?
3. Do your agents receive the RunTaskMessage from the master to launch your
task?

In addition, running `export GLOG_v=1` before starting the masters and
agents may help with your troubleshooting.
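
Besides the logs, the master's /metrics/snapshot endpoint can give a quick
first impression of the three checks above. The sketch below is only a
heuristic of mine: the metric keys are real keys from the snapshot, but the
mapping from each key to a step is an assumption, and `staging_hints` is a
made-up helper name.

```python
def staging_hints(snapshot):
    """Map a decoded /metrics/snapshot dict to rough hints for the three checks."""
    hints = []
    if snapshot.get("master/frameworks_connected", 0) == 0:
        hints.append("step 1: no framework is connected to this master")
    if snapshot.get("master/messages_launch_tasks", 0) == 0:
        hints.append("step 2: the master has received no launch requests")
    if snapshot.get("master/tasks_staging", 0) > 0:
        hints.append("step 3: tasks are staging; check the agent logs")
    return hints
```

An empty list does not prove that everything is fine; it only means none of
these three counters looks suspicious.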

On Tue, Oct 4, 2016 at 4:58 PM, Frank Scholten 
wrote:

> Hi all,
>
> I am looking for some ways to troubleshoot or debug tasks that are
> stuck in the 'staging' state. Typically they have no logs in the
> sandbox.
>
> Are there are any endpoints or things to look for in logs to identify
> a root cause?
>
> Is there a troubleshooting guide for Mesos to solve problems like this?
>
> Cheers,
>
> Frank
>

-- 
Best Regards,
Haosdent Huang

Re: determine slave capabilities

2016-10-04 Thread haosdent

Hi, @Hendrik You could specify the --attributes flag when starting the Mesos
agent. For example, use --attributes=docker:false. Then you can read it from
the `Offer` in your framework. Another way is to query the /flags endpoint
of the agent from your framework. You can get the URL of the agent from the
`Offer` as well.
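
A small Python sketch of the attribute approach. It assumes the offer has
already been decoded into a dict shaped like the JSON form of `Offer`, and
that agents without Docker are started with `--attributes=docker:false`;
the helper names are only for illustration.

```python
def text_attribute(offer, name):
    """Return the text value of the named attribute, or None if it is absent."""
    for attribute in offer.get("attributes", []):
        if attribute.get("name") == name and "text" in attribute:
            return attribute["text"].get("value")
    return None


def supports_docker(offer):
    """Treat an agent as Docker-capable unless it advertises docker:false."""
    return text_attribute(offer, "docker") != "false"
```

The framework can then decline offers for which `supports_docker(offer)` is
false before trying to launch a Docker container on that agent.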

On Tue, Oct 4, 2016 at 5:06 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> is there a way for a framework to determine what containerizers are
> available on a slave? I have a setup where one slave has no docker engine
> so that I get an error when I try to start a container on that slave. Thus
> it would be nice if I could somehow check in advanced what capabilities a
> slave has.
>
> regards,
> Hendrik
>

-- 
Best Regards,
Haosdent Huang

Re: Target version vs Fixed Version

2016-10-03 Thread haosdent

For resolved issues, is it OK to do something similar? For example, the issue
https://issues.apache.org/jira/browse/MESOS-5613 makes mesos-local not work
in 1.0.x, and I think it would be better to cherry-pick it into 1.0.x.

On Tue, Oct 4, 2016 at 9:17 AM, Vinod Kone  wrote:

> Hi,
>
> Going forward, if you want an unresolved issue to be targeted for a
> specific version please set the "Target Version". The committer that
> commits the fix and resolves the ticket will set the appropriate "Fix
> Version".
> This applies to backports as well.
>
> Thanks,
> Vinod
>
> -- Forwarded message --
> From: Vinod Kone (JIRA) 
> Date: Mon, Oct 3, 2016 at 6:13 PM
> Subject: [jira] [Updated] (MESOS-6026) Tasks mistakenly marked as FAILED
> due to race b/w sendExecutorTerminatedStatusUpdate() and
> _statusUpdate()
> To: iss...@mesos.apache.org
>
>
>
>  [ https://issues.apache.org/jira/browse/MESOS-6026?page=
> com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Vinod Kone updated MESOS-6026:
> ------
> Target Version/s: 1.0.2
>Fix Version/s: (was: 1.0.2)
>



-- 
Best Regards,
Haosdent Huang

Re: Web UI no longer shows Tasks information

2016-09-28 Thread haosdent

Hi, @bmahler I think this is expected behavior when you open the web UI of a
non-leading master, because its metrics about frameworks and tasks are not
updated.
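
One way to confirm which master is leading is the `master/elected` metric in
/metrics/snapshot. A minimal sketch, assuming the snapshots of all masters
have already been fetched and decoded into dicts; `find_leader` is a
hypothetical helper name.

```python
def find_leader(snapshots):
    """Return the address whose snapshot reports master/elected == 1, or None.

    `snapshots` maps a master address to its decoded /metrics/snapshot dict.
    """
    for address, snapshot in snapshots.items():
        if snapshot.get("master/elected") == 1:
            return address
    return None
```

The task counters in the web UI should only be trusted when they come from
the address this returns.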

On Sep 29, 2016 4:12 AM, "Benjamin Mahler"  wrote:

> Thanks for reporting this Rodrick, do you see any errors in your browser's
> console?
>
> On Tue, Sep 27, 2016 at 4:29 AM, Rodrick Brown 
> wrote:
>
>>
>> On Sep 27, 2016, at 3:43 AM, haosdent  wrote:
>>
>> Hi, @Rodrick
>>
>> >"master/frameworks_connected": 0,
>>
>> Is it because the master you open no the leading master?
>>
>>
>>
>> Sorry you’re right the leader has switched over, so it seems the stats
>> are their but the UI isn’t able to render this information in the Task
>> pane.
>>
>> $  curl -s http://leader.mesos:5050/metrics/snapshot | jq
>> {
>>   "master/messages_status_update": 11461,
>>   "master/messages_unregister_slave": 0,
>>   "master/messages_reconcile_tasks": 5660,
>>   "master/tasks_lost": 59,
>>   "master/messages_decline_offers": 783792,
>>   "master/invalid_status_update_acknowledgements": 0,
>>   "system/load_5min": 0.23,
>>   "master/tasks_failed": 2202,
>>   "master/messages_launch_tasks": 4051,
>>   "master/messages_resource_request": 0,
>>   "master/messages_status_update_acknowledgement": 8398,
>>   "master/slaves_connected": 16,
>>   "master/event_queue_http_requests": 0,
>>   "master/messages_deactivate_framework": 189,
>>   "master/messages_reregister_slave": 33,
>>   "master/messages_executor_to_framework": 0,
>>   "registrar/log/recovered": 1,
>>   "master/slave_removals/reason_registered": 10,
>>   "master/messages_suppress_offers": 101,
>>   "master/uptime_secs": 62643.405314816,
>>   "allocator/mesos/resources/disk/total": 2547171,
>>   "registrar/queued_operations": 0,
>>   "system/load_1min": 0.08,
>>   "master/slaves_disconnected": 0,
>>   "master/frameworks_active": 13,
>>   "master/elected": 1,
>>   "master/valid_executor_to_framework_messages": 0,
>>   "allocator/mesos/resources/mem/total": 890692,
>>   "master/valid_status_updates": 8409,
>>   "system/load_15min": 0.33,
>>   "allocator/event_queue_dispatches": 1,
>>   "master/slaves_active": 16,
>>   "registrar/state_store_ms/p": 77.609414272,
>>   "registrar/state_store_ms/p95": 33.68032,
>>   "master/frameworks_connected": 13,
>>   "allocator/mesos/roles/sparkr/shares/dominant": 0.202054133190822,
>>   "master/messages_exited_executor": 3,
>>   "master/cpus_total": 220,
>>   "master/mem_revocable_percent": 0,
>>   "master/slave_registrations": 25,
>>   "allocator/mesos/roles/*/shares/dominant": 0.353636363636364,
>>   "allocator/mesos/event_queue_dispatches": 1,
>>   "master/tasks_running": 92,
>>   "master/slaves_inactive": 0,
>>   "master/messages_register_framework": 764,
>>   "master/event_queue_dispatches": 17,
>>   "master/messages_update_slave": 34,
>>   "master/mem_revocable_total": 0,
>>   "master/messages_reregister_framework": 204,
>>   "master/dropped_messages": 12,
>>   "master/tasks_staging": 0,
>>   "allocator/mesos/resources/cpus/offered_or_allocated": 112.6,
>>   "master/tasks_error": 0,
>>   "master/invalid_framework_to_executor_messages": 0,
>>   "master/invalid_executor_to_framework_messages": 0,
>>   "master/slave_shutdowns_scheduled": 15,
>>   "master/slave_removals/reason_unregistered": 0,
>>   "master/frameworks_inactive": 1,
>>   "master/tasks_finished": 1819,
>>   "master/frameworks_disconnected": 1,
>>   "master/disk_revocable_used": 0,
>>   "master/messages_authenticate": 0,
>>   "registrar/state_store_ms/max": 77.662208,
>>   "master/event_queue_messages": 0,
>>   "master/slave_shutdowns_canceled": 0,
>>   "master/messages_kill_task": 369,
>>   "master/slave_reregistrations": 16,
>>   "allocator/mesos/allocation_run_ms/p999": 4.60756198

Re: cgroup blkio controller for persistent volume ?

2016-09-28 Thread haosdent

Hi, @vincent We have not yet added blkio support to the Mesos containerizer.
See https://issues.apache.org/jira/browse/MESOS-6162

On Wed, Sep 28, 2016 at 3:42 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Hi,
> Is Mesos using the cgroup blkio controller for managing persistent volume
> isolation ?
> If true, did you already notice a drop in I/O performance compared to host
> volume like with Docker (cf https://github.com/docker/docker/issues/21485
> )  ?
>



-- 
Best Regards,
Haosdent Huang

Re: Web UI no longer shows Tasks information

2016-09-27 Thread haosdent

Hi, @Rodrick

>"master/frameworks_connected": 0,

Is it because the master you opened is not the leading master?

On Tue, Sep 27, 2016 at 1:26 PM, Rodrick Brown 
wrote:

>
>
> On Sep 26, 2016, at 11:54 PM, haosdent  wrote:
>
> Hi, @Rodrick What is the response when you open `
> http://${MASTER_IP}:${MASTER_PORT}/metrics/snapshot`?
>
>
> $ curl -s http://master:5050/metrics/snapshot |jq
> {
>   "master/slave_shutdowns_scheduled": 0,
>   "master/invalid_executor_to_framework_messages": 0,
>   "master/mem_used": 0,
>   "allocator/mesos/allocation_runs": 40793,
>   "master/slave_shutdowns_completed": 0,
>   "master/invalid_status_updates": 0,
>   "master/messages_authenticate": 0,
>   "master/frameworks_disconnected": 0,
>   "master/disk_revocable_used": 0,
>   "master/messages_exited_executor": 0,
>   "master/messages_status_update": 0,
>   "master/messages_unregister_slave": 0,
>   "master/messages_framework_to_executor": 0,
>   "master/messages_reconcile_tasks": 0,
>   "master/tasks_lost": 0,
>   "master/messages_decline_offers": 0,
>   "allocator/mesos/allocation_run_ms/p99": 0.01587456,
>   "master/recovery_slave_removals": 0,
>   "allocator/mesos/allocation_run_ms/max": 0.028928,
>   "master/cpus_total": 0,
>   "master/messages_register_slave": 0,
>   "allocator/mesos/allocation_run_ms/p90": 0.013056,
>   "master/tasks_running": 0,
>   "allocator/mesos/allocation_run_ms/p999": 0.027905024,
>   "master/slave_reregistrations": 0,
>   "master/cpus_percent": 0,
>   "allocator/mesos/resources/mem/offered_or_allocated": 0,
>   "master/messages_register_framework": 0,
>   "allocator/mesos/allocation_run_ms": 0.011008,
>   "allocator/mesos/event_queue_dispatches": 0,
>   "master/tasks_staging": 0,
>   "master/slave_removals/reason_unregistered": 0,
>   "allocator/mesos/allocation_run_ms/count": 1000,
>   "master/event_queue_http_requests": 0,
>   "master/slave_removals": 0,
>   "master/gpus_used": 0,
>   "allocator/mesos/allocation_run_ms/p50": 0.012032,
>   "master/dropped_messages": 0,
>   "allocator/mesos/allocation_run_ms/min": 0.00384,
>   "master/disk_total": 0,
>   "allocator/mesos/resources/disk/offered_or_allocated": 0,
>   "system/mem_free_bytes": 13478334464,
>   "master/slave_removals/reason_unhealthy": 0,
>   "master/gpus_revocable_total": 0,
>   "master/cpus_revocable_total": 0,
>   "allocator/mesos/allocation_run_ms/p": 0.028825702401,
>   "system/cpus_total": 8,
>   "master/mem_percent": 0,
>   "master/gpus_percent": 0,
>   "master/mem_revocable_used": 0,
>   "master/valid_status_update_acknowledgements": 0,
>   "master/disk_used": 0,
>   "master/messages_unregister_framework": 0,
>   "allocator/mesos/allocation_run_ms/p95": 0.013056,
>   "master/messages_resource_request": 0,
>   "master/slaves_inactive": 0,
>   "master/messages_update_slave": 0,
>   "master/event_queue_dispatches": 3,
>   "master/event_queue_messages": 0,
>   "allocator/mesos/resources/cpus/offered_or_allocated": 0,
>   "master/cpus_revocable_used": 0,
>   "system/mem_total_bytes": 15769341952,
>   "master/tasks_killing": 0,
>   "allocator/mesos/resources/cpus/total": 0,
>   "master/messages_revive_offers": 0,
>   "master/gpus_revocable_percent": 0,
>   "master/disk_percent": 0,
>   "master/messages_kill_task": 0,
>   "master/slave_shutdowns_canceled": 0,
>   "master/gpus_revocable_used": 0,
>   "master/disk_revocable_percent": 0,
>   "master/frameworks_connected": 0,
>   "master/slave_registrations": 0,
>   "master/cpus_revocable_percent": 0,
>   "master/mem_revocable_percent": 0,
>   "master/disk_revocable_total": 0,
>   "master/gpus_total": 0,
>   "master/tasks_finished": 0,
>   "master/frameworks_inactive": 0,
>   "master/outstanding_offers": 0,
>   "master/valid_framework_to_executor_messages": 0,
>   "master/cpus_used": 0,
>   "master/ta

Re: Web UI no longer shows Tasks information

2016-09-26 Thread haosdent

Hi, @Rodrick What is the response when you open
`http://${MASTER_IP}:${MASTER_PORT}/metrics/snapshot`?

On Tue, Sep 27, 2016 at 11:43 AM, Rodrick Brown  wrote:

> I just upgraded our cluster from 0.28.2 to 1.0.1 and notice the Web UI
> task details are no longer updating. I'm trying to see if this issue is
> isolated to my setup? or broken across the board I was not able to find
> anything about this bug for 1.0.1
>
>
> Tasks
> Staging 0
> Starting 0
> Running 0
> Killing 0
> Finished 0
> Killed 0
> Failed 0
> Lost 0
> Orphan 21
>
>
>
> --
>
> *Rodrick Brown */ *Lead SRE - DevOps*
>
> 9174456839 / rodr...@orchardplatform.com
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY
>
>



-- 
Best Regards,
Haosdent Huang

Re: Resource Isolation in Mesos

2016-09-26 Thread haosdent

Hi, @Srikant Could you elaborate on this?

>We have verified using top command that framework was using 2gB memory
while allocated was just 50 mb.

* How many tasks are running in your framework?
* Is swap enabled or disabled on the agents?
* Which flags do you use to launch the agents?
* Have you seen something like `Updated 'memory.limit_in_bytes' to ` in the
agent log?
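
For the last question, a small Python sketch that picks those lines out of
an agent log. The marker string is the one quoted above; everything after it
on the line is returned untouched, because I do not want to assume the exact
log format. `limit_updates` is just an illustrative name.

```python
import re

# The marker quoted above; the remainder of the line is captured verbatim.
MARKER = re.compile(r"Updated 'memory\.limit_in_bytes' to (.+)")


def limit_updates(lines):
    """Return the text following the marker for every matching log line."""
    found = []
    for line in lines:
        match = MARKER.search(line)
        if match:
            found.append(match.group(1).strip())
    return found
```

You can feed it an open log file directly, for example
`limit_updates(open(path_to_agent_log))`, where `path_to_agent_log` is
wherever your agent writes its log.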

On Tue, Sep 27, 2016 at 12:14 AM, Srikant Kalani <
srikant.blackr...@gmail.com> wrote:

> Hi Greg ,
>
> Previously we were running Mesos 0.27 on Rhel6 and since we already have
> one c group hierarchy for cpu and memory for our production  processes I'd
> we were not able to merge two c groups hierarchy on rhel6. Slave process
> was not coming up.
> Now we have moved  to Rhel7 and both mesos master and slave are running on
> rhel7 with c group implemented.But we are seeing that mesos UI not showing
> the actual memory used by framework.
>
> Any idea why framework usage of cpu and memory is not coming in UI. Due to
> this OS is still not killing the task which are consuming more memory than
> the allocated one.
> We have verified using top command that framework was using 2gB memory
> while allocated was just 50 mb.
>
> Please suggest.
> On 8 Sep 2016 01:53, "Greg Mann"  wrote:
>
>> Hi Srikant,
>> Without using cgroups, it won't be possible to enforce isolation of
>> cpu/memory on a Linux agent. Could you elaborate a bit on why you aren't
>> able to use cgroups currently? Have you tested the existing Mesos cgroup
>> isolators in your system?
>>
>> Cheers,
>> Greg
>>
>> On Tue, Sep 6, 2016 at 9:24 PM, Srikant Kalani <
>> srikant.blackr...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> We are running Mesos cluster in our development environment. We are
>>> seeing the cases where framework uses more amount of resources like cpu and
>>> memory then the initial requested resources. When any new framework is
>>> registered Mesos calculates the resources on the basis of already offered
>>> resources to first framework and it doesn't consider actual  resources
>>> utilised by previous framework.
>>> This is resulting in incorrect calculation of resources.
>>> Mesos website says that we should Implement  c groups but it is not
>>> possible in our case as we have already implemented c groups in other
>>> projects and due to Linux restrictions  we can't merge two c groups
>>> hierarchy.
>>>
>>> Any idea how we can implement resource Isolation in Mesos ?
>>>
>>> We are using Mesos 0.27.1
>>>
>>> Thanks
>>> Srikant Kalani
>>>
>>
>>


-- 
Best Regards,
Haosdent Huang

Re: multi-tenancy in mesos

2016-09-19 Thread haosdent

There was a talk about this at MesosCon EU:
http://schd.ws/hosted_files/mesosconeu2016/cd/compute_final_mesosConEur2016.pdf

On Mon, Sep 19, 2016 at 11:14 PM, tommy xiao  wrote:

> Hi team,
>
> anyone have some experience with multi-tenancy purpose build on mesos
> cluster?
> could you please share some hints? thanks a lot.
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>

-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

2016-09-14 Thread haosdent

"This API/flag
> will be deprecated in Mesos 2.0... "
> > To help folks discover deprecations we can have a live document that
> lists the deprecated features by version. Currently the CHANGELOG file only
> lists deprecation in the next release so there's not a place to put the
> deprecations for 2.0 when we are only at 1.1.0 (WIP).
> > Thoughts?
> >
> > Jiang Yan Xu 
> >
> > On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider wrote:
> > Responses inline
> >
> > > On Sep 6, 2016, at 1:33 AM, haosdent wrote:
> > >
> > > Hi, Silas. Thanks a lot to help test the health check changes recently.
> > >
> > > According to my understanding about your email, you mentioned two
> problems:
> > >
> > > 1. The bug that broken exists HTTP/command health check caused by
> r50812 <https://reviews.apache.org/r/50812> and r50996
> <https://reviews.apache.org/r/50996>
> > >
> > > >It is now true that even with the proposed change (51560), we will
> still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact
> code working in 1.0.0.
> > > >Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> > >
> > > As you mentioned, this is a bug and we definitely should fix before
> release 1.1.0.
> > > I have updated r51560 <https://reviews.apache.org/r/51560> yesterday and
> verify it fix the problem via r51635. As you see in
> > > the r51560 <https://reviews.apache.org/r/51560>, we make sure the protobuf
> compatible again and didn't lose any
> > > fields. Would you help to double check if it fixes your problem when
> you free?
> > > It would be highly appreciated that if you could help to verify it.
> > >
> > > After this bug fix, we could ensure all tasks with HTTP/command health
> check are not when upgrading to 1.1.0.
> > >
> >
> > I see those changes now (I’m very very bad at the review board UI, so
> I’m sorry if it was always there and I missed it somehow).
> >
> > > 2. Should we make the `HealthCheck::type` required after v2 ?
> > >
> > > To be honest, I think 6 months should be enough and it also should be
> changed in
> > > v1 because it is a minor change and we didn't make it `required` in
> protobuf
> > > message level. We still keeping it `option` in protobuf message
> definition and
> > > add a check about it in Mesos code.
> > > But your concerns make sense as well, so let's see what other
> users/developers say to
> > > see if we could make an agreement on this.
> > >
> >
> > This is an important point. It doesn’t make sense to me that the
> compatibility policy is talking about only whether a protobuf field is
> optional or required — it seems to me that any change that takes a protocol
> exchange that did not result in a TASK_ERROR before, and changes it to
> cause a TASK_ERROR now, *is* making that protobuf field semantically
> required, whether or not the protobuf def says so.
> >
> > I’ll also point out that there is no definition of ‘minor’ change in the
> compatibility document, and therefore, whether or not a change appears to
> be ‘minor’ under some rubric (and I agree that this change could seem
> minor), if it’s part of the v1 mesos.proto, it affects downstream users
> (such as me, a writer of schedulers).
> >
> > > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider wrote:
> > > There’s a little history to this:
> > >
> > > In https://reviews.apache.org/r/50812/, on the 8th of August, the HTTP
> health check message was changed to be entirely incompatible with the
> previous HTTP health check message. Not only was its name changed (breaking
> compatibility with anyone using the feature with libmesos), but the field
> tags were rearranged, making it truly wire-format incompatible

Re: forcing framework to re-schedule?

2016-09-13 Thread haosdent

Hi, @Victor The task ID is the `task_id` you specify in `TaskInfo` when you
launch the task.
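
In other words, the scheduler chooses the ID itself, so it only has to
remember it. A minimal Python sketch, using plain dicts shaped like the JSON
form of `TaskInfo` in place of the real bindings; `TaskRegistry` and its
methods are hypothetical names.

```python
import uuid


class TaskRegistry:
    """Remember the task IDs the scheduler generates at launch time."""

    def __init__(self):
        self.tasks = {}  # task ID -> TaskInfo-like dict

    def new_task(self, name, agent_id):
        """Build a TaskInfo-like dict with a fresh, unique task ID and record it."""
        task_id = f"{name}-{uuid.uuid4()}"
        task_info = {
            "name": name,
            "task_id": {"value": task_id},
            "agent_id": {"value": agent_id},
        }
        self.tasks[task_id] = task_info
        return task_info

    def ids_for(self, name):
        """Return the recorded task IDs for all tasks launched under this name."""
        return [tid for tid, info in self.tasks.items() if info["name"] == name]
```

When you later want to kill a task, you look the ID up in the registry and
pass it to `killTask`.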

On Wed, Sep 14, 2016 at 6:22 AM, Victor L  wrote:

> how can i get taskId to call "killTask"?
>
> On Tue, Sep 13, 2016 at 9:59 AM, haosdent  wrote:
>
>> If you want to kill the task from the scheduler, you just need to call
>> `killTask`
>> (https://github.com/apache/mesos/blob/1.0.x/include/mesos/scheduler.hpp#L257).
>> If you want to kill the task by health check, you could try to set the
>> correct `consecutive_failures` number
>> (https://github.com/apache/mesos/blob/1.0.x/include/mesos/v1/mesos.proto#L358).
>>
>> On Tue, Sep 13, 2016 at 2:53 AM, Victor L  wrote:
>>
>>> How can i explicitly kill the task from my class?
>>>
>>> On Mon, Sep 12, 2016 at 2:10 PM, haosdent  wrote:
>>>
>>>> If the target you perform health check is your task, Mesos support
>>>> health check by a command. When your task reaches the health task failure
>>>> limit, the task would be killed and then your framework could launch the
>>>> task again when receives the `TASK_KILLED` in `statusUpdate`.
>>>>
>>>> On Tue, Sep 13, 2016 at 2:03 AM, Victor L  wrote:
>>>>
>>>>> It checks if process is functional. I don't think standard
>>>>> healthchecks wouldn't be sufficient for my purpose and my question still
>>>>> stands: how  to use result...
>>>>>
>>>>> On Mon, Sep 12, 2016 at 1:48 PM, haosdent  wrote:
>>>>>
>>>>>> Hi, @victor What's your health check agent used for? Because Mesos
>>>>>> supports health checks now.
>>>>>>
>>>>>> On Tue, Sep 13, 2016 at 1:46 AM, Victor L 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> I am writing "healthcheck agent" for mesos deployment framework as
>>>>>>> independent thread periodically checking if main process ( started by
>>>>>>> framework) is running...
>>>>>>> What would be the mechanism to "communicate" failure to the
>>>>>>> framework  to cause specific outcome? For example: how can i use 
>>>>>>> failure to
>>>>>>> cause framework to reschedule deployment on different node?
>>>>>>> Thanks,
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Haosdent Huang
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Unified cgroups isolator

2016-09-13 Thread haosdent

Really appreciate @qian's and @jie's great help on this! It makes it easier
for us to add cgroups isolation for the remaining subsystems.

Additionally, if you find that any of the unified cgroups isolator changes
break your environment, please let us know. I will try to fix them ASAP.

On Wed, Sep 14, 2016 at 1:59 AM, Jie Yu  wrote:

> Hi,
>
> We just merged the unified cgroups isolator. Huge shout out to @haosdent
> and @qianzhang to make this happen!
> https://issues.apache.org/jira/browse/MESOS-4697
>
> Just to give you some context. Previously, it's a huge pain to add a new
> cgroups subsystem to Mesos because it requires creating a new isolator (a
> lot of code duplication). Now, we merge all the subsystems into one single
> isolator, that makes adding a new subsystem very easy.
>
> More importantly, the new cgroups isolator supports cgroups v2!
>
> - Jie
>

-- 
Best Regards,
Haosdent Huang

Re: forcing framework to re-schedule?

2016-09-13 Thread haosdent

If you want to kill the task from the scheduler, you just need to call
`killTask`
(https://github.com/apache/mesos/blob/1.0.x/include/mesos/scheduler.hpp#L257).
If you want the task to be killed by a health check, you could try setting
an appropriate `consecutive_failures` number
(https://github.com/apache/mesos/blob/1.0.x/include/mesos/v1/mesos.proto#L358).
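
To illustrate what a `consecutive_failures` threshold means, here is a small
Python sketch of the counting logic. This is my own simplification, not the
Mesos health checker code, and the default of 3 is just a value picked for
the sketch.

```python
class FailureCounter:
    """Count consecutive failed health checks against a threshold."""

    def __init__(self, consecutive_failures=3):
        self.limit = consecutive_failures
        self.failures = 0

    def record(self, healthy):
        """Record one check result; return True once the task should be killed."""
        # A single healthy result resets the streak.
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.limit
```

With a limit of 2, the sequence fail, ok, fail, fail only triggers on the
last check, because the healthy result in between resets the count.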

On Tue, Sep 13, 2016 at 2:53 AM, Victor L  wrote:

> How can i explicitly kill the task from my class?
>
> On Mon, Sep 12, 2016 at 2:10 PM, haosdent  wrote:
>
>> If the target you perform health check is your task, Mesos support health
>> check by a command. When your task reaches the health task failure limit,
>> the task would be killed and then your framework could launch the task
>> again when receives the `TASK_KILLED` in `statusUpdate`.
>>
>> On Tue, Sep 13, 2016 at 2:03 AM, Victor L  wrote:
>>
>>> It checks if process is functional. I don't think standard healthchecks
>>> wouldn't be sufficient for my purpose and my question still stands: how  to
>>> use result...
>>>
>>> On Mon, Sep 12, 2016 at 1:48 PM, haosdent  wrote:
>>>
>>>> Hi, @victor What's your health check agent used for? Because Mesos
>>>> supports health checks now.
>>>>
>>>> On Tue, Sep 13, 2016 at 1:46 AM, Victor L  wrote:
>>>>
>>>>> Hello,
>>>>> I am writing "healthcheck agent" for mesos deployment framework as
>>>>> independent thread periodically checking if main process ( started by
>>>>> framework) is running...
>>>>> What would be the mechanism to "communicate" failure to the framework
>>>>> to cause specific outcome? For example: how can i use failure to cause
>>>>> framework to reschedule deployment on different node?
>>>>> Thanks,
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: forcing framework to re-schedule?

2016-09-12 Thread haosdent

If the target of your health check is your task, Mesos supports health
checks via a command. When your task reaches the health check failure limit,
the task is killed, and your framework can launch the task again when it
receives `TASK_KILLED` in `statusUpdate`.
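
On the framework side this can be as simple as queueing the task for
relaunch when the update arrives. A rough Python sketch, with the state
passed as a plain string and `RelaunchQueue` as a hypothetical helper; the
real `statusUpdate` callback receives a `TaskStatus` message instead.

```python
from collections import deque


class RelaunchQueue:
    """Collect killed tasks so they can be launched again on a later offer."""

    def __init__(self):
        self.pending = deque()

    def status_update(self, task_id, state):
        """Call this from the scheduler's status update callback."""
        if state == "TASK_KILLED":
            self.pending.append(task_id)

    def next_to_launch(self):
        """Return the oldest task waiting for relaunch, or None."""
        return self.pending.popleft() if self.pending else None
```

When the next offer comes in, the scheduler asks `next_to_launch()` and, if
it gets a task back, launches it on whichever agent made the offer, which
may well be a different node.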

On Tue, Sep 13, 2016 at 2:03 AM, Victor L  wrote:

> It checks if process is functional. I don't think standard healthchecks
> wouldn't be sufficient for my purpose and my question still stands: how  to
> use result...
>
> On Mon, Sep 12, 2016 at 1:48 PM, haosdent  wrote:
>
>> Hi, @victor What's your health check agent used for? Because Mesos
>> supports health checks now.
>>
>> On Tue, Sep 13, 2016 at 1:46 AM, Victor L  wrote:
>>
>>> Hello,
>>> I am writing "healthcheck agent" for mesos deployment framework as
>>> independent thread periodically checking if main process ( started by
>>> framework) is running...
>>> What would be the mechanism to "communicate" failure to the framework
>>> to cause specific outcome? For example: how can i use failure to cause
>>> framework to reschedule deployment on different node?
>>> Thanks,
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: forcing framework to re-schedule?

2016-09-12 Thread haosdent

Hi, @victor What's your health check agent used for? Because Mesos supports
health checks now.

On Tue, Sep 13, 2016 at 1:46 AM, Victor L  wrote:

> Hello,
> I am writing "healthcheck agent" for mesos deployment framework as
> independent thread periodically checking if main process ( started by
> framework) is running...
> What would be the mechanism to "communicate" failure to the framework  to
> cause specific outcome? For example: how can i use failure to cause
> framework to reschedule deployment on different node?
> Thanks,
>



-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

2016-09-06 Thread haosdent

Hi, Silas. Thanks a lot for helping to test the recent health check changes.

As I understand your email, you mentioned two problems:

1. The bug that broke existing HTTP/command health checks, caused by r50812
<https://reviews.apache.org/r/50812> and r50996
<https://reviews.apache.org/r/50996>

>It is now true that even with the proposed change (51560), we will still
get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code
working in 1.0.0.
>Even in the case of the command health checks, which are once again
supported in 51560, we now get deprecation warnings, suggesting that mesos
will again break us in 1.4.

As you mentioned, this is a bug and we definitely should fix it before
releasing 1.1.0.
I updated r51560 <https://reviews.apache.org/r/51560> yesterday and verified
that it fixes the problem via r51635. As you can see in
r51560 <https://reviews.apache.org/r/51560>, we make the protobuf
compatible again and don't lose any fields. Would you help double-check
whether it fixes your problem when you are free?
It would be highly appreciated if you could help verify it.

After this bug fix, we can ensure that tasks with HTTP/command health
checks are not broken when upgrading to 1.1.0.

2. Should we make the `HealthCheck::type` required after v2?

To be honest, I think 6 months should be enough, and it could also be changed
in v1, because it is a minor change and we didn't make it `required` at the
protobuf message level. We still keep it `optional` in the protobuf message
definition and add a check for it in the Mesos code.
But your concerns make sense as well, so let's see what other
users/developers say and whether we can reach an agreement on this.

On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider  wrote:

> There’s a little history to this:
>
> In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/>, on the 8th of August, the HTTP health check message was changed
> to be entirely incompatible with the previous HTTP health check message.
> Not only was its name changed (breaking compatibility with anyone using the
> feature with libmesos), but the field tags were rearranged, making it truly
> wire-format incompatible. This change also introduced a ‘type’ field to the
> HealthCheck message as an optional enum.
>
> Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/>, on the 13th of August, the health
> checking code was changed to make the new ‘type’ field mandatory — if the
> protobuf field is not present, the mesos master rejects your task with
> TASK_ERROR.
>
> A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> to try to fix this.
>
> In the course of reviewing that fix, I noticed that it only addresses the
> case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
>
> Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
>
> It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/>
> would prohibit this sort of change from occurring. Specifically, the ‘API
> Versioning’ section says "The API version is only bumped if we need to make
> a backwards incompatible API change. We will strive to support a given API
> version for at least a year.” and under the ‘API compatibility’ the change
> is considered to be breaking if it would involve "Adding new required
> fields to existing requests to “/scheduler”.”
>
> The proposed change does indeed add a new required field — ‘type’ to the
> v1 api, in the case of command health checks in 6 months, in the case of
> http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> dep

Question about the deprecated policy after 1.0

2016-09-05 Thread haosdent

Hi folks. As I mentioned in my previous email,
http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1,
we have added `type` to the `HealthCheck` protobuf definition in 1.1.0, and
health checks without `type` specified are deprecated as of 1.1.0.

For backwards compatibility, we still support command health checks when
the type is not specified for now. But we plan to make `type` a required
field and return `TASK_ERROR` if the type is not specified after 6 months.
The question is whether this meets the deprecation policy in place since 1.0.
Or is 6 months too short, so that we have to wait until 2.0 to make this
change?

I look forward to your answers. Any concerns and questions are appreciated,
thanks a lot!
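
To make the two stages of the plan concrete, here is a minimal sketch of the
validation rule in Python. It is only an illustration of the behavior
described above, not the actual Mesos code: the function name, the `strict`
switch and the plain-dict representation of a health check are assumptions.

```python
import warnings

# Enum names taken from the `HealthCheck.Type` definition discussed in this thread.
KNOWN_TYPES = {"COMMAND", "HTTP", "TCP"}


def resolve_health_check_type(health_check, strict=False):
    """Return the effective type of a health check given as a plain dict.

    strict=False models the transition period: a check without `type` that
    carries a `command` is still treated as COMMAND, with a warning.
    strict=True models the proposed end state: a missing `type` is an error
    (the case that would surface as TASK_ERROR).
    """
    check_type = health_check.get("type")
    if check_type in KNOWN_TYPES:
        return check_type
    if check_type is not None:
        raise ValueError("unsupported health check type: %r" % check_type)
    if not strict and "command" in health_check:
        warnings.warn("health check without `type` is deprecated",
                      DeprecationWarning)
        return "COMMAND"
    raise ValueError("health check `type` must be specified")
```

With `strict=False`, a scheduler that only sends a `command` keeps working but
sees a deprecation warning; with `strict=True` the same input is rejected.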

-- 
Best Regards,
Haosdent Huang

Re: what is the status on this?

2016-09-04 Thread haosdent

Jay has some patches to decouple Mesos from ZooKeeper:

https://issues.apache.org/jira/browse/MESOS-5828
https://issues.apache.org/jira/browse/MESOS-5829

I think it should be possible to support Consul via custom modules once
Jay's work is done.

On Sun, Sep 4, 2016 at 6:02 PM, kant kodali  wrote:

> Hi Alex,
>
> We have some experienced devops people here and they all had one thing in
> common which is Zookeeper is a pain to maintain. In fact we refused to
> bring in new tech stacks that require Zookeeper such as Kafka for example.
> so we desperately in search for alternative preferably using consul. I just
> hear lot of positive response when comes it consul. It will be great to see
> mesos and consul working together in which we would be ready to jump at it
> and make a switch for YARN to Mesos.
>
> Thanks,
> Kant
>
>
>
> On Wed, Aug 31, 2016 1:03 AM, Alex Rukletsov a...@mesosphere.com wrote:
>
>> Kant—
>>
>> mind telling us what is your use case and why this ticket is important
>> for you? It will help us prioritize work.
>>
>> On Fri, Aug 26, 2016 at 2:46 AM, tommy xiao  wrote:
>>
>> Hi guys, i always focus on t his case. but good news is etcd always have
>> patchs. so the coming consul is very easy, just need some time to do coding
>> on it. if you have interesting it? let us collaborate it.
>>
>> 2016-08-26 8:11 GMT+08:00 Joseph Wu :
>>
>> There is no timeline as no one has done any work on the issue.
>>
>>
>> On Thu, Aug 25, 2016 at 4:54 PM, kant kodali  wrote:
>>
>> Hi Guys,
>>
>> I see this ticket and other related tickets should be part of sprints in
>> 2015 and it is still not resolved yet. can we have a timeline on this? This
>> would be really helpful
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>> Thanks!
>>
>>
>>
>>
>>
>> --
>> Deshi Xiao
>> Twitter: xds2000
>> E-mail: xiaods(AT)gmail.com
>>
>>
>>


-- 
Best Regards,
Haosdent Huang

Re: Support HTTP(s)/TCP Health Check in Mesos

2016-09-02 Thread haosdent

I just tested with curl 7.50.1, and HTTP/2 is supported.

On Sat, Sep 3, 2016 at 12:32 AM, haosdent  wrote:

> The current implementation of HTTP(s) health check is based on curl.
> According to the document of curl
>
> >Since 7.47.0, the curl tool enables HTTP/2 by default for HTTPS
> connections.
>
> So I think it should be supported if the curl version in your Mesos Agent
> is higher that 7.47. But I have not yet try this.
>
> On Sat, Sep 3, 2016 at 12:23 AM, Aaron Wood  wrote:
>
>> Since you mentioned that you're working on supporting HTTPS health checks
>> I'm curious if there are any plans to support HTTP/2 over TLS (or even
>> over
>> plain HTTP). I would think that using HTTP/2 for any communication that
>> happens in Mesos would provide a nice improvement in heavy load
>> situations.
>>
>> On Fri, Sep 2, 2016 at 10:59 AM, haosdent  wrote:
>>
>> > Hi, dear friends. @alexr and I are working on supporting HTTP(s)/TCP
>> Health
>> > Check in Mesos.
>> > We have finished and committed some initial works. But if you use the
>> old
>> > protobuf definition of
>> > `HealthCheck` to implement HTTP health check in your custom executor
>> > before, our changes recently would
>> > break it.
>> >
>> > The change of the protobuf definition of `HealthCheck` is
>> >
>> > ```
>> >  message HealthCheck {
>> >  +  enum Type {
>> >  +UNKNOWN = 0;
>> >  +COMMAND = 1;
>> >  +HTTP = 2;
>> >  +TCP = 3;
>> >  +  }
>> >  +
>> >  -  message HTTP {
>> >  +  message HTTPCheckInfo {
>> >  +optional string scheme = 1;
>> >  -required uint32 port = 1;
>> >  +required uint32 port = 2;
>> >  -optional string path = 2 [default = "/"];
>> >  +optional string path = 3;
>> >  -repeated uint32 statuses = 4;
>> > }
>> > ...
>> >  +  optional Type type = 8;
>> >  -  // HTTP health check - not yet recommended for use, see MESOS-2533.
>> >  -  optional HTTP http = 1;
>> >  +  optional HTTPCheckInfo http = 1;
>> > ...
>> >   }
>> > ```
>> >
>> > Noted that we add a field `type` to specific the health check type and
>> use
>> > `HTTPCheckInfo` instead of `HTTP`.
>> > As I know, Mesos didn't support HTTP health check before 1.0 and it is
>> > supposed to not used.
>> >
>> > But thanks to @swsnider to report the issues recently, user may
>> implement
>> > the custom executor with
>> > HTTP health check. So I am writing this email to check if anyone
>> > implemented HTTP health check in custom executor
>> > like @swsnider and if you depend on the old protobuf definition of
>> > `HealthCheck` heavily.
>> > If so, how many month your need for the deprecation cycle of this?
>> >
>> > Any concerns and questions are appreciated, thanks a lot!
>> >
>> > --
>> > Best Regards,
>> > Haosdent Huang
>> >
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Support HTTP(s)/TCP Health Check in Mesos

2016-09-02 Thread haosdent

The current implementation of the HTTP(S) health check is based on curl.
According to the curl documentation:

>Since 7.47.0, the curl tool enables HTTP/2 by default for HTTPS
connections.

So I think it should be supported if the curl version on your Mesos agent
is 7.47.0 or higher. But I have not tried this yet.
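
If you want to check an agent quickly, a small sketch like the following
could be used on the output of `curl --version`. The helper names are made up
for illustration, and it assumes the usual output layout (a first line
starting with `curl X.Y.Z` and a `Features:` line that lists `HTTP2` when curl
was built with HTTP/2 support).

```python
import re

# Version quoted from the curl documentation cited above.
HTTP2_DEFAULT_SINCE = (7, 47, 0)


def parse_curl_version(version_output):
    """Extract (major, minor, patch) from the first line of `curl --version`."""
    match = re.match(r"curl (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("unrecognized curl version output")
    return tuple(int(part) for part in match.groups())


def enables_http2_by_default(version_output):
    """True if this curl is new enough and was built with HTTP/2 support."""
    built_with_http2 = "HTTP2" in version_output
    new_enough = parse_curl_version(version_output) >= HTTP2_DEFAULT_SINCE
    return built_with_http2 and new_enough
```

Pass it the text printed by `curl --version` on the agent host. The version
comparison alone is not sufficient, which is why the sketch also looks for the
`HTTP2` feature.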

On Sat, Sep 3, 2016 at 12:23 AM, Aaron Wood  wrote:

> Since you mentioned that you're working on supporting HTTPS health checks
> I'm curious if there are any plans to support HTTP/2 over TLS (or even over
> plain HTTP). I would think that using HTTP/2 for any communication that
> happens in Mesos would provide a nice improvement in heavy load situations.
>
> On Fri, Sep 2, 2016 at 10:59 AM, haosdent  wrote:
>
> > Hi, dear friends. @alexr and I are working on supporting HTTP(s)/TCP
> Health
> > Check in Mesos.
> > We have finished and committed some initial works. But if you use the old
> > protobuf definition of
> > `HealthCheck` to implement HTTP health check in your custom executor
> > before, our changes recently would
> > break it.
> >
> > The change of the protobuf definition of `HealthCheck` is
> >
> > ```
> >  message HealthCheck {
> >  +  enum Type {
> >  +UNKNOWN = 0;
> >  +COMMAND = 1;
> >  +HTTP = 2;
> >  +TCP = 3;
> >  +  }
> >  +
> >  -  message HTTP {
> >  +  message HTTPCheckInfo {
> >  +optional string scheme = 1;
> >  -required uint32 port = 1;
> >  +required uint32 port = 2;
> >  -optional string path = 2 [default = "/"];
> >  +optional string path = 3;
> >  -repeated uint32 statuses = 4;
> > }
> > ...
> >  +  optional Type type = 8;
> >  -  // HTTP health check - not yet recommended for use, see MESOS-2533.
> >  -  optional HTTP http = 1;
> >  +  optional HTTPCheckInfo http = 1;
> > ...
> >   }
> > ```
> >
> > Noted that we add a field `type` to specific the health check type and
> use
> > `HTTPCheckInfo` instead of `HTTP`.
> > As I know, Mesos didn't support HTTP health check before 1.0 and it is
> > supposed to not used.
> >
> > But thanks to @swsnider to report the issues recently, user may implement
> > the custom executor with
> > HTTP health check. So I am writing this email to check if anyone
> > implemented HTTP health check in custom executor
> > like @swsnider and if you depend on the old protobuf definition of
> > `HealthCheck` heavily.
> > If so, how many month your need for the deprecation cycle of this?
> >
> > Any concerns and questions are appreciated, thanks a lot!
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
>



-- 
Best Regards,
Haosdent Huang

Support HTTP(s)/TCP Health Check in Mesos

2016-09-02 Thread haosdent

Hi, dear friends. @alexr and I are working on supporting HTTP(S)/TCP health
checks in Mesos.
We have finished and committed some initial work. But if you used the old
protobuf definition of
`HealthCheck` to implement HTTP health checks in your custom executor,
our recent changes will
break it.

The change to the protobuf definition of `HealthCheck` is:

```
 message HealthCheck {
+  enum Type {
+    UNKNOWN = 0;
+    COMMAND = 1;
+    HTTP = 2;
+    TCP = 3;
+  }
+
-  message HTTP {
+  message HTTPCheckInfo {
+    optional string scheme = 1;
-    required uint32 port = 1;
+    required uint32 port = 2;
-    optional string path = 2 [default = "/"];
+    optional string path = 3;
-    repeated uint32 statuses = 4;
   }
   ...
+  optional Type type = 8;
-  // HTTP health check - not yet recommended for use, see MESOS-2533.
-  optional HTTP http = 1;
+  optional HTTPCheckInfo http = 1;
   ...
 }
```

Note that we added a `type` field to specify the health check type and use
`HTTPCheckInfo` instead of `HTTP`.
As far as I know, Mesos didn't support HTTP health checks before 1.0, so the
old message was not supposed to be used.
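
To illustrate why renumbering the fields matters on the wire, here is a small
Python sketch that hand-encodes the `port` field the way protobuf does. It
follows the protobuf encoding rules (key = field number << 3 | wire type);
the helper names are mine and the port value 8080 is only an example.

```python
# Protobuf wire types (from the protobuf encoding specification).
WIRE_VARINT = 0            # used for uint32 fields such as `port`
WIRE_LENGTH_DELIMITED = 2  # used for string fields such as `scheme` and `path`


def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def field_key(field_number, wire_type):
    """Key that precedes every field on the wire: (number << 3) | wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_uint32_field(field_number, value):
    return field_key(field_number, WIRE_VARINT) + encode_varint(value)


old_port = encode_uint32_field(1, 8080)               # old `HTTP`: port = 1
new_port = encode_uint32_field(2, 8080)               # new `HTTPCheckInfo`: port = 2
new_scheme_key = field_key(1, WIRE_LENGTH_DELIMITED)  # new: scheme = 1 (string)
```

An old client writes `port` under the key byte `0x08`, while the new
definition expects `port` under `0x10` and a string under field 1 (`0x0a`). A
reader using the new definition therefore does not find the required `port`
where the old client put it.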

But thanks to @swsnider for reporting the issue recently, we learned that
users may have implemented custom executors with
HTTP health checks. So I am writing this email to check whether anyone has
implemented HTTP health checks in a custom executor
like @swsnider, and whether you depend heavily on the old protobuf definition of
`HealthCheck`.
If so, how many months do you need for the deprecation cycle?

Any concerns and questions are appreciated, thanks a lot!

-- 
Best Regards,
Haosdent Huang

Re: Best Practices for Scheduler in Scala

2016-09-01 Thread haosdent

https://github.com/mesosphere/mesos-rxjava is based on the new HTTP API. I
think you could check it out.
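
For reference, the first call such a library makes against the HTTP API is a
SUBSCRIBE request. The following is a minimal Python sketch that only builds
that request (it does not send it); the function name and the example master
address, user and framework name are placeholders of mine.

```python
import json
import urllib.request


def build_subscribe_request(master, user, name):
    """Build (but do not send) a SUBSCRIBE call for the v1 scheduler API."""
    body = {
        "type": "SUBSCRIBE",
        "subscribe": {"framework_info": {"user": user, "name": name}},
    }
    return urllib.request.Request(
        "http://%s/api/v1/scheduler" % master,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )
```

The response to SUBSCRIBE is a long-lived stream of events rather than a
single JSON document, which is the part a client library such as mesos-rxjava
handles for you.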

On Wed, Aug 31, 2016 at 7:38 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> I am just getting started with Mesos. My goal is to run some
> user-submitted code in a Docker container on my cluster. As per my
> understanding, I would need to write a Scheduler that submits some tasks to
> the Mesos cluster and tracks their process.
>
> I have discovered that I can write a Scheduler in an arbitrary language
> (Scala for me) using the Scheduler HTTP API <http://mesos.apache.org/
> documentation/latest/scheduler-http-api/>. That page says "most scheduler
> developers should use a library for their language of choice that manages
> the details of the HTTP API" and links to <http://mesos.apache.org/
> documentation/latest/api-client-libraries/>. (Note that that page doesn't
> mention "HTTP" anywhere except in the .)
>
> The latter page lists a mesos-scala-api library <https://github.com/nokia/
> mesos-scala-api> provided by Nokia, but looking at the dependencies
> (libmesos.so etc.), this is not actually using the HTTP API, is it? Also,
> the "Hello World" example from <https://gist.github.com/guenter/7471695>
> directly imports org.apache.mesos and works fine as is, so now I am a bit
> unsure about which is the way to write a Scheduler, using the "Hello World"
> approach, the Nokia library or write a client for the HTTP API myself. Any
> suggestions?
>
> Thank you,
> Tobias
>
>


-- 
Best Regards,
Haosdent Huang

Re: missing documentation: view_frameworks, view_tasks etc in mesos 1.0

2016-08-31 Thread haosdent

Hi @haripriya, isn't "view_executors" already in the documentation
(https://github.com/apache/mesos/blob/master/docs/authorization.md#authorizable-actions)?
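
Since the earlier problem was a syntax issue in the ACL file, one way to avoid
that is to generate the JSON instead of writing it by hand. This is only a
sketch: the action names and their object fields are copied from the
configuration quoted below in this thread, and the function name is made up.

```python
import json

ANY = {"type": "ANY"}

# (action, object field) pairs copied from the configuration quoted in this thread.
ACTIONS = [
    ("get_endpoints", "paths"),
    ("view_frameworks", "users"),
    ("view_tasks", "users"),
    ("view_executors", "users"),
    ("access_sandboxes", "users"),
    ("access_mesos_logs", "logs"),
]


def build_permissive_acls():
    """Return an ACL document that allows ANY principal for every action above."""
    return {
        action: [{"principals": dict(ANY), object_field: dict(ANY)}]
        for action, object_field in ACTIONS
    }


# json.dumps always emits balanced braces and no trailing commas.
acls_json = json.dumps(build_permissive_acls(), indent=2)
```

Writing `acls_json` to the ACL file gives a document that is guaranteed to
parse, so any remaining problem is in the rules themselves and not in the
syntax.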

On Thu, Sep 1, 2016 at 4:41 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Well, I had to turn on auth for run_tasks, I had different set of
> configuration there.
> I had some syntax issue with the above mentioned configurations in my
> original file, fixed them and it works file.
> Is there a way the flags view_executors etc can be added to the existing
> documentation?
>
> On Wed, Aug 31, 2016 at 1:26 AM, haosdent  wrote:
>
>> Because your types are ANY, have you consider disable auth via don't
>> specify `--acl` flag when you launch Mesos master?
>>
>>
>>
>> On Wed, Aug 31, 2016 at 3:00 AM, Haripriya Ayyalasomayajula <
>> aharipriy...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I've upgraded my mesos cluster to 1.0.
>>> I have spark and Marathon registered as frameworks and have no problem
>>> running jobs.
>>> I am unable to see any frameworks nor any tasks on the web UI.
>>>
>>> I found out that the following fields have been added to acls.
>>>  view_frameworks, view_tasks, view_executors, access_sandboxes,
>>> access_mesos_logs
>>> and there are no examples related to these in:
>>> http://mesos.apache.org/documentation/latest/authorization/
>>> Can someone help me understand where I'm going wrong?
>>>
>>> Looking at the JIRA https://issues.apache.org/jira/browse/MESOS-5746
>>> I tried to come up with this json configuration, but that doesn't seem
>>> to work either.
>>> Here is my mesos_acls.json file:
>>>
>>>   "get_endpoints": [  {
>>>
>>>   "principals": {  "type": "ANY" },
>>>
>>>   "paths": {  "type": "ANY"  }  }
>>>
>>>],
>>>
>>>
>>>   "view_frameworks": [  {
>>>
>>>   "principals": {  "type": "ANY" },
>>>
>>>   "users": {  "type": "ANY"  }  }
>>>
>>>],
>>>
>>>
>>>   "view_tasks": [  {
>>>
>>>   "principals": {  "type": "ANY" },
>>>
>>>   "users": {  "type": "ANY"  }  }
>>>
>>>],
>>>
>>>  "view_executors": [  {
>>>
>>>   "principals": {  "type": "ANY" },
>>>
>>>   "users": {  "type": "ANY"  }  }
>>>
>>>],
>>>
>>>  "access_sandboxes": [  {
>>>
>>>   "principals": {  "type": "ANY" },
>>>
>>>   "users": {  "type": "ANY"  }  }
>>>
>>>],
>>>
>>>  "access_mesos_logs": [  {
>>>
>>>   "principals": {  "type": "ANY" },
>>>
>>>   "logs": {  "type": "ANY"  }  }
>>>
>>>],
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Haripriya Ayyalasomayajula
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Regards,
> Haripriya Ayyalasomayajula
>
>


-- 
Best Regards,
Haosdent Huang

Re: missing documentation: view_frameworks, view_tasks etc in mesos 1.0

2016-08-31 Thread haosdent

Since all your types are ANY, have you considered disabling authorization by
not specifying the `--acls` flag when you launch the Mesos master?



On Wed, Aug 31, 2016 at 3:00 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Hi all,
>
> I've upgraded my mesos cluster to 1.0.
> I have spark and Marathon registered as frameworks and have no problem
> running jobs.
> I am unable to see any frameworks nor any tasks on the web UI.
>
> I found out that the following fields have been added to acls.
>  view_frameworks, view_tasks, view_executors, access_sandboxes,
> access_mesos_logs
> and there are no examples related to these in: http://mesos.apache.org/
> documentation/latest/authorization/
> Can someone help me understand where I'm going wrong?
>
> Looking at the JIRA https://issues.apache.org/jira/browse/MESOS-5746
> I tried to come up with this json configuration, but that doesn't seem to
> work either.
> Here is my mesos_acls.json file:
>
>   "get_endpoints": [  {
>
>   "principals": {  "type": "ANY" },
>
>   "paths": {  "type": "ANY"  }  }
>
>],
>
>
>   "view_frameworks": [  {
>
>   "principals": {  "type": "ANY" },
>
>   "users": {  "type": "ANY"  }  }
>
>],
>
>
>   "view_tasks": [  {
>
>   "principals": {  "type": "ANY" },
>
>   "users": {  "type": "ANY"  }  }
>
>],
>
>  "view_executors": [  {
>
>   "principals": {  "type": "ANY" },
>
>   "users": {  "type": "ANY"  }  }
>
>],
>
>  "access_sandboxes": [  {
>
>   "principals": {  "type": "ANY" },
>
>   "users": {  "type": "ANY"  }  }
>
>],
>
>  "access_mesos_logs": [  {
>
>   "principals": {  "type": "ANY" },
>
>   "logs": {  "type": "ANY"  }  }
>
>],
>
>
>
> --
> Regards,
> Haripriya Ayyalasomayajula
>
>


-- 
Best Regards,
Haosdent Huang

Re: Setup Multi Node Mesos Cluster

2016-08-29 Thread haosdent

Hi @Mahendra, did you follow
https://github.com/apache/mesos/blob/master/docs/getting-started.md to
build and install Mesos from source? Do you have detailed logs for:

>*Failing with HTTP test failing*.

On Fri, Aug 19, 2016 at 6:39 PM, Mahendra Singh <
mahendra.singh1...@gmail.com> wrote:

> Hi All,
> I need your help is setting up Mesos Multinode Cluster:
>
> I am new to mesos, facing issue while setting up Mesos Multi node cluster.
> I have two Nodes: Node1 and Node 2
>
> this is what i am doing to do set up.
>
> *Node 1 Terminal*
>
> Step Install all required libraris
> Done : make
>make check
>*Failing with HTTP test failing*.
>  Starting Mesos Master:
> master@Node1: /build/mesos-master --work_dir_/var/lib/mesos
>
> Mesos Master is running well
>
> *Node 2 Terminal*
>
> slave@Node2: /build/mesos-slave --master:MasterServerHostname:5050
>  Slave is also starting.
>
> Now I am trying to test using test framework on maser Node1
>
> JAVA:
>Running /build/src/exampls/java/test-framework
> MasterServerHostname:5050
> Failed: no example.jar file present
>
> Python:
>   Running: /build/src/exampls/python/test-framework
> MasterServerHostname:50
> 50
>
> Failing: Task 0 is in state TASK_LOST
>
> The update data did not match!
>   Expected: 'data with a \x00 byte'
>   Actual:   ''
> Failed to call scheduler's statusUpdate
>
> Any idea or pointer where I am doing wrong.
> --
> Regards,
> Mahendra
>



-- 
Best Regards,
Haosdent Huang

Re: how to enable force_pull_image on docker containerizer

2016-08-27 Thread haosdent

Hi @John, `force_pull_image` is a field of the `DockerInfo` protobuf message
that the framework sets per task. If you use
Marathon, you can refer to
https://mesosphere.github.io/marathon/docs/native-docker.html for the
details.

```
{
  "type": "DOCKER",
  "docker": {
    "image": "group/image",
    "forcePullImage": true
  }
}
```
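
The fragment above sits under the `container` key of a Marathon app
definition. As a rough sketch, assuming a hypothetical app id and image name,
the surrounding structure could be built like this in Python:

```python
import json


def docker_app(app_id, image, force_pull=True):
    """Build a minimal Marathon app definition that force-pulls its image."""
    return {
        "id": app_id,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "forcePullImage": force_pull},
        },
    }


# "/my-app" and "group/image" are placeholders, not real names.
app_json = json.dumps(docker_app("/my-app", "group/image"), indent=2)
```

A real app definition also needs resources such as `cpus` and `mem`; they are
left out here to keep the focus on `forcePullImage`.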

On Sat, Aug 27, 2016 at 12:10 PM, John Wetherill 
wrote:

> According to this docker-containerizer doc
> <http://mesos.apache.org/documentation/latest/docker-containerizer/>:
>
> "The containerizer also supports optional force pulling of the image. It
> is set disabled as default, so the docker image will only be updated again
> if it’s not available on the host. To enable force pulling an image,
> force_pull_image has to be set as true."
>
> I haven't been able to figure out how to enable force_pull_image for
> mesos-slave. There are hints here
> <https://issues.apache.org/jira/browse/MESOS-1886> but still not clear to
> me.
>
> Any additional hints appreciated.
>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Reply: mesos-go example scheduler doesn't work

2016-08-21 Thread haosdent

Hi, could you show the associated logs from the Mesos agent?

On Aug 19, 2016 5:49 PM, "志昌 余"  wrote:

> I ran a scheduler with a cluster (3 nodes) of mesos masters. The
> "127.0,.0.1" in mesos log looks strange. (I guess mesos-go reported a wrong
> IP to mesos?)
>
> So I changed my env to use only one mesos master, and ensure scheduler run
> at the same machine with mesos master. This worked around the "Deactivated
> framework" problem.
>
>
> Then I get a different problem... all tasks failed:
>
>
> I0819 17:39:38.678068   17505 main.go:132] Received Offer <
> e39b3090-6d8a-4b1d-9a3e-defcfd9fa9c2-O62 > with cpus= 2  mem= 2928
> I0819 17:39:38.678135   17505 main.go:164] Prepared task: go-task-63 with
> offer e39b3090-6d8a-4b1d-9a3e-defcfd9fa9c2-O62 for launch
> I0819 17:39:38.678209   17505 main.go:170] Launching  1 tasks for offer
> e39b3090-6d8a-4b1d-9a3e-defcfd9fa9c2-O62
> E0819 17:39:38.691142   17505 main.go:205] executor
> "&ExecutorID{Value:*default,XXX_unrecognized:[],}" lost on slave
> "&SlaveID{Value:*0583e05a-2ddb-4db5-945f-00d1c98d3780-S2,XXX_unrecognized:[],}"
> code -1
> I0819 17:39:38.691603   17505 main.go:176] Status update: task 63  is in
> state  TASK_FAILED
> E0819 17:39:38.691662   17505 main.go:205] executor
> "&ExecutorID{Value:*default,XXX_unrecognized:[],}" lost on slave
> "&SlaveID{Value:*4f24314d-e38d-457c-82e3-0f5535315007-S1,XXX_unrecognized:[],}"
> code -1
> E0819 17:39:38.692593   17505 main.go:205] executor
> "&ExecutorID{Value:*default,XXX_unrecognized:[],}" lost on slave
> "&SlaveID{Value:*d14a8781-f524-4531-bd16-2217506fa594-S0,XXX_unrecognized:[],}"
> code -1
> I0819 17:39:38.694184   17505 main.go:176] Status update: task 62  is in
> state  TASK_FAILED
> I0819 17:39:38.694818   17505 main.go:176] Status update: task 61  is in
> state  TASK_FAILED
>
>
> --
> *From:* 志昌 余
> *Sent:* August 19, 2016 17:31:10
> *To:* user@mesos.apache.org
> *Subject:* Reply: mesos-go example scheduler doesn't work
>
>
> I also tried per the README.md:
> [cannon@yzc-mesos1 examples]$ ./_output/scheduler -master=10.18.6.57:5050
> -executor="$EXECUTOR_BIN" -logtostderr=true
> I0819 17:10:39.068475   17278 main.go:215] Initializing the Example
> Scheduler...
> I0819 17:10:39.076339   17278 scheduler.go:334] Initializing mesos
> scheduler driver
> I0819 17:10:39.077385   17278 scheduler.go:833] Starting the scheduler
> driver...
> I0819 17:10:39.077616   17278 http_transporter.go:383] listening on
> 127.0.0.1 port 52671
> I0819 17:10:39.078865   17278 scheduler.go:850] Mesos scheduler driver
> started with PID=scheduler(1)@127.0.0.1:52671
> I0819 17:10:39.080876   17278 scheduler.go:1053] Scheduler driver
> running.  Waiting to be stopped.
> I0819 17:10:39.391125   17278 scheduler.go:419] New master
> master@10.18.6.57:5050 detected
> I0819 17:10:39.391610   17278 scheduler.go:483] No credentials were
> provided. Attempting to register scheduler without authentication.
>
>
>
>
> --
> *From:* 志昌 余
> *Sent:* August 19, 2016 17:28
> *To:* user@mesos.apache.org
> *Subject:* mesos-go example scheduler doesn't work
>
>
> Hi all,
>
> I'm trying https://github.com/mesos/mesos-go (master branch) with
> mesos-master:0.28.0-2.0.16.ubuntu1404 (run with docker).
>
> The scheduler doesn't run any tasks. Here's the output, and stuck
> there forever:
> [cannon@yzc-mesos1 scheduler]$ ./scheduler -master 10.18.6.57:5050
> -logtostderr=true
> I0819 16:50:11.850130   16619 main.go:215] Initializing the Example
> Scheduler...
> I0819 16:50:11.854875   16619 scheduler.go:334] Initializing mesos
> scheduler driver
> I0819 16:50:11.855194   16619 scheduler.go:833] Starting the scheduler
> driver...
> I0819 16:50:11.855344   16619 http_transporter.go:383] listening on
> 127.0.0.1 port 36443
> I0819 16:50:11.855444   16619 scheduler.go:850] Mesos scheduler driver
> started with PID=scheduler(1)@127.0.0.1:36443
> I0819 16:50:11.855500   16619 scheduler.go:1053] Scheduler driver
> running.  Waiting to be stopped.
> I0819 16:50:11.946344   16619 scheduler.go:419] New master
> master@10.18.6.57:5050 detected
> I0819 16:50:11.946375   16619 scheduler.go:483] No credentials were
> provided. Attempting to register scheduler without authentication.
>
> The mesos master log indicates that it deactivated that scheduler
> again and again:
> I0819 17:25:41.866633 9 master.cpp:2231] Received SUBSCRIBE call for
> framework 'Test Framework (Go)' at scheduler(1)@127.0.0.1:52671
> I0819 17:25:41.866986 9 master.cpp:2302] Subscribing framework Test
> Framework (Go) with checkpointing disabled and capabilities [  ]
> E0819 17:25:41.86932913 process.cpp:1958] Failed to shutdown socket
> with fd 15: Transport endpoint is not connected
> I0819 17:25:41.869443 7 hierarchical.cpp:265] Added framework
> 0583e05a-2ddb-4db5-945f-00d1c98d3780-0083
> I0819 17:25:41.870447 9 master.cpp:1212] Framework
> 0583e05a-2ddb-4db5-945f-00d1c98d3780-0083 (Test Framework (Go)) at
> scheduler(1)@127.0.0

Re: Mesos python bindings

2016-08-17 Thread haosdent

Hi @Rodrick. Currently, the ways to get the Python bindings are to build
Mesos from source or to install the debs from
https://open.mesosphere.com/downloads/mesos/

On Thu, Aug 18, 2016 at 3:42 AM, Rodrick Brown 
wrote:

> Where can I find official python bindings for Mesos the latest I see on
> pip is 0.19 which seems to be over ~2 years old has this project been
> discontinued?
> I’m on Mesos 0.28.2
>
>
> --
>
> [image: Orchard Platform] <http://www.orchardplatform.com/>
>
> Rodrick Brown / DevOPs Engineer
> +1 917 445 6839 / rodr...@orchardplatform.com
> 
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY 10003
> http://www.orchardplatform.com
>
> Orchard Blog <http://www.orchardplatform.com/blog/> | Marketplace Lending
> Meetup <http://www.meetup.com/Peer-to-Peer-Lending-P2P/>
>
>
>



-- 
Best Regards,
Haosdent Huang

Do all the topics in MesosCon Asia share in English

2016-08-17 Thread haosdent

Hi, MesosCon Asia is now accepting submissions, which close on September
9th. Are all talks supposed to be given in English?

Or are talks in Chinese accepted as well?

-- 
Best Regards,
Haosdent Huang

Re: Using mesos' cfs limits on a docker container?

2016-08-14 Thread haosdent

Personally, I suggest using the approach @Joseph and @Avinash mentioned,
because the patches from zhitao and me require Docker >= 1.7.0.

On Mon, Aug 15, 2016 at 1:27 AM, haosdent  wrote:

> Not sure if this related to https://issues.apache.org/
> jira/browse/MESOS-2154
> So far we have a quick workaround: specify the `cpu-period` and
> `cpu-quota` in the parameters field of `DockerInfo`. Then `Docker::run`
> would delegate this to the docker daemon.
>
> And recently zhitao and me work on the fix for this, we have some under
> reviewing patches. I think it should be fixed shortly once zhitao and my
> patches ready.
>
> On Mon, Aug 15, 2016 at 12:11 AM, Erik Weathers 
> wrote:
>
>> What was the problem and how did you overcome it?  (i.e. This would be a
>> sad resolution to this thread for someone faced with this same problem in
>> the future.)
>>
>>
>> On Sunday, August 14, 2016, Mark Hammons 
>> wrote:
>>
>>> I finally got this working after fiddling with it all night. It works
>>> great so far!
>>>
>>> Mark Edgar Hammons II - Research Engineer at BioEmergences
>>> 0603695656
>>>
>>> On 14 Aug 2016, at 04:50, Joseph Wu  wrote:
>>>
>>> If you're not against running Docker containers without the Docker
>>> daemon, try using the Unified containerizer.
>>> See the latter half of this document: http://mesos.apache.org/docume
>>> ntation/latest/mesos-containerizer/
>>>
>>> On Sat, Aug 13, 2016 at 7:02 PM, Mark Hammons <
>>> mark.hamm...@inaf.cnrs-gif.fr> wrote:
>>>
>>>> Hi All,
>>>>
>>>>
>>>>
>>>> I was having a lot of success having mesos force sandboxed programs to
>>>> work within cpu and memory constraints, but when I added docker into the
>>>> mix, the cpu limitations go out the window (not sure about the memory
>>>> limitations. Is there any way to mix these two methods of isolation? I'd
>>>> like my executor/algorithm to run inside a docker container, but have that
>>>> container's memory and cpu usage controlled by systemd/mesos.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>> --
>>>>
>>>> Mark Hammons - +33 06 03 69 56 56
>>>>
>>>> Research Engineer @ BioEmergences
>>>> <http://bioemergences.iscpif.fr>
>>>>
>>>> Lab Phone: 01 69 82 34 19
>>>>
>>>
>>>
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang

Re: Using mesos' cfs limits on a docker container?

2016-08-14 Thread haosdent

Not sure if this is related to https://issues.apache.org/jira/browse/MESOS-2154.
So far we have a quick workaround: specify `cpu-period` and `cpu-quota`
in the parameters field of `DockerInfo`. `Docker::run` then passes them
through to the docker daemon.
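
To make the workaround concrete, here is a minimal sketch of how those two
parameters could be derived from a CPU share. The helper names
(`cfs_parameters`, `docker_container_info`), the 100 ms default period, the
1 ms quota floor and the `busybox` image are assumptions for illustration,
not something taken from the Mesos code; only the `cpu-period` and
`cpu-quota` keys and the parameters field of `DockerInfo` come from the
message above.

```python
import json


def cfs_parameters(cpus, period_us=100000):
    """Build the key/value pairs for the DockerInfo parameters field.

    Assumption: a 100 ms CFS period, with the quota scaled linearly by
    the requested CPU share and floored at 1 ms.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    quota_us = max(int(cpus * period_us), 1000)
    return [
        {"key": "cpu-period", "value": str(period_us)},
        {"key": "cpu-quota", "value": str(quota_us)},
    ]


def docker_container_info(image, cpus):
    """Hypothetical helper: a ContainerInfo-shaped dict for a Docker task."""
    return {
        "type": "DOCKER",
        "docker": {"image": image, "parameters": cfs_parameters(cpus)},
    }


if __name__ == "__main__":
    # Print the JSON a framework might send for a task limited to half a CPU.
    print(json.dumps(docker_container_info("busybox", 0.5), indent=2))
```

With these assumptions, half a CPU becomes a quota of 50000 us per 100000 us
period. The framework has to keep these values in sync with the `cpus`
resource of the task by itself, which is why this is only a workaround.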

Recently zhitao and I have been working on a fix for this, and we have some
patches under review. I think it should be fixed shortly, once our patches
are ready.

On Mon, Aug 15, 2016 at 12:11 AM, Erik Weathers 
wrote:

> What was the problem and how did you overcome it?  (i.e. This would be a
> sad resolution to this thread for someone faced with this same problem in
> the future.)
>
>
> On Sunday, August 14, 2016, Mark Hammons 
> wrote:
>
>> I finally got this working after fiddling with it all night. It works
>> great so far!
>>
>> Mark Edgar Hammons II - Research Engineer at BioEmergences
>> 0603695656
>>
>> On 14 Aug 2016, at 04:50, Joseph Wu  wrote:
>>
>> If you're not against running Docker containers without the Docker
>> daemon, try using the Unified containerizer.
>> See the latter half of this document:
>> http://mesos.apache.org/documentation/latest/mesos-containerizer/
>>
>> On Sat, Aug 13, 2016 at 7:02 PM, Mark Hammons <
>> mark.hamm...@inaf.cnrs-gif.fr> wrote:
>>
>>> Hi All,
>>>
>>>
>>>
>>> I was having a lot of success having mesos force sandboxed programs to
>>> work within cpu and memory constraints, but when I added docker into the
>>> mix, the cpu limitations go out the window (not sure about the memory
>>> limitations). Is there any way to mix these two methods of isolation? I'd
>>> like my executor/algorithm to run inside a docker container, but have that
>>> container's memory and cpu usage controlled by systemd/mesos.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>> --
>>>
>>> Mark Hammons - +33 06 03 69 56 56
>>>
>>> Research Engineer @ BioEmergences
>>> <http://bioemergences.iscpif.fr>
>>>
>>> Lab Phone: 01 69 82 34 19
>>>
>>
>>


-- 
Best Regards,
Haosdent Huang

Re: [VOTE] Release Apache Mesos 1.0.1 (rc1)

2016-08-13 Thread haosdent

+1 (non-binding)

Ran `sudo make check` on CentOS 7.2 and Ubuntu 14.04.

On Sat, Aug 13, 2016 at 6:07 AM, Kapil Arya  wrote:

> +1 (binding)
>
> You can find the rpm/deb packages here:
>   http://open.mesosphere.com/downloads/mesos-rc/#apache-mesos-1.0.1-rc1
>
> The following docker tags (built off of ubuntu 14.04) are also available:
> mesosphere/mesos:1.0.1-rc1
> mesosphere/mesos-master:1.0.1-rc1
> mesosphere/mesos-slave:1.0.1-rc1
>
> Kapil
>
> On Fri, Aug 12, 2016 at 4:39 PM, Alex Rukletsov 
> wrote:
>
>> +1 (binding)
>>
>> make check on Mac OS 10.11.6 with apple clang-703.0.31.
>>
>> DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky (MESOS-4570),
>> but this does not seem to be a regression or a blocker.
>>
>> On Fri, Aug 12, 2016 at 10:30 PM, Radoslaw Gruchalski <
>> ra...@gruchalski.com>
>> wrote:
>>
>> > I am trying to build Mesos 1.0.1 for Centos 7 in a Docker container but
>> > I'm hitting this: https://issues.apache.org/jira/browse/MESOS-5925.
>> >
>> > Kind regards,
>> >
>> > Radek Gruchalski
>> > ra...@gruchalski.com
>> > +4917685656526
>> >
>> > *Confidentiality:*
>> > This communication is intended for the above-named person and may be
>> > confidential and/or legally privileged.
>> > If it has come to you in error you must take no action based on it, nor
>> > must you copy or show it to anyone; please delete/destroy and inform the
>> > sender immediately.
>> >
>> > On Thu, Aug 11, 2016 at 2:32 AM, Vinod Kone wrote:
>> >
>> >> Hi all,
>> >>
>> >>
>> >> Please vote on releasing the following candidate as Apache Mesos 1.0.1.
>> >>
>> >>
>> >> The CHANGELOG for the release is available at:
>> >>
>> >> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.0.1-rc1
>> >>
>> >> 
>> >> 
>> >>
>> >>
>> >> The candidate for Mesos 1.0.1 release is available at:
>> >>
>> >> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz
>> >>
>> >>
>> >> The tag to be voted on is 1.0.1-rc1:
>> >>
>> >> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.0.1-rc1
>> >>
>> >>
>> >> The MD5 checksum of the tarball can be found at:
>> >>
>> >> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz.md5
>> >>
>> >>
>> >> The signature of the tarball can be found at:
>> >>
>> >> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz.asc
>> >>
>> >>
>> >> The PGP key used to sign the release is here:
>> >>
>> >> https://dist.apache.org/repos/dist/release/mesos/KEYS
>> >>
>> >>
>> >> The JAR is up in Maven in a staging repository here:
>> >>
>> >> https://repository.apache.org/content/repositories/orgapachemesos-1155
>> >>
>> >>
>> >> Please vote on releasing this package as Apache Mesos 1.0.1!
>> >>
>> >>
>> >> The vote is open until Mon Aug 15 17:29:33 PDT 2016 and passes if a
>> >> majority of at least 3 +1 PMC votes are cast.
>> >>
>> >>
>> >> [ ] +1 Release this package as Apache Mesos 1.0.1
>> >>
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >>
>> >> Thanks,
>> >>
>> >
>> >
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Master pailer failure 0.28.2

2016-08-09 Thread haosdent

Hi @Charles Allen, I use 1.0.0 and it looks fine on my master. Does it
still happen after you refresh?

On Wed, Aug 10, 2016 at 1:31 AM, Charles Allen <
charles.al...@metamarkets.com> wrote:

> Slave pailer is working fine
>
> On Tue, Aug 9, 2016 at 10:29 AM Charles Allen <
> charles.al...@metamarkets.com> wrote:
>
>> For some reason I started getting the following failure on 0.28.2 with
>> the pailer when trying to view master logs from the master console
>> (REDACTED is the ip address):
>>
>>
>> angular-1.2.3.min.js:84 Error: [$interpolate:interr]
>> http://errors.angularjs.org/1.2.3/$interpolate/interr?p0=%
>> 7B%7Boffered_cpus…7D&p1=TypeError%3A%20Cannot%20read%
>> 20property%20'toFixed'%20of%20undefined
>> at Error (native)
>> at http://REDACTED:5050/static/js/angular-1.2.3.min.js:6:449
>> at Object.s (http://REDACTED:5050/static/
>> js/angular-1.2.3.min.js:73:495)
>> at f.$digest (http://REDACTED:5050/static/
>> js/angular-1.2.3.min.js:99:14)
>> at f.$apply (http://REDACTED:5050/static/js/angular-1.2.3.min.js:101:
>> 369)
>> at f (http://REDACTED:5050/static/js/angular-1.2.3.min.js:67:175)
>> at Q (http://REDACTED:5050/static/js/angular-1.2.3.min.js:71:99)
>> at XMLHttpRequest.y.onreadystatechange (http://REDACTED:5050/static/
>> js/angular-1.2.3.min.js:72:130)(anonymous function) @
>> angular-1.2.3.min.js:84(anonymous function) @ angular-1.2.3.min.js:62s @
>> angular-1.2.3.min.js:74$digest @ angular-1.2.3.min.js:99$apply @
>> angular-1.2.3.min.js:101f @ angular-1.2.3.min.js:67Q @
>> angular-1.2.3.min.js:71y.onreadystatechange @ angular-1.2.3.min.js:72
>> angular-1.2.3.min.js:84 Error: [$interpolate:interr]
>> http://errors.angularjs.org/1.2.3/$interpolate/interr?p0=%
>> 7B%7Bidle_cpus%20…7D&p1=TypeError%3A%20Cannot%20read%
>> 20property%20'toFixed'%20of%20undefined
>> at Error (native)
>> at http://REDACTED:5050/static/js/angular-1.2.3.min.js:6:449
>> at Object.s (http://REDACTED:5050/static/
>> js/angular-1.2.3.min.js:73:495)
>> at f.$digest (http://REDACTED:5050/static/
>> js/angular-1.2.3.min.js:99:14)
>> at f.$apply (http://REDACTED:5050/static/js/angular-1.2.3.min.js:101:
>> 369)
>> at f (http://REDACTED:5050/static/js/angular-1.2.3.min.js:67:175)
>> at Q (http://REDACTED:5050/static/js/angular-1.2.3.min.js:71:99)
>> at XMLHttpRequest.y.onreadystatechange (http://REDACTED:5050/static/
>> js/angular-1.2.3.min.js:72:130)(anonymous function) @
>> angular-1.2.3.min.js:84(anonymous function) @ angular-1.2.3.min.js:62s @
>> angular-1.2.3.min.js:74$digest @ angular-1.2.3.min.js:99$apply @
>> angular-1.2.3.min.js:101f @ angular-1.2.3.min.js:67Q @
>> angular-1.2.3.min.js:71y.onreadystatechange @ angular-1.2.3.min.js:72
>>
>>
>> Has anyone seen this before?
>>
>


-- 
Best Regards,
Haosdent Huang
