[jira] [Comment Edited] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-04-13 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239074#comment-15239074
 ] 

Liqiang Lin edited comment on MESOS-5188 at 4/13/16 11:06 AM:
--

I actually used the Docker containerizer ({{--containerizers=docker,mesos}}) in my
case, rather than the Mesos containerizer you mentioned. I also debugged the Mesos
Docker containerizer, {{docker.cpp:DockerContainerizerProcess::launch(...)}}, to
verify that the built-in docker executor is created rather than a customised
executor started in a docker container.

The default docker executor is started when {{taskInfo}} is set:
{code}
  if (taskInfo.isSome() && flags.docker_mesos_image.isNone()) {
    // Launching task by forking a subprocess to run docker executor.
    // TODO(steveniemitz): We should call 'update' to set CPU/CFS/mem
    // quotas after 'launchExecutorProcess'. However, there is a race
    // where 'update' can be called before mesos-docker-executor
    // creates the Docker container for the task. See more details in
    // the comments of r33174.
    return container.get()->launch = fetch(containerId, slaveId)
      .then(defer(self(), [=]() { return pull(containerId); }))
      .then(defer(self(), [=]() {
        return mountPersistentVolumes(containerId);
      }))
      .then(defer(self(), [=]() { return launchExecutorProcess(containerId); }))
      .then(defer(self(), [=](pid_t pid) {
        return reapExecutor(containerId, pid);
      }));
  }
{code}

A custom executor is started in a docker container otherwise:
{code}
return container.get()->launch = fetch(containerId, slaveId)
  .then(defer(self(), [=]() { return pull(containerId); }))
  .then(defer(self(), [=]() {
    return mountPersistentVolumes(containerId);
  }))
  .then(defer(self(), [=]() {
    return launchExecutorContainer(containerId, containerName);
  }))
  .then(defer(self(), [=](const Docker::Container& dockerContainer) {
    // Call update to set CPU/CFS/mem quotas at launch.
    // TODO(steveniemitz): Once the minimum docker version supported
    // is >= 1.7 this can be changed to pass --cpu-period and
    // --cpu-quota to the 'docker run' call in
    // launchExecutorContainer.
    return update(containerId, executorInfo.resources(), true)
      .then([=]() {
        return Future<Docker::Container>(dockerContainer);
      });
  }))
  .then(defer(self(), [=](const Docker::Container& dockerContainer) {
    return checkpointExecutor(containerId, dockerContainer);
  }))
  .then(defer(self(), [=](pid_t pid) {
    return reapExecutor(containerId, pid);
  }));
{code}

I will investigate the root cause of this problem further.
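To make the suspected failure path concrete, here is a purely illustrative sketch (not the actual mesos-docker-executor source) of how mapping every non-zero container wait status to {{TASK_FAILED}} would turn an external {{docker stop}} (which terminates the task with SIGTERM) into a failed task:

{code}
#include <sys/wait.h>

#include <mesos/mesos.hpp>

// Illustrative only: if the executor classifies the reaped container's
// wait status like this, a container stopped from outside (non-zero
// status because of SIGTERM) is reported as TASK_FAILED even though it
// was stopped deliberately.
mesos::TaskState classify(int wstatus)
{
  if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0) {
    return mesos::TASK_FINISHED;
  }

  // `docker stop` lands here: the process is killed by SIGTERM (or
  // SIGKILL after the stop timeout), so the status is not a clean exit.
  return mesos::TASK_FAILED;
}
{code}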


was (Author: liqlin):
I actually used docker containerizer {code}--containerizers=docker,mesos{code} 
in my case, rather than Mesos containerizer you said. And I also debugged Mesos 
Docker containerizer {code}DockerContainerizerProcess::launch(...){code} to 
verify build-in docker executor is created instead of customised executor 
started in a docker container.

Start default docker executor if taskInfo is set:
{code}
  if (taskInfo.isSome() && flags.docker_mesos_image.isNone()) {
// Launching task by forking a subprocess to run docker executor.
// TODO(steveniemitz): We should call 'update' to set CPU/CFS/mem
// quotas after 'launchExecutorProcess'. However, there is a race
// where 'update' can be called before mesos-docker-executor
// creates the Docker container for the task. See more details in
// the comments of r33174.
return container.get()->launch = fetch(containerId, slaveId)
  .then(defer(self(), [=]() { return pull(containerId); }))
  .then(defer(self(), [=]() {
return mountPersistentVolumes(containerId);
  }))
  .then(defer(self(), [=]() { return launchExecutorProcess(containerId); }))
  .then(defer(self(), [=](pid_t pid) {
return reapExecutor(containerId, pid);
  }));
  }
{code}

Start custom executor in a docker container:
{code}
return container.get()->launch = fetch(containerId, slaveId)
.then(defer(self(), [=]() { return pull(containerId); }))
.then(defer(self(), [=]() {
  return mountPersistentVolumes(containerId);
}))
.then(defer(self(), [=]() {
  return launchExecutorContainer(containerId, containerName);
}))
.then(defer(self(), [=](const Docker::Container& dockerContainer) {
  // Call update to set CPU/CFS/mem quotas at launch.
  // TODO(steveniemitz): Once the minimum docker version supported
  // is >= 1.7 this can be changed to pass --cpu-period and
  // --cpu-quota to the 'docker run' call in
  // launchExecutorContainer.
  return update(containerId, executorInfo.resources(), true)
.then([=]() {
return Future<Docker::Container>(dockerContainer);
});
}))
.then(defer(self(), [=](const Docker::Container& dockerContainer) {
  return checkpointExecutor(containerId, do

[jira] [Commented] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-04-13 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239074#comment-15239074
 ] 

Liqiang Lin commented on MESOS-5188:


I actually used docker containerizer {code}--containerizers=docker,mesos{code} 
in my case, rather than Mesos containerizer you said. And I also debugged Mesos 
Docker containerizer {code}DockerContainerizerProcess::launch(...){code} to 
verify build-in docker executor is created instead of customised executor 
started in a docker container.

Start default docker executor if taskInfo is set:
{code}
  if (taskInfo.isSome() && flags.docker_mesos_image.isNone()) {
// Launching task by forking a subprocess to run docker executor.
// TODO(steveniemitz): We should call 'update' to set CPU/CFS/mem
// quotas after 'launchExecutorProcess'. However, there is a race
// where 'update' can be called before mesos-docker-executor
// creates the Docker container for the task. See more details in
// the comments of r33174.
return container.get()->launch = fetch(containerId, slaveId)
  .then(defer(self(), [=]() { return pull(containerId); }))
  .then(defer(self(), [=]() {
return mountPersistentVolumes(containerId);
  }))
  .then(defer(self(), [=]() { return launchExecutorProcess(containerId); }))
  .then(defer(self(), [=](pid_t pid) {
return reapExecutor(containerId, pid);
  }));
  }
{code}

Start custom executor in a docker container:
{code}
return container.get()->launch = fetch(containerId, slaveId)
.then(defer(self(), [=]() { return pull(containerId); }))
.then(defer(self(), [=]() {
  return mountPersistentVolumes(containerId);
}))
.then(defer(self(), [=]() {
  return launchExecutorContainer(containerId, containerName);
}))
.then(defer(self(), [=](const Docker::Container& dockerContainer) {
  // Call update to set CPU/CFS/mem quotas at launch.
  // TODO(steveniemitz): Once the minimum docker version supported
  // is >= 1.7 this can be changed to pass --cpu-period and
  // --cpu-quota to the 'docker run' call in
  // launchExecutorContainer.
  return update(containerId, executorInfo.resources(), true)
.then([=]() {
return Future<Docker::Container>(dockerContainer);
});
}))
.then(defer(self(), [=](const Docker::Container& dockerContainer) {
  return checkpointExecutor(containerId, dockerContainer);
}))
.then(defer(self(), [=](pid_t pid) {
  return reapExecutor(containerId, pid);
}));
{code}

Will investigate more about root cause of this problem.

> docker executor thinks task is failed when docker container was stopped
> ---
>
> Key: MESOS-5188
> URL: https://issues.apache.org/jira/browse/MESOS-5188
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER ID   IMAGE    COMMAND       CREATED         STATUS         PORTS   NAMES
> b4813ba3ed4d   ubuntu   "sleep 300"   9 seconds ago   Up 8 seconds           mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Found the task is failed. See Mesos slave log,
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 
> for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
> framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
>  to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 
> of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
> executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
> withi

[jira] [Commented] (MESOS-5184) Mesos does not validate role info when framework registered with specified role

2016-04-12 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238525#comment-15238525
 ] 

Liqiang Lin commented on MESOS-5184:


Yes. In MESOS-2210 we have role validation that disallows certain special characters
in role names, e.g. "/", ".", "..", names starting with "-", etc. I think we also
need to validate the role name when a framework registers.
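For illustration, a minimal sketch of the kind of check I mean (a hypothetical helper, not the actual Mesos validation code), rejecting the characters discussed in MESOS-2210:

{code}
#include <cctype>
#include <string>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Hypothetical role-name validator: rejects empty names, ".", "..",
// names starting with '-', and names containing '/' or whitespace.
Option<Error> validateRole(const std::string& role)
{
  if (role.empty() || role == "." || role == "..") {
    return Error("Role '" + role + "' is not a valid role name");
  }

  if (role.front() == '-') {
    return Error("Role '" + role + "' must not start with '-'");
  }

  for (char c : role) {
    if (c == '/' || isspace(static_cast<unsigned char>(c))) {
      return Error("Role '" + role + "' contains an invalid character");
    }
  }

  return None();
}
{code}

With such a check wired into subscription handling, registering with the role "/test/test1" from the description below would be rejected instead of accepted.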

> Mesos does not validate role info when framework registered with specified 
> role
> ---
>
> Key: MESOS-5184
> URL: https://issues.apache.org/jira/browse/MESOS-5184
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> When framework registered with specified role, Mesos does not validate the 
> role info. It will accept the subscription and send unreserved resources as 
> offer to the framework.
> {code}
> # cat register.json
> {
> "framework_id": {"value" : "test1"},
> "type":"SUBSCRIBE",
> "subscribe":{
> "framework_info":{
> "user":"root",
> "name":"test1",
> "failover_timeout":60,
> "role":"/test/test1",
> "id":{"value":"test1"},
> "principal":"test1",
> "capabilities":[{"type":"REVOCABLE_RESOURCES"}]
> },
> "force":true
> }
> }
> # curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
> application/json" -X POST -d @register.json
> * Hostname was NOT found in DNS cache
> *   Trying 192.168.56.110...
> * Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> > POST /api/v1/scheduler HTTP/1.1
> > User-Agent: curl/7.35.0
> > Host: 192.168.56.110:5050
> > Accept: */*
> > Content-type: application/json
> > Content-Length: 265
> >
> * upload completely sent off: 265 out of 265 bytes
> < HTTP/1.1 200 OK
> < Date: Wed, 06 Apr 2016 21:34:18 GMT
> < Transfer-Encoding: chunked
> < Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
> < Content-Type: application/json
> <
> 69
> {"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
> {"type":"HEARTBEAT"}1531
> {"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos2"},"type":"TEXT"}],"framework_id":{"value":"test1"},"hostname":"mesos2","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O0"},"resources":[{"name":"disk","role":"*","scalar":{"value":20576.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos2","ip":"192.168.56.110","port":5051},"path":"\/slave(1)","scheme":"http"}},{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S1"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos1"},"type":"TEXT"}],"framework_id":{"v
> alue":"test1"},"hostname":"mesos1","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O1"},"resources":[{"name":"disk","role":"*","scalar":{"value":21468.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos1","ip":"192.168.56.111","port":5051},"path":"\/slave(1)","scheme":"http"}}]},"type":"OFFERS"}20
> {"type":"HEARTBEAT"}20
> {code}
> As you see,  the role under which framework register is "/test/test1", which 
> is an invalid role according to 
> [#MESOS-2210|https://issues.apache.org/jira/browse/MESOS-2210]
> And Mesos master log
> {code}
> I0407 05:34:18.132333 20672 master.cpp:2107] Received subscription request 
> for HTTP framework 'test1'
> I0407 05:34:18.133515 20672 master.cpp:2198] Subscribing framework 'test1' 
> with checkpointing disabled and capabilities [ REVOCABLE_RESOURCES ]
> I0407 05:34:18.135027 20674 hierarchical.cpp:264] Added framework test1
> I0407 05:34:18.138746 20672 master.cpp:5659] Sending 2 offers to framework 
> test1 (test1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-04-12 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238353#comment-15238353
 ] 

Liqiang Lin commented on MESOS-5188:


If that is the case, the executor should remove the stopped docker container when
the executor shuts down, rather than trying to stop the docker container.

> docker executor thinks task is failed when docker container was stopped
> ---
>
> Key: MESOS-5188
> URL: https://issues.apache.org/jira/browse/MESOS-5188
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER ID   IMAGE    COMMAND       CREATED         STATUS         PORTS   NAMES
> b4813ba3ed4d   ubuntu   "sleep 300"   9 seconds ago   Up 8 seconds           mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Found the task is failed. See Mesos slave log,
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 
> for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
> framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
>  to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 
> of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
> executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
> within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max 
> allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at 
> master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. 
> Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor 
> '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master 
> master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed 
> resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to 
> scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources 
> from  to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor 
> '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is 
> identical to existing resources
> I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task 
> '99ee7dc74861' to executor '99ee7dc74861' of framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- at executor(1)@192.168.56.110:40725
> I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update 
> TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5

[jira] [Created] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-04-12 Thread Liqiang Lin (JIRA)
Liqiang Lin created MESOS-5188:
--

 Summary: docker executor thinks task is failed when docker 
container was stopped
 Key: MESOS-5188
 URL: https://issues.apache.org/jira/browse/MESOS-5188
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 0.28.0
Reporter: Liqiang Lin
 Fix For: 0.29.0


Test cases:
1. Launch a task with Swarm (on Mesos).

{code}
# docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
{code}

2. Then stop the docker container.

{code}
# docker -H 192.168.56.110:54375 ps
CONTAINER ID   IMAGE    COMMAND       CREATED         STATUS         PORTS   NAMES
b4813ba3ed4d   ubuntu   "sleep 300"   9 seconds ago   Up 8 seconds           mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958

# docker -H 192.168.56.110:54375 stop b4813ba3ed4d
b4813ba3ed4d
{code}

3. Observe that the task has failed. See the Mesos slave log:

{code}
I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 for 
framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
'/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
 to user 'root'
I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 of 
framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources cpus(*):0.1; 
mem(*):32 in work directory 
'/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
within 75secs
I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max 
allowed age: 2.342613645432778days
I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at 
master@192.168.56.110:5050
I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. Attempting 
to register without authentication
I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending 
status updates
I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending 
status updates
I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container 
'250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor 
'99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master 
master@192.168.56.110:5050
I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed 
resources
I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 
5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to 
scheduler(1)@192.168.56.110:53375
I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending 
status updates
I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources from 
 to
I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending 
status updates
I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor 
'99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
executor(1)@192.168.56.110:40725
I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container 
'250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is 
identical to existing resources
I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task '99ee7dc74861' 
to executor '99ee7dc74861' of framework 
5b84aad8-dd60-40b3-84c2-93be6b7aa81c- at executor(1)@192.168.56.110:40725
I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update TASK_RUNNING 
(UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 
5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from executor(1)@192.168.56.110:40725
I0407 09:12:21.308820 32308 status_update_manager.cpp:320] Received status 
update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 
99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
I0407 09:12:21.310058 32308 slave.cpp:3593] Forwarding the update TASK_RUNNING 
(UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 
5b84aad8-dd60-40b3-84c2-93be6b7aa

[jira] [Updated] (MESOS-5184) Mesos does not validate role info when framework registered with specified role

2016-04-11 Thread Liqiang Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liqiang Lin updated MESOS-5184:
---
Description: 
When a framework registers with a specified role, Mesos does not validate the role
info. It accepts the subscription and sends unreserved resources as offers to the
framework.

{code}
# cat register.json
{
"framework_id": {"value" : "test1"},
"type":"SUBSCRIBE",
"subscribe":{
"framework_info":{
"user":"root",
"name":"test1",
"failover_timeout":60,
"role":"/test/test1",
"id":{"value":"test1"},
"principal":"test1",
"capabilities":[{"type":"REVOCABLE_RESOURCES"}]
},
"force":true
}
}

# curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
application/json" -X POST -d @register.json
* Hostname was NOT found in DNS cache
*   Trying 192.168.56.110...
* Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> POST /api/v1/scheduler HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 192.168.56.110:5050
> Accept: */*
> Content-type: application/json
> Content-Length: 265
>
* upload completely sent off: 265 out of 265 bytes
< HTTP/1.1 200 OK
< Date: Wed, 06 Apr 2016 21:34:18 GMT
< Transfer-Encoding: chunked
< Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
< Content-Type: application/json
<
69
{"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
{"type":"HEARTBEAT"}1531
{"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos2"},"type":"TEXT"}],"framework_id":{"value":"test1"},"hostname":"mesos2","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O0"},"resources":[{"name":"disk","role":"*","scalar":{"value":20576.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos2","ip":"192.168.56.110","port":5051},"path":"\/slave(1)","scheme":"http"}},{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S1"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos1"},"type":"TEXT"}],"framework_id":{"v
alue":"test1"},"hostname":"mesos1","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O1"},"resources":[{"name":"disk","role":"*","scalar":{"value":21468.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos1","ip":"192.168.56.111","port":5051},"path":"\/slave(1)","scheme":"http"}}]},"type":"OFFERS"}20
{"type":"HEARTBEAT"}20
{code}

As you can see, the role under which the framework registered is "/test/test1",
which is an invalid role according to
[#MESOS-2210|https://issues.apache.org/jira/browse/MESOS-2210].

And the Mesos master log:

{code}
I0407 05:34:18.132333 20672 master.cpp:2107] Received subscription request for 
HTTP framework 'test1'
I0407 05:34:18.133515 20672 master.cpp:2198] Subscribing framework 'test1' with 
checkpointing disabled and capabilities [ REVOCABLE_RESOURCES ]
I0407 05:34:18.135027 20674 hierarchical.cpp:264] Added framework test1
I0407 05:34:18.138746 20672 master.cpp:5659] Sending 2 offers to framework 
test1 (test1)
{code}


  was:
When framework registered with specified role, Mesos does not validate the role 
info. It will accept the subscription and send unreserved resources as offer to 
the framework.

# cat register.json
{
"framework_id": {"value" : "test1"},
"type":"SUBSCRIBE",
"subscribe":{
"framework_info":{
"user":"root",
"name":"test1",
"failover_timeout":60,
"role":"/test/test1",
"id":{"value":"test1"},
"principal":"test1",
"capabilities":[{"type":"REVOCABLE_RESOURCES"}]
},
"force":true
}
}

# curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
application/json" -X POST -d @register.json
* Hostname was NOT found in DNS cache
*   Trying 192.168.56.110...
* Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> POST /api/v1/scheduler HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 192.168.56.110:5050
> Accept: */*
> Content-type: application/json
> Content-Length: 265
>
* upload completely sent off: 265 out of 265 bytes
< HTTP/1.1 200 OK
< Date: Wed, 06 Apr 2016 21:34:18 GMT
< Transfer-Encoding: chunked
< Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
< Content-Type: application/json
<
69
{"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
{"type":"HEARTBEAT"}1531
{"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"

[jira] [Created] (MESOS-5184) Mesos does not validate role info when framework registered with specified role

2016-04-11 Thread Liqiang Lin (JIRA)
Liqiang Lin created MESOS-5184:
--

 Summary: Mesos does not validate role info when framework 
registered with specified role
 Key: MESOS-5184
 URL: https://issues.apache.org/jira/browse/MESOS-5184
 Project: Mesos
  Issue Type: Bug
  Components: general
Affects Versions: 0.28.0
Reporter: Liqiang Lin
 Fix For: 0.29.0


When a framework registers with a specified role, Mesos does not validate the role
info. It accepts the subscription and sends unreserved resources as offers to the
framework.

# cat register.json
{
"framework_id": {"value" : "test1"},
"type":"SUBSCRIBE",
"subscribe":{
"framework_info":{
"user":"root",
"name":"test1",
"failover_timeout":60,
"role":"/test/test1",
"id":{"value":"test1"},
"principal":"test1",
"capabilities":[{"type":"REVOCABLE_RESOURCES"}]
},
"force":true
}
}

# curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
application/json" -X POST -d @register.json
* Hostname was NOT found in DNS cache
*   Trying 192.168.56.110...
* Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> POST /api/v1/scheduler HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 192.168.56.110:5050
> Accept: */*
> Content-type: application/json
> Content-Length: 265
>
* upload completely sent off: 265 out of 265 bytes
< HTTP/1.1 200 OK
< Date: Wed, 06 Apr 2016 21:34:18 GMT
< Transfer-Encoding: chunked
< Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
< Content-Type: application/json
<
69
{"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
{"type":"HEARTBEAT"}1531
{"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos2"},"type":"TEXT"}],"framework_id":{"value":"test1"},"hostname":"mesos2","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O0"},"resources":[{"name":"disk","role":"*","scalar":{"value":20576.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos2","ip":"192.168.56.110","port":5051},"path":"\/slave(1)","scheme":"http"}},{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S1"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos1"},"type":"TEXT"}],"framework_id":{"v
alue":"test1"},"hostname":"mesos1","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O1"},"resources":[{"name":"disk","role":"*","scalar":{"value":21468.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos1","ip":"192.168.56.111","port":5051},"path":"\/slave(1)","scheme":"http"}}]},"type":"OFFERS"}20
{"type":"HEARTBEAT"}20

As you can see, the role under which the framework registered is "/test/test1",
which is an invalid role according to
[#MESOS-2210|https://issues.apache.org/jira/browse/MESOS-2210].

And the Mesos master log:

I0407 05:34:18.132333 20672 master.cpp:2107] Received subscription request for 
HTTP framework 'test1'
I0407 05:34:18.133515 20672 master.cpp:2198] Subscribing framework 'test1' with 
checkpointing disabled and capabilities [ REVOCABLE_RESOURCES ]
I0407 05:34:18.135027 20674 hierarchical.cpp:264] Added framework test1
I0407 05:34:18.138746 20672 master.cpp:5659] Sending 2 offers to framework 
test1 (test1)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5182) mesos-executor (CommandScheduler) does not accept offer with revocable resources

2016-04-11 Thread Liqiang Lin (JIRA)
Liqiang Lin created MESOS-5182:
--

 Summary: mesos-executor (CommandScheduler) does not accept offer 
with revocable resources
 Key: MESOS-5182
 URL: https://issues.apache.org/jira/browse/MESOS-5182
 Project: Mesos
  Issue Type: Bug
  Components: framework
Affects Versions: 0.28.0
Reporter: Liqiang Lin
 Fix For: 0.29.0


Currently mesos-executor (CommandScheduler) does not accept offers with revocable
resources, so this example framework cannot be used to verify cases that launch
tasks using revocable resources.
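
A minimal sketch of what a scheduler has to declare before the master will offer it revocable resources at all (illustrative values; the remaining gap for this example framework is that its offer matching would also need to account for revocable resources):

{code}
#include <mesos/mesos.hpp>

using mesos::FrameworkInfo;

// Opt the framework in to revocable resources via its FrameworkInfo.
FrameworkInfo framework;
framework.set_user("root");                // illustrative
framework.set_name("example-framework");   // illustrative
framework.add_capabilities()->set_type(
    FrameworkInfo::Capability::REVOCABLE_RESOURCES);
{code}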



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4441) Do not allocate non-revocable resources beyond quota guarantee.

2016-01-27 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118833#comment-15118833
 ] 

Liqiang Lin commented on MESOS-4441:


Why mark resources beyond the quota guarantee as revocable resources (perhaps with
an "extra" revocable label)? Will tasks using such revocable resources be evicted in
some cases? In which cases?

> Do not allocate non-revocable resources beyond quota guarantee.
> ---
>
> Key: MESOS-4441
> URL: https://issues.apache.org/jira/browse/MESOS-4441
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Michael Park
>Priority: Blocker
>  Labels: mesosphere
>
> h4. Status Quo
> Currently resources allocated to frameworks in a role with quota (aka 
> quota'ed role) beyond quota guarantee are marked non-revocable. This impacts 
> our flexibility for revoking them if we decide so in the future.
> h4. Proposal
> Once quota guarantee is satisfied we must not necessarily further allocate 
> resources as non-revocable. Instead we can mark all offers resources beyond 
> guarantee as revocable. When in the future {{RevocableInfo}} evolves 
> frameworks will get additional information about "revocability" of the 
> resource (i.e. allocation slack)
> h4. Caveats
> Though it seems like a simple change, it has several implications.
> h6. Fairness
> Currently the hierarchical allocator considers revocable resources as regular 
> resources when doing fairness calculations. This may prevent frameworks 
> getting non-revocable resources as part of their role's quota guarantee if 
> they accept some revocable resources as well.
> Consider the following scenario. A single framework in a role with quota set 
> to {{10}} CPUs is allocated {{10}} CPUs as non-revocable resources as part of 
> its quota and additionally {{2}} revocable CPUs. Now a task using {{2}} 
> non-revocable CPUs finishes and its resources are returned. Total allocation 
> for the role is {{8}} non-revocable + {{2}} revocable. However, the role may 
> not be offered additional {{2}} non-revocable since its total allocation 
> satisfies quota.
> h6. Resource math
> If we allocate non-revocable resources as revocable, we should make sure we 
> do accounting right: either we should update total agent resources and mark 
> them as revocable as well, or bookkeep resources as non-revocable and convert 
> them to revocable when necessary.
> h6. Coarse-grained nature of allocation
> The hierarchical allocator performs "coarse-grained" allocation, meaning it 
> always allocates the entire remaining agent resources to a single framework. 
> This may lead to over-allocating some resources as non-revocable beyond quota 
> guarantee.
> h6. Quotas smaller than fair share
> If a quota set for a role is smaller than its fair share, it may reduce the 
> amount of resources offered to this role, if frameworks in it do not accept 
> revocable resources. This is probably the most important consequence of the 
> proposed change. Operators may set quota to get guarantees, but may observe a 
> decrease in amount of resources a role gets, which is not intuitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4441) Do not allocate non-revocable resources beyond quota guarantee.

2016-01-20 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109844#comment-15109844
 ] 

Liqiang Lin commented on MESOS-4441:


As Alexander reported, the remaining 60 CPUs may be allocated to frameworks as
revocable resources, so no resources are wasted.
[~alexr] But for the revocable resources in this case, will we have another DRF pass
for resource fairness? Can you elaborate on revoking such revocable resources in the
near future? Would that be to satisfy new quota requests?

> Do not allocate non-revocable resources beyond quota guarantee.
> ---
>
> Key: MESOS-4441
> URL: https://issues.apache.org/jira/browse/MESOS-4441
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: mesosphere
>
> h4. Status Quo
> Currently resources allocated to frameworks in a role with quota (aka 
> quota'ed role) beyond quota guarantee are marked non-revocable. This impacts 
> our flexibility for revoking them if we decide so in the future.
> h4. Proposal
> Once quota guarantee is satisfied we must not necessarily further allocate 
> resources as non-revocable. Instead we can mark all offers resources beyond 
> guarantee as revocable. When in the future {{RevocableInfo}} evolves 
> frameworks will get additional information about "revocability" of the 
> resource (i.e. allocation slack)
> h4. Caveats
> Though it seems like a simple change, it has several implications.
> h6. Fairness
> Currently the hierarchical allocator considers revocable resources as regular 
> resources when doing fairness calculations. This may prevent frameworks 
> getting non-revocable resources as part of their role's quota guarantee if 
> they accept some revocable resources as well.
> Consider the following scenario. A single framework in a role with quota set 
> to {{10}} CPUs is allocated {{10}} CPUs as non-revocable resources as part of 
> its quota and additionally {{2}} revocable CPUs. Now a task using {{2}} 
> non-revocable CPUs finishes and its resources are returned. Total allocation 
> for the role is {{8}} non-revocable + {{2}} revocable. However, the role may 
> not be offered additional {{2}} non-revocable since its total allocation 
> satisfies quota.
> h6. Resource math
> If we allocate non-revocable resources as revocable, we should make sure we 
> do accounting right: either we should update total agent resources and mark 
> them as revocable as well, or bookkeep resources as non-revocable and convert 
> them to revocable when necessary.
> h6. Coarse-grained nature of allocation
> The hierarchical allocator performs "coarse-grained" allocation, meaning it 
> always allocates the entire remaining agent resources to a single framework. 
> This may lead to over-allocating some resources as non-revocable beyond quota 
> guarantee.
> h6. Quotas smaller than fair share
> If a quota set for a role is smaller than its fair share, it may reduce the 
> amount of resources offered to this role, if frameworks in it do not accept 
> revocable resources. This is probably the most important consequence of the 
> proposed change. Operators may set quota to get guarantees, but may observe a 
> decrease in amount of resources a role gets, which is not intuitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2210) Disallow special characters in role.

2015-12-13 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055480#comment-15055480
 ] 

Liqiang Lin commented on MESOS-2210:


I saw your code diff, but I am not sure the patch has been accepted.
What about the visible characters "\ / : * ? & " ' ` < > | ~" from your previous
comment? Or should we just adopt Adam's proposal?

Thanks.

> Disallow special characters in role.
> 
>
> Key: MESOS-2210
> URL: https://issues.apache.org/jira/browse/MESOS-2210
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: haosdent
>  Labels: mesosphere, newbie, persistent-volumes
>
> As we introduce persistent volumes in MESOS-1524, we will use roles as 
> directory names on the slave (https://reviews.apache.org/r/28562/). As a 
> result, the master should disallow special characters (like space and slash) 
> in role.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2210) Disallow special characters in role.

2015-12-13 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055448#comment-15055448
 ] 

Liqiang Lin commented on MESOS-2210:


Is there any update on the patch? Has the list of characters that will be disallowed
in role names been finalized?

Thanks.

> Disallow special characters in role.
> 
>
> Key: MESOS-2210
> URL: https://issues.apache.org/jira/browse/MESOS-2210
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: haosdent
>  Labels: mesosphere, newbie, persistent-volumes
>
> As we introduce persistent volumes in MESOS-1524, we will use roles as 
> directory names on the slave (https://reviews.apache.org/r/28562/). As a 
> result, the master should disallow special characters (like space and slash) 
> in role.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-23 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970586#comment-14970586
 ] 

Liqiang Lin commented on MESOS-3747:


[~nnielsen] Which tests depend on {{paths.cpp:createExecutorDirectory}} using a user
name that may not exist on the test machine? Can you point me to them?
My proposed fixes (see the sketch after this list):

1) Return an Error from {{paths.cpp:createExecutorDirectory}} when {{chown}} fails.
2) Validate that "CommandInfo.user" or "FrameworkInfo.user" exists if the
"--switch_user" flag is set to true. If validation fails, return "user does not
exist" as the failure reason to the framework.
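
A minimal sketch of fix (1), with a deliberately simplified signature (the real {{paths.cpp}} helper takes the full set of path components); the point is only that the {{chown}} failure becomes an Error instead of a logged warning:

{code}
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/option.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

// Simplified sketch: create the executor directory and, if a user is
// given, chown it to that user, surfacing any failure to the caller.
Try<std::string> createExecutorDirectory(
    const std::string& directory,
    const Option<std::string>& user)
{
  Try<Nothing> mkdir = os::mkdir(directory);
  if (mkdir.isError()) {
    return Error("Failed to create executor directory: " + mkdir.error());
  }

  if (user.isSome()) {
    Try<Nothing> chown = os::chown(user.get(), directory);
    if (chown.isError()) {
      // Proposed change: fail here rather than continuing with a
      // directory owned by the wrong user.
      return Error("Failed to chown executor directory: " + chown.error());
    }
  }

  return directory;
}
{code}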

[~vinodkone] What are your suggestions?

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent processes is running as, this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
>

[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-21 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966463#comment-14966463
 ] 

Liqiang Lin commented on MESOS-3747:


RR: https://reviews.apache.org/r/39514/

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent processes is running as, this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping 
> launch
> I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'
> E1015 13:15:34.264516 19641 slave.cpp:3342] Container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'

[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-21 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966408#comment-14966408
 ] 

Liqiang Lin commented on MESOS-3747:


Yes. If --[no-]switch_user is false, tasks will run as the same user as the Mesos
agent process. Neither the framework scheduler nor the Mesos master can know which
Mesos agents have --[no-]switch_user set to true and which have it set to false, so
we should pass the framework user info anyway and let the Mesos agent decide whether
to switch to the framework user. If the framework user does not exist on that Mesos
agent, just fail the framework's tasks, as [~gyliu] posted.

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent processes is running as, this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container 

[jira] [Assigned] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-15 Thread Liqiang Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liqiang Lin reassigned MESOS-3747:
--

Assignee: Liqiang Lin

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent processes is running as, this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping 
> launch
> I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'
> E1015 13:15:34.264516 19641 slave.cpp:3342] Container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-' failed to start: Failed to 
> prepare isolator: Failed to get us

[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-15 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960213#comment-14960213
 ] 

Liqiang Lin commented on MESOS-3747:


Both mesos::v1::scheduler::MesosProcess::_send(...) and 
Master::Http::scheduler(...) call the same validation, 
validation::scheduler::call::validate(call). If we add a non-empty check for 
FrameworkInfo.user there, the master should never respond with 400 Bad Request, 
because the scheduler's subscribe request would not be sent out to the master 
in the first place.
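To make this concrete, here is a rough sketch of the kind of check I mean. This 
is not the actual Mesos code: the header paths, the helper name, and the error 
text are assumptions, and the real check would sit inside 
validation::scheduler::call::validate(call).
{code}
// Sketch only: the real check would live in the shared call validation
// (validation::scheduler::call::validate); header paths, the helper name,
// and the error text are assumptions.
#include <mesos/v1/scheduler/scheduler.hpp>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

Option<Error> validateSubscribeUser(const mesos::v1::scheduler::Call& call)
{
  if (call.type() == mesos::v1::scheduler::Call::SUBSCRIBE) {
    // Reject an empty user here, so both MesosProcess::_send(...) and
    // Master::Http::scheduler(...) fail the call locally and the master
    // never has to answer with 400 Bad Request.
    if (call.subscribe().framework_info().user().empty()) {
      return Error("'FrameworkInfo.user' must not be an empty string");
    }
  }

  return None();
}
{code}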

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Priority: Blocker
>
> When using libmesos, a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent process is running as; this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs are below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host; instead, it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping 
> launch
> I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b

[jira] [Commented] (MESOS-1848) DRFAllocatorTest.DRFAllocatorProcess is flaky

2015-10-13 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956386#comment-14956386
 ] 

Liqiang Lin commented on MESOS-1848:


I suggest closing this bug since it is too old to match the current test cases; 
I can no longer find the related test code.

$ ./mesos-tests  --gtest_filter=DRFAllocatorTest.DRFAllocatorProcess
Source directory: /Users/liqlin/code/mesos
Build directory: /Users/liqlin/code/mesos/build
-
We cannot run any Docker tests because:
Docker tests not supported on non-Linux systems
-
/usr/bin/nc
Note: Google Test filter = 
DRFAllocatorTest.DRFAllocatorProcess-HealthCheckTest.ROOT_DOCKER_DockerHealthyTask:HealthCheckTest.ROOT_DOCKER_DockerHealthStatusChange:HookTest.ROOT_DOCKER_VerifySlavePreLaunchDockerHook:SlaveTest.ROOT_RunTaskWithCommandInfoWithoutUser:SlaveTest.DISABLED_ROOT_RunTaskWithCommandInfoWithUser:DockerContainerizerTest.ROOT_DOCKER_Launch:DockerContainerizerTest.ROOT_DOCKER_Kill:DockerContainerizerTest.ROOT_DOCKER_Usage:DockerContainerizerTest.ROOT_DOCKER_Recover:DockerContainerizerTest.ROOT_DOCKER_SkipRecoverNonDocker:DockerContainerizerTest.ROOT_DOCKER_Logs:DockerContainerizerTest.ROOT_DOCKER_Default_CMD:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Override:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Args:DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer:DockerContainerizerTest.DISABLED_ROOT_DOCKER_SlaveRecoveryExecutorContainer:DockerContainerizerTest.ROOT_DOCKER_NC_PortMapping:DockerContainerizerTest.ROOT_DOCKER_LaunchSandboxWithColon:DockerContainerizerTest.ROOT_DOCKER_DestroyWhileFetching:DockerContainerizerTest.ROOT_DOCKER_DestroyWhilePulling:DockerContainerizerTest.ROOT_DOCKER_ExecutorCleanupWhenLaunchFailed:DockerContainerizerTest.ROOT_DOCKER_FetchFailure:DockerContainerizerTest.ROOT_DOCKER_DockerPullFailure:DockerContainerizerTest.ROOT_DOCKER_DockerInspectDiscard:DockerTest.ROOT_DOCKER_interface:DockerTest.ROOT_DOCKER_parsing_version:DockerTest.ROOT_DOCKER_CheckCommandWithShell:DockerTest.ROOT_DOCKER_CheckPortResource:DockerTest.ROOT_DOCKER_CancelPull:DockerTest.ROOT_DOCKER_MountRelative:DockerTest.ROOT_DOCKER_MountAbsolute:CopyBackendTest.ROOT_CopyBackend:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/0:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/1:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/2:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/3:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/4:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/5:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/6:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/7:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/8:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/9:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/10:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/11:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/12:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/15:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/16:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/17:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/18:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/19:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/20:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/21:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/22:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/23:SlaveAndFramew
orkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/24:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/25:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/26:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/27:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/28:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/29:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/30:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/31:SlaveAndFrameworkCount

[jira] [Assigned] (MESOS-1848) DRFAllocatorTest.DRFAllocatorProcess is flaky

2015-10-13 Thread Liqiang Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liqiang Lin reassigned MESOS-1848:
--

Assignee: Liqiang Lin

> DRFAllocatorTest.DRFAllocatorProcess is flaky
> -
>
> Key: MESOS-1848
> URL: https://issues.apache.org/jira/browse/MESOS-1848
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Fedora 20
>Reporter: Vinod Kone
>Assignee: Liqiang Lin
>  Labels: flaky
>
> Observed this on CI. This is pretty strange because the authentication of 
> both the framework and slave timed out at the very beginning, even though we 
> don't manipulate clocks.
> {code}
> [ RUN  ] DRFAllocatorTest.DRFAllocatorProcess
> Using temporary directory '/tmp/DRFAllocatorTest_DRFAllocatorProcess_igiR9X'
> I0929 20:11:12.801327 16997 leveldb.cpp:176] Opened db in 489720ns
> I0929 20:11:12.801627 16997 leveldb.cpp:183] Compacted db in 168280ns
> I0929 20:11:12.801784 16997 leveldb.cpp:198] Created db iterator in 5820ns
> I0929 20:11:12.801898 16997 leveldb.cpp:204] Seeked to beginning of db in 
> 1285ns
> I0929 20:11:12.802039 16997 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 792ns
> I0929 20:11:12.802160 16997 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0929 20:11:12.802441 17012 recover.cpp:425] Starting replica recovery
> I0929 20:11:12.802623 17012 recover.cpp:451] Replica is in EMPTY status
> I0929 20:11:12.803251 17012 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0929 20:11:12.803427 17012 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0929 20:11:12.803632 17012 recover.cpp:542] Updating replica status to 
> STARTING
> I0929 20:11:12.803911 17012 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 33999ns
> I0929 20:11:12.804033 17012 replica.cpp:320] Persisted replica status to 
> STARTING
> I0929 20:11:12.804245 17012 recover.cpp:451] Replica is in STARTING status
> I0929 20:11:12.804592 17012 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0929 20:11:12.804775 17012 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0929 20:11:12.804952 17012 recover.cpp:542] Updating replica status to VOTING
> I0929 20:11:12.805115 17012 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 15990ns
> I0929 20:11:12.805234 17012 replica.cpp:320] Persisted replica status to 
> VOTING
> I0929 20:11:12.805366 17012 recover.cpp:556] Successfully joined the Paxos 
> group
> I0929 20:11:12.805539 17012 recover.cpp:440] Recover process terminated
> I0929 20:11:12.809062 17017 master.cpp:312] Master 
> 20140929-201112-2759502016-47295-16997 (fedora-20) started on 
> 192.168.122.164:47295
> I0929 20:11:12.809432 17017 master.cpp:358] Master only allowing 
> authenticated frameworks to register
> I0929 20:11:12.809546 17017 master.cpp:363] Master only allowing 
> authenticated slaves to register
> I0929 20:11:12.810169 17017 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/DRFAllocatorTest_DRFAllocatorProcess_igiR9X/credentials'
> I0929 20:11:12.810510 17017 master.cpp:392] Authorization enabled
> I0929 20:11:12.811841 17016 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0929 20:11:12.812099 17013 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@192.168.122.164:47295
> I0929 20:11:12.813006 17017 master.cpp:1241] The newly elected leader is 
> master@192.168.122.164:47295 with id 20140929-201112-2759502016-47295-16997
> I0929 20:11:12.813164 17017 master.cpp:1254] Elected as the leading master!
> I0929 20:11:12.813279 17017 master.cpp:1072] Recovering from registrar
> I0929 20:11:12.813487 17013 registrar.cpp:312] Recovering registrar
> I0929 20:11:12.813824 17013 log.cpp:656] Attempting to start the writer
> I0929 20:11:12.814256 17013 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0929 20:11:12.814419 17013 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 25049ns
> I0929 20:11:12.814581 17013 replica.cpp:342] Persisted promised to 1
> I0929 20:11:12.814909 17013 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0929 20:11:12.815340 17013 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0929 20:11:12.815497 17013 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 19855ns
> I0929 20:11:12.815636 17013 replica.cpp:676] Persisted action at 0
> I0929 20:11:12.816066 17013 replica.cpp:508] Replica received write request 
> for position 0
> I0929 20:11:12.816220 17013 leveldb.cpp:438] Reading

[jira] [Commented] (MESOS-1848) DRFAllocatorTest.DRFAllocatorProcess is flaky

2015-10-13 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956344#comment-14956344
 ] 

Liqiang Lin commented on MESOS-1848:


I'd like to take this JIRA, but I cannot reproduce this bug on my OS X machine.

> DRFAllocatorTest.DRFAllocatorProcess is flaky
> -
>
> Key: MESOS-1848
> URL: https://issues.apache.org/jira/browse/MESOS-1848
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Fedora 20
>Reporter: Vinod Kone
>  Labels: flaky
>
> Observed this on CI. This is pretty strange because the authentication of 
> both the framework and slave timed out at the very beginning, even though we 
> don't manipulate clocks.
> {code}
> [ RUN  ] DRFAllocatorTest.DRFAllocatorProcess
> Using temporary directory '/tmp/DRFAllocatorTest_DRFAllocatorProcess_igiR9X'
> I0929 20:11:12.801327 16997 leveldb.cpp:176] Opened db in 489720ns
> I0929 20:11:12.801627 16997 leveldb.cpp:183] Compacted db in 168280ns
> I0929 20:11:12.801784 16997 leveldb.cpp:198] Created db iterator in 5820ns
> I0929 20:11:12.801898 16997 leveldb.cpp:204] Seeked to beginning of db in 
> 1285ns
> I0929 20:11:12.802039 16997 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 792ns
> I0929 20:11:12.802160 16997 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0929 20:11:12.802441 17012 recover.cpp:425] Starting replica recovery
> I0929 20:11:12.802623 17012 recover.cpp:451] Replica is in EMPTY status
> I0929 20:11:12.803251 17012 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0929 20:11:12.803427 17012 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0929 20:11:12.803632 17012 recover.cpp:542] Updating replica status to 
> STARTING
> I0929 20:11:12.803911 17012 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 33999ns
> I0929 20:11:12.804033 17012 replica.cpp:320] Persisted replica status to 
> STARTING
> I0929 20:11:12.804245 17012 recover.cpp:451] Replica is in STARTING status
> I0929 20:11:12.804592 17012 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0929 20:11:12.804775 17012 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0929 20:11:12.804952 17012 recover.cpp:542] Updating replica status to VOTING
> I0929 20:11:12.805115 17012 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 15990ns
> I0929 20:11:12.805234 17012 replica.cpp:320] Persisted replica status to 
> VOTING
> I0929 20:11:12.805366 17012 recover.cpp:556] Successfully joined the Paxos 
> group
> I0929 20:11:12.805539 17012 recover.cpp:440] Recover process terminated
> I0929 20:11:12.809062 17017 master.cpp:312] Master 
> 20140929-201112-2759502016-47295-16997 (fedora-20) started on 
> 192.168.122.164:47295
> I0929 20:11:12.809432 17017 master.cpp:358] Master only allowing 
> authenticated frameworks to register
> I0929 20:11:12.809546 17017 master.cpp:363] Master only allowing 
> authenticated slaves to register
> I0929 20:11:12.810169 17017 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/DRFAllocatorTest_DRFAllocatorProcess_igiR9X/credentials'
> I0929 20:11:12.810510 17017 master.cpp:392] Authorization enabled
> I0929 20:11:12.811841 17016 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0929 20:11:12.812099 17013 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@192.168.122.164:47295
> I0929 20:11:12.813006 17017 master.cpp:1241] The newly elected leader is 
> master@192.168.122.164:47295 with id 20140929-201112-2759502016-47295-16997
> I0929 20:11:12.813164 17017 master.cpp:1254] Elected as the leading master!
> I0929 20:11:12.813279 17017 master.cpp:1072] Recovering from registrar
> I0929 20:11:12.813487 17013 registrar.cpp:312] Recovering registrar
> I0929 20:11:12.813824 17013 log.cpp:656] Attempting to start the writer
> I0929 20:11:12.814256 17013 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0929 20:11:12.814419 17013 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 25049ns
> I0929 20:11:12.814581 17013 replica.cpp:342] Persisted promised to 1
> I0929 20:11:12.814909 17013 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0929 20:11:12.815340 17013 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0929 20:11:12.815497 17013 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 19855ns
> I0929 20:11:12.815636 17013 replica.cpp:676] Persisted action at 0
> I0929 20:11:12.816066 17013 replica.cpp:508] Replica received write request 

[jira] [Assigned] (MESOS-3509) SlaveTest.TerminatingSlaveDoesNotReregister is flaky

2015-09-24 Thread Liqiang Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liqiang Lin reassigned MESOS-3509:
--

Assignee: Liqiang Lin

> SlaveTest.TerminatingSlaveDoesNotReregister is flaky
> 
>
> Key: MESOS-3509
> URL: https://issues.apache.org/jira/browse/MESOS-3509
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Jie Yu
>Assignee: Liqiang Lin
>
> Observed on Apache CI
> {noformat}
> [ RUN  ] SlaveTest.TerminatingSlaveDoesNotReregister
> Using temporary directory 
> '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn'
> I0924 04:54:06.645386 29977 leveldb.cpp:176] Opened db in 112.347533ms
> I0924 04:54:06.665618 29977 leveldb.cpp:183] Compacted db in 20.158157ms
> I0924 04:54:06.665704 29977 leveldb.cpp:198] Created db iterator in 24721ns
> I0924 04:54:06.665727 29977 leveldb.cpp:204] Seeked to beginning of db in 
> 3125ns
> I0924 04:54:06.665741 29977 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 256ns
> I0924 04:54:06.665799 29977 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0924 04:54:06.666378 3 recover.cpp:449] Starting replica recovery
> I0924 04:54:06.666831 3 recover.cpp:475] Replica is in EMPTY status
> I0924 04:54:06.667945 30008 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0924 04:54:06.668473 30005 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0924 04:54:06.669003 30007 recover.cpp:566] Updating replica status to 
> STARTING
> I0924 04:54:06.669154 3 master.cpp:370] Master 
> bd006350-cdbf-414f-9eef-25f03ccdc5fb (7c9c99874a2d) started on 
> 172.17.1.154:55676
> I0924 04:54:06.669347 3 master.cpp:372] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn/credentials"
>  --framework_sorter="drf" --help="false" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.25.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn/master" 
> --zk_session_timeout="10secs"
> I0924 04:54:06.669595 3 master.cpp:417] Master only allowing 
> authenticated frameworks to register
> I0924 04:54:06.669605 3 master.cpp:422] Master only allowing 
> authenticated slaves to register
> I0924 04:54:06.669613 3 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn/credentials'
> I0924 04:54:06.669874 3 master.cpp:461] Using default 'crammd5' 
> authenticator
> I0924 04:54:06.670027 3 master.cpp:498] Authorization enabled
> I0924 04:54:06.670264 30005 hierarchical.hpp:468] Initialized hierarchical 
> allocator process
> I0924 04:54:06.670320 30005 whitelist_watcher.cpp:79] No whitelist given
> I0924 04:54:06.671406 2 master.cpp:1597] The newly elected leader is 
> master@172.17.1.154:55676 with id bd006350-cdbf-414f-9eef-25f03ccdc5fb
> I0924 04:54:06.671438 2 master.cpp:1610] Elected as the leading master!
> I0924 04:54:06.671461 2 master.cpp:1370] Recovering from registrar
> I0924 04:54:06.671612 30011 registrar.cpp:309] Recovering registrar
> I0924 04:54:06.704048 30002 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 34.707125ms
> I0924 04:54:06.704123 30002 replica.cpp:323] Persisted replica status to 
> STARTING
> I0924 04:54:06.704504 29997 recover.cpp:475] Replica is in STARTING status
> I0924 04:54:06.705658 30002 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0924 04:54:06.706466 30007 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0924 04:54:06.707064 30004 recover.cpp:566] Updating replica status to VOTING
> I0924 04:54:06.737488 30004 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 30.261425ms
> I0924 04:54:06.737560 30004 replica.cpp:323] Persisted replica status to 
> VOTING
> I0924 04:54:06.737746 30004 recover.cpp:580] Successfully joined the Paxos 
> group
> I0924 04:54:06.737932 30004 recover.cpp:464] Recover process terminated
> I0924 04:54:06.738535 30009 log.cpp:661] Attempting to start the writer
> 

[jira] [Assigned] (MESOS-3481) Add const accessor to Master flags

2015-09-24 Thread Liqiang Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liqiang Lin reassigned MESOS-3481:
--

Assignee: Liqiang Lin

> Add const accessor to Master flags
> --
>
> Key: MESOS-3481
> URL: https://issues.apache.org/jira/browse/MESOS-3481
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Assignee: Liqiang Lin
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> It would make sense to have an accessor to the master's flags, especially for 
> tests.
> For example, see [this 
> test|https://github.com/apache/mesos/blob/2876b8c918814347dd56f6f87d461e414a90650a/src/tests/master_maintenance_tests.cpp#L1231-L1235].
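For illustration only, such an accessor could look roughly like the sketch 
below; the header path, member name, and namespaces are assumptions rather than 
the actual patch.
{code}
// Sketch only: header path, member name, and placement are assumptions,
// not the actual change.
#include "master/flags.hpp"

namespace mesos {
namespace internal {
namespace master {

class Master
{
public:
  // Const accessor so tests can inspect the flags the master was started
  // with, without being able to mutate them.
  const Flags& flags() const { return flags_; }

private:
  Flags flags_;
};

} // namespace master {
} // namespace internal {
} // namespace mesos {
{code}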



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3481) Add const accessor to Master flags

2015-09-24 Thread Liqiang Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liqiang Lin updated MESOS-3481:
---
Assignee: (was: Liqiang Lin)

> Add const accessor to Master flags
> --
>
> Key: MESOS-3481
> URL: https://issues.apache.org/jira/browse/MESOS-3481
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> It would make sense to have an accessor to the master's flags, especially for 
> tests.
> For example, see [this 
> test|https://github.com/apache/mesos/blob/2876b8c918814347dd56f6f87d461e414a90650a/src/tests/master_maintenance_tests.cpp#L1231-L1235].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3147) Allocator refactor

2015-09-10 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738591#comment-14738591
 ] 

Liqiang Lin commented on MESOS-3147:


I think the current default Mesos allocator does not actually track the real 
resource usage of each framework; it only cares about "allocated" resources, 
which include both the resources a framework is using and the resources it has 
been offered. When the Mesos allocator is integrated with a third-party 
scheduler (e.g., YARN) that only records used resources, it is difficult for 
that scheduler to decide how many resources are actually consumed when a 
framework launches tasks against an offer, because the current Mesos allocator 
has no information about a framework's used resources. With only the 
recoverResources() allocator interface, which takes the unused resources as a 
parameter, it is impossible to derive a framework's used resources. For this 
issue, does it make sense to add a callback in the allocator to report a 
framework's used resources? Any other suggestions?
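Purely as an illustration of the kind of callback suggested above (this is not 
an existing Mesos allocator method; the name and parameters are assumptions):
{code}
// Purely illustrative: this callback does not exist in the Mesos allocator
// interface today; the method name and parameters are assumptions.
#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

class UsageAwareAllocator
{
public:
  virtual ~UsageAwareAllocator() {}

  // ... the existing interface, e.g. recoverResources(...), only reports
  // the *unused* portion of an offer back to the allocator ...

  // Hypothetical addition: invoked when a framework launches tasks against
  // an offer, reporting the resources those tasks actually consume, so a
  // third-party scheduler (e.g. YARN) behind the allocator can track real
  // usage instead of only "allocated" resources.
  virtual void frameworkUsedResources(
      const mesos::FrameworkID& frameworkId,
      const mesos::SlaveID& slaveId,
      const mesos::Resources& used) = 0;
};
{code}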

> Allocator refactor
> --
>
> Key: MESOS-3147
> URL: https://issues.apache.org/jira/browse/MESOS-3147
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, master
>Reporter: Michael Park
>Assignee: Guangya Liu
>  Labels: mesosphere, tech-debt
>
> With new features such as dynamic reservation, persistent volume, quota, 
> optimistic offers, it has been apparent that we need to refactor the 
> allocator to
> 1. solidify the API (e.g. consolidate {{updateSlave}} and {{updateAvailable}})
> 2. possibly move the offer generation to the allocator from the master
> 3. support for allocator modules where the API involves returning 
> {{libprocess::Future}}
> The sequence of implementation challenges for dynamic reservation master 
> endpoints are captured in [this 
> document|https://docs.google.com/document/d/1cwVz4aKiCYP9Y4MOwHYZkyaiuEv7fArCye-vPvB2lAI/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1607) Introduce optimistic offers.

2015-09-10 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738431#comment-14738431
 ] 

Liqiang Lin commented on MESOS-1607:


Is there a design doc I can look at?

With optimistic offers, who will resolve the resource conflict when an offer is 
accepted: the allocator or the Mesos master? I think the run-task message should 
only be sent after the resource conflict has been resolved. If resolving the 
conflict is the allocator's responsibility, will a new allocator interface be 
introduced? If so, how would this new interface interact with the existing 
recoverResources() function?

Please add more design detail about how resource conflicts are resolved when an 
offer is accepted.
Thanks

> Introduce optimistic offers.
> 
>
> Key: MESOS-1607
> URL: https://issues.apache.org/jira/browse/MESOS-1607
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, framework, master
>Reporter: Benjamin Hindman
> Attachments: optimisitic-offers.pdf
>
>
> The current implementation of resource offers only enables a single framework 
> scheduler to make scheduling decisions for some available resources at a 
> time. In some circumstances, this is good, i.e., when we don't want other 
> framework schedulers to have access to some resources. However, in other 
> circumstances, there are advantages to letting multiple framework schedulers 
> attempt to make scheduling decisions for the _same_ allocation of resources 
> in parallel.
> If you think about this from a "concurrency control" perspective, the current 
> implementation of resource offers is _pessimistic_: the resources contained 
> within an offer are _locked_ until the framework scheduler that they were 
> offered to launches tasks with them or declines them. In addition to making 
> pessimistic offers we'd like to give out _optimistic_ offers, where the same 
> resources are offered to multiple framework schedulers at the same time, and 
> framework schedulers "compete" for those resources on a 
> first-come-first-serve basis (i.e., the first to launch a task "wins"). We've 
> always reserved the right to rescind resource offers using the 'rescind' 
> primitive in the API, and a framework scheduler should be prepared to launch 
> a task and have that task go lost because another framework has already started 
> to use those resources.
> Introducing optimistic offers will enable more sophisticated allocation 
> algorithms. For example, we can optimistically allocate resources that are 
> reserved for a particular framework (role) but are not being used. In 
> conjunction with revocable resources (the concept that using resources not 
> reserved for you means you might get those resources revoked) we can easily 
> create a "spot" market for unused resources, driving up utilization by 
> letting frameworks that are willing to use revocable resources run tasks.
> In the limit, one could imagine always making optimistic resource offers. 
> This bears a striking resemblance with the Google Omega model (an isomorphism 
> even). However, being able to configure what resources should be allocated 
> optimistically and what resources should be allocated pessimistically gives 
> even more control to a datacenter/cluster operator that might want to, for 
> example, never let multiple frameworks (roles) compete for some set of 
> resources.
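As a purely illustrative aside, the "first to launch wins" behavior described 
above could be sketched as below; none of these names come from Mesos, and the 
real mechanism (offers, rescinds, TASK_LOST) is far richer than this toy.
{code}
// Toy sketch of first-come-first-serve claiming for optimistically shared
// resources. All names are illustrative assumptions, not Mesos code.
#include <map>
#include <string>

struct Offer { std::string id; double cpus; double mem; };

class OptimisticAllocator
{
public:
  // The same underlying resources may back several outstanding offers
  // handed to different framework schedulers at once.
  void addOffer(const Offer& offer) { outstanding_[offer.id] = offer; }

  // The first framework to launch against the shared resources wins;
  // later attempts fail, and those frameworks should expect their offers
  // to be rescinded and their tasks to go lost.
  bool tryLaunch(const std::string& offerId)
  {
    if (outstanding_.find(offerId) == outstanding_.end()) {
      return false;  // Resources already claimed by another framework.
    }

    // Claim the resources and drop every other offer backed by them
    // (in this toy, all outstanding offers share the same resources).
    outstanding_.clear();
    return true;
  }

private:
  std::map<std::string, Offer> outstanding_;
};
{code}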



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)