[jira] [Created] (MESOS-7626) Create a CI job to publish the website

2017-06-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7626:
-

 Summary: Create a CI job to publish the website
 Key: MESOS-7626
 URL: https://issues.apache.org/jira/browse/MESOS-7626
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Vinod Kone


This job periodically scans for changes to the master branch of `mesos` and 
publishes an updated website to `mesos-site`.
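
A minimal sketch of the change-detection step, assuming the job remembers the last commit it published and compares it with the current head of `master`. The function names and the idea of storing the last published SHA are illustrative assumptions, not part of the ticket; the only thing taken as given is the `git ls-remote` output format (one `<sha>` and `<ref>` pair per line, separated by whitespace).

```python
def parse_ls_remote(output, ref="refs/heads/master"):
    """Return the commit SHA that `ref` points at in `git ls-remote` output, or None."""
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == ref:
            return parts[0]
    return None


def needs_publish(last_published_sha, ls_remote_output):
    """True when master points at a commit that has not been published yet."""
    head = parse_ls_remote(ls_remote_output)
    return head is not None and head != last_published_sha
```

A periodic job could feed the output of `git ls-remote <mesos repo URL>` into `needs_publish` and skip the run when it returns `False`.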



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7624) Move website from svn to git

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7624:
--
Shepherd: Vinod Kone
Story Points: 3
  Sprint: Mesosphere Sprint 57

File a repo request to create the "mesos-site" git repo.

Once it is created, we need to move the contents over from the svn repo to the git repo.

> Move website from svn to git
> 
>
> Key: MESOS-7624
> URL: https://issues.apache.org/jira/browse/MESOS-7624
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Move our website svn repo at https://svn.apache.org/repos/asf/mesos/site to a 
> git repo.
> Having a git repo for both the main project and the website lets us deal 
> with a single version control system. Git-based projects are also easy to 
> automate via CI (e.g., git commit) because ASF CI already has the required 
> credentials.





[jira] [Updated] (MESOS-7623) Automatically publish website through CI

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7623:
--
Sprint:   (was: Mesosphere Sprint 57)

> Automatically publish website through CI
> 
>
> Key: MESOS-7623
> URL: https://issues.apache.org/jira/browse/MESOS-7623
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Currently, publishing the website is a manual process: a committer runs a 
> local docker script, copies the generated `publish` folder into an svn 
> working copy, and runs `svn commit`. This is both cumbersome and error-prone.
> We should automate this process by running it as a CI job.





[jira] [Updated] (MESOS-7623) Automatically publish website through CI

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7623:
--
Story Points: 8  (was: 1)

> Automatically publish website through CI
> 
>
> Key: MESOS-7623
> URL: https://issues.apache.org/jira/browse/MESOS-7623
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Currently, publishing the website is a manual process: a committer runs a 
> local docker script, copies the generated `publish` folder into an svn 
> working copy, and runs `svn commit`. This is both cumbersome and error-prone.
> We should automate this process by running it as a CI job.





[jira] [Created] (MESOS-7625) Create script to automate publishing website

2017-06-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7625:
-

 Summary: Create script to automate publishing website
 Key: MESOS-7625
 URL: https://issues.apache.org/jira/browse/MESOS-7625
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Vinod Kone


This script will be run via ASF CI and will be responsible for:

1) checking out the latest master branch
2) building Mesos and generating the endpoint help
3) generating the website contents
4) publishing the website by doing a git commit to the `mesos-site` repo
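
The four steps above could be sketched as an ordered command plan. This is only an illustration under stated assumptions: the ticket does not name the actual build or site-generation commands, so `build_cmd` and `site_cmd` are supplied by the caller, and the sketch assumes the site-generation command writes its output into the `mesos-site` checkout. Only the `git` invocations are standard.

```python
import subprocess


def publish_plan(mesos_url, site_url, build_cmd, site_cmd, message):
    """Return the ordered commands for one publish run as (cwd, argv) pairs.

    build_cmd and site_cmd are hypothetical placeholders passed in by the
    caller; the ticket does not specify them.
    """
    return [
        # 1) check out the latest master branch (plus the site repo to publish into)
        (".", ["git", "clone", "--depth", "1", "--branch", "master", mesos_url, "mesos"]),
        (".", ["git", "clone", site_url, "mesos-site"]),
        # 2) build Mesos and generate the endpoint help
        ("mesos", list(build_cmd)),
        # 3) generate the website contents (assumed to land in ../mesos-site)
        ("mesos", list(site_cmd)),
        # 4) publish by committing to the site repo and pushing
        ("mesos-site", ["git", "add", "--all"]),
        ("mesos-site", ["git", "commit", "-m", message]),
        ("mesos-site", ["git", "push", "origin", "HEAD"]),
    ]


def run_plan(plan):
    """Run each command in order; stop at the first failure."""
    for cwd, argv in plan:
        subprocess.check_call(argv, cwd=cwd)
```

Note that `git commit` exits non-zero when there is nothing to commit, so a real job would probably want to check for changes before step 4.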





[jira] [Issue Comment Deleted] (MESOS-7623) Automatically publish website through CI

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7623:
--
Comment: was deleted

(was: File a repo request to create "mesos-site" git repo.)

> Automatically publish website through CI
> 
>
> Key: MESOS-7623
> URL: https://issues.apache.org/jira/browse/MESOS-7623
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Currently, publishing the website is a manual process: a committer runs a 
> local docker script, copies the generated `publish` folder into an svn 
> working copy, and runs `svn commit`. This is both cumbersome and error-prone.
> We should automate this process by running it as a CI job.





[jira] [Updated] (MESOS-7623) Automatically publish website through CI

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7623:
--
Shepherd: Vinod Kone
Story Points: 1
  Sprint: Mesosphere Sprint 57

> Automatically publish website through CI
> 
>
> Key: MESOS-7623
> URL: https://issues.apache.org/jira/browse/MESOS-7623
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Currently, publishing the website is a manual process: a committer runs a 
> local docker script, copies the generated `publish` folder into an svn 
> working copy, and runs `svn commit`. This is both cumbersome and error-prone.
> We should automate this process by running it as a CI job.





[jira] [Commented] (MESOS-7623) Automatically publish website through CI

2017-06-05 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037816#comment-16037816
 ] 

Vinod Kone commented on MESOS-7623:
---

File a repo request to create "mesos-site" git repo.

> Automatically publish website through CI
> 
>
> Key: MESOS-7623
> URL: https://issues.apache.org/jira/browse/MESOS-7623
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Currently, publishing the website is a manual process: a committer runs a 
> local docker script, copies the generated `publish` folder into an svn 
> working copy, and runs `svn commit`. This is both cumbersome and error-prone.
> We should automate this process by running it as a CI job.





[jira] [Created] (MESOS-7624) Move website from svn to git

2017-06-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7624:
-

 Summary: Move website from svn to git
 Key: MESOS-7624
 URL: https://issues.apache.org/jira/browse/MESOS-7624
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Vinod Kone


Move our website svn repo at https://svn.apache.org/repos/asf/mesos/site to a 
git repo.

Having a git repo for both the main project and the website lets us deal with 
a single version control system. Git-based projects are also easy to automate 
via CI (e.g., git commit) because ASF CI already has the required credentials.






[jira] [Created] (MESOS-7623) Automatically publish website through CI

2017-06-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7623:
-

 Summary: Automatically publish website through CI
 Key: MESOS-7623
 URL: https://issues.apache.org/jira/browse/MESOS-7623
 Project: Mesos
  Issue Type: Epic
Reporter: Vinod Kone
Assignee: Vinod Kone


Currently, publishing the website is a manual process: a committer runs a 
local docker script, copies the generated `publish` folder into an svn working 
copy, and runs `svn commit`. This is both cumbersome and error-prone.

We should automate this process by running it as a CI job.





[jira] [Assigned] (MESOS-1309) Automate updating the website for a new release

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-1309:
-

Assignee: (was: Vinod Kone)

> Automate updating the website for a new release
> ---
>
> Key: MESOS-1309
> URL: https://issues.apache.org/jira/browse/MESOS-1309
> Project: Mesos
>  Issue Type: Improvement
>  Components: project website
>Reporter: Vinod Kone
>Priority: Minor
>
> This could be a script that lives in our website repo that 
> 1) updates the links on the homepage and downloads page so that they point 
> to the latest release
> 2) deletes old releases from dist.a.o per MESOS-850
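
A minimal sketch of the first step, assuming the release links embed the version number literally (for example in a path segment and a tarball name). The function name and the URL used in the usage note are hypothetical; the ticket does not describe how the links are stored.

```python
import re


def update_release_links(page, old_version, new_version):
    """Replace every standalone occurrence of old_version with new_version.

    The lookarounds keep "1.2.1" from matching inside "1.2.10" or "11.2.1".
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(old_version) + r"(?!\.?\d)")
    return pattern.sub(lambda match: new_version, page)
```

For instance, applying it with `old_version="1.2.1"` and `new_version="1.3.0"` to a link such as `https://example.invalid/mesos/1.2.1/mesos-1.2.1.tar.gz` rewrites both occurrences and leaves an unrelated `1.2.10` untouched.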





[jira] [Assigned] (MESOS-1309) Automate updating the website for a new release

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-1309:
-

Assignee: Vinod Kone

> Automate updating the website for a new release
> ---
>
> Key: MESOS-1309
> URL: https://issues.apache.org/jira/browse/MESOS-1309
> Project: Mesos
>  Issue Type: Improvement
>  Components: project website
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>Priority: Minor
>
> This could be a script that lives in our website repo that 
> 1) updates the links on the homepage and downloads page so that they point 
> to the latest release
> 2) deletes old releases from dist.a.o per MESOS-850





[jira] [Commented] (MESOS-7621) Fetcher does not handle content length and redirects

2017-06-05 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037677#comment-16037677
 ] 

Charles Allen commented on MESOS-7621:
--

From some basic digging, it looks like 
https://github.com/apache/mesos/blob/1.2.0/3rdparty/stout/include/stout/net.hpp#L101
 makes a separate request to the same location to get the content length.


I do see the following in the logs, even though the files download successfully:

{code}
[1B blob data]
HTTP/1.1 403 Forbidden
x-amz-request-id: REQUEST_ID_REDACTED
x-amz-id-2: ID_REDACTED=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Mon, 05 Jun 2017 18:25:45 GMT
Server: AmazonS3
{code}
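
The 403 response logged above carries `Transfer-Encoding: chunked` and no `Content-Length` header. If the size probe reads a response like this one, that would be consistent with the "No URL content-length available" error, although this is only a guess from the logs. A minimal sketch of extracting the status and content length from a raw header block; `content_length` is a hypothetical helper, not the stout API:

```python
def content_length(raw_headers):
    """Return (status_code, content_length or None) for a raw HTTP header block."""
    lines = [line.strip() for line in raw_headers.strip().splitlines() if line.strip()]
    # The status line looks like "HTTP/1.1 403 Forbidden".
    status = int(lines[0].split()[1])
    length = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            length = int(value.strip())
    return status, length
```

Applied to the 403 response above it yields no length, whereas the `200 OK` response in the curl trace of the issue description yields 208245664.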

> Fetcher does not handle content length and redirects
> 
>
> Key: MESOS-7621
> URL: https://issues.apache.org/jira/browse/MESOS-7621
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2.0
>Reporter: Charles Allen
>
> {code}
> $ curl -L -v -O -s http://HOSTNAME_REDACTED/PATH_REDACTED.tar.gz
> *   Trying 172.17.4.10...
> * Connected to HOSTNAME_REDACTED (172.17.4.10) port 80 (#0)
> > GET /PATH_REDACTED.tar.gz HTTP/1.1
> > Host: HOSTNAME_REDACTED
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 302 FOUND
> < Server: nginx/1.4.6 (Ubuntu)
> < Date: Mon, 05 Jun 2017 17:58:04 GMT
> < Content-Type: text/html; charset=utf-8
> < Content-Length: 1947
> < Connection: keep-alive
> < Location: 
> https://BUCKET_REDACTED.s3.amazonaws.com:443/PATH_REDACTED?Signature=REDACTED%3D&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D
> <
> * Ignoring the response-body
> { [309 bytes data]
> * Connection #0 to host HOSTNAME_REDACTED left intact
> * Issue another request to this URL: 
> 'https://BUCKET_REDACTED.s3.amazonaws.com:443/PATH_REDACTED.tar.gz?Signature=SIGNATURE_REDACTED%3D&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D'
> *   Trying 54.231.40.75...
> * Connected to BUCKET_REDACTED.s3.amazonaws.com (54.231.40.75) port 443 (#1)
> * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
> * Server certificate: *.s3.amazonaws.com
> * Server certificate: DigiCert Baltimore CA-2 G2
> * Server certificate: Baltimore CyberTrust Root
> > GET 
> > /PATH_REDACTED.tar.gz?Signature=REDACTED&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D
> >  HTTP/1.1
> > Host: BUCKET_REDACTED.s3.amazonaws.com
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 200 OK
> < x-amz-id-2: ID_REDACTED=
> < x-amz-request-id: REQUEST_ID_REDACTED
> < Date: Mon, 05 Jun 2017 17:58:07 GMT
> < Last-Modified: Thu, 01 Jun 2017 03:04:49 GMT
> < ETag: "ETAG_REDACTED"
> < Accept-Ranges: bytes
> < Content-Type: application/x-tar
> < Content-Length: 208245664
> < Server: AmazonS3
> <
> { [16360 bytes data]
> {code}
> We have a micro-service that signs temporary URLs for services that can't 
> speak natively with S3. The above is an example download using {{curl}}. But 
> when using the Mesos fetcher, the agent logs contain the following information:
> {code}
> fetcher.cpp:479] Reverting to fetching directly into the sandbox for 
> 'http://HOST_REDACTED/PATH_REDACTED.tar.gz', due to failure to fetch through 
> the cache, with error: Could not determine size of cache file for 
> 'USER_REDACTED@http://HOST_REDACTED/PATH_REDACTED.tar.gz' with error: No URL 
> content-length available
> {code}
> Any idea why this error would occur?





[jira] [Updated] (MESOS-7621) Fetcher does not handle content length and redirects

2017-06-05 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated MESOS-7621:
-
Summary: Fetcher does not handle content length and redirects  (was: 
Fetcher does not handle content length in redirects)

> Fetcher does not handle content length and redirects
> 
>
> Key: MESOS-7621
> URL: https://issues.apache.org/jira/browse/MESOS-7621
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2.0
>Reporter: Charles Allen
>
> {code}
> $ curl -L -v -O -s http://HOSTNAME_REDACTED/PATH_REDACTED.tar.gz
> *   Trying 172.17.4.10...
> * Connected to HOSTNAME_REDACTED (172.17.4.10) port 80 (#0)
> > GET /PATH_REDACTED.tar.gz HTTP/1.1
> > Host: HOSTNAME_REDACTED
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 302 FOUND
> < Server: nginx/1.4.6 (Ubuntu)
> < Date: Mon, 05 Jun 2017 17:58:04 GMT
> < Content-Type: text/html; charset=utf-8
> < Content-Length: 1947
> < Connection: keep-alive
> < Location: 
> https://BUCKET_REDACTED.s3.amazonaws.com:443/PATH_REDACTED?Signature=REDACTED%3D&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D
> <
> * Ignoring the response-body
> { [309 bytes data]
> * Connection #0 to host HOSTNAME_REDACTED left intact
> * Issue another request to this URL: 
> 'https://BUCKET_REDACTED.s3.amazonaws.com:443/PATH_REDACTED.tar.gz?Signature=SIGNATURE_REDACTED%3D&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D'
> *   Trying 54.231.40.75...
> * Connected to BUCKET_REDACTED.s3.amazonaws.com (54.231.40.75) port 443 (#1)
> * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
> * Server certificate: *.s3.amazonaws.com
> * Server certificate: DigiCert Baltimore CA-2 G2
> * Server certificate: Baltimore CyberTrust Root
> > GET 
> > /PATH_REDACTED.tar.gz?Signature=REDACTED&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D
> >  HTTP/1.1
> > Host: BUCKET_REDACTED.s3.amazonaws.com
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 200 OK
> < x-amz-id-2: ID_REDACTED=
> < x-amz-request-id: REQUEST_ID_REDACTED
> < Date: Mon, 05 Jun 2017 17:58:07 GMT
> < Last-Modified: Thu, 01 Jun 2017 03:04:49 GMT
> < ETag: "ETAG_REDACTED"
> < Accept-Ranges: bytes
> < Content-Type: application/x-tar
> < Content-Length: 208245664
> < Server: AmazonS3
> <
> { [16360 bytes data]
> {code}
> We have a micro-service that signs temporary URLs for services that can't 
> speak natively with S3. The above is an example download using {{curl}}. But 
> when using the Mesos fetcher, the agent logs contain the following information:
> {code}
> fetcher.cpp:479] Reverting to fetching directly into the sandbox for 
> 'http://HOST_REDACTED/PATH_REDACTED.tar.gz', due to failure to fetch through 
> the cache, with error: Could not determine size of cache file for 
> 'USER_REDACTED@http://HOST_REDACTED/PATH_REDACTED.tar.gz' with error: No URL 
> content-length available
> {code}
> Any idea why this error would occur?





[jira] [Updated] (MESOS-7572) Attach latest symlink when executor is registered.

2017-06-05 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-7572:
--
Description: This will assist framework developers in making features that 
need to access the latest sandbox when hitting various operator API endpoints.  
(was: This will assist framework developers in making features that need to 
access the latest sandbox when hitting various operator API endpoints.

https://reviews.apache.org/r/59641/)

> Attach latest symlink when executor is registered.
> --
>
> Key: MESOS-7572
> URL: https://issues.apache.org/jira/browse/MESOS-7572
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, HTTP API, master
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>
> This will assist framework developers in making features that need to access 
> the latest sandbox when hitting various operator API endpoints.





[jira] [Commented] (MESOS-7572) Attach latest symlink when executor is registered.

2017-06-05 Thread Aaron Wood (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037664#comment-16037664
 ] 

Aaron Wood commented on MESOS-7572:
---

https://reviews.apache.org/r/59641/

> Attach latest symlink when executor is registered.
> --
>
> Key: MESOS-7572
> URL: https://issues.apache.org/jira/browse/MESOS-7572
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, HTTP API, master
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>
> This will assist framework developers in making features that need to access 
> the latest sandbox when hitting various operator API endpoints.





[jira] [Updated] (MESOS-7572) Attach latest symlink when executor is registered.

2017-06-05 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-7572:
--
Description: 
This will assist framework developers in making features that need to access 
the latest sandbox when hitting various operator API endpoints.

https://reviews.apache.org/r/59641/

  was:
The main benefit of following symlinks in endpoints such as {code}/files{code} 
is that frameworks will be able to construct a path to the sandbox much easier. 
This will assist framework developers in making features that need to provide a 
path when hitting various operator API endpoints. Currently, making use of a 
path ending in {code}runs/latest{code} throws a 404.

One such application could be a scheduler providing the ability for users to 
work with their task's sandbox directly without going to the Mesos UI, API 
endpoints, or the actual system themselves.

https://reviews.apache.org/r/59641/


> Attach latest symlink when executor is registered.
> --
>
> Key: MESOS-7572
> URL: https://issues.apache.org/jira/browse/MESOS-7572
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, HTTP API, master
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>
> This will assist framework developers in making features that need to access 
> the latest sandbox when hitting various operator API endpoints.
> https://reviews.apache.org/r/59641/





[jira] [Updated] (MESOS-7572) Attach latest symlink when executor is registered.

2017-06-05 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-7572:
--
Summary: Attach latest symlink when executor is registered.  (was: Follow 
symlinks when resolving paths in the various master/agent endpoints)

> Attach latest symlink when executor is registered.
> --
>
> Key: MESOS-7572
> URL: https://issues.apache.org/jira/browse/MESOS-7572
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, HTTP API, master
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>
> The main benefit of following symlinks in endpoints such as 
> {code}/files{code} is that frameworks will be able to construct a path to the 
> sandbox much more easily. This will help framework developers build features 
> that need to provide a path when hitting various operator API endpoints. 
> Currently, requesting a path ending in {code}runs/latest{code} returns a 
> 404.
> One such application could be a scheduler that lets users work with their 
> task's sandbox directly, without going to the Mesos UI, the API endpoints, 
> or the underlying system themselves.
> https://reviews.apache.org/r/59641/
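
A minimal sketch of what resolving a `runs/latest` path involves, using only the Python standard library. The directory layout and the run ID `run-0001` are illustrative assumptions; this is not how the Mesos `/files` endpoint is implemented.

```python
import os
import tempfile


def resolve_sandbox(path):
    """Resolve symlinks such as `runs/latest` to the directory they point at."""
    return os.path.realpath(path)


def make_demo_layout(root):
    """Create runs/run-0001 plus a runs/latest symlink under root (demo only)."""
    run_dir = os.path.join(root, "runs", "run-0001")
    os.makedirs(run_dir)
    latest = os.path.join(root, "runs", "latest")
    # Relative link, so it stays valid if the tree is moved.
    os.symlink("run-0001", latest)
    return run_dir, latest
```

With this layout, a request for `runs/latest/stdout` resolves to `runs/run-0001/stdout`, which is the lookup a path ending in `runs/latest` needs in order to succeed.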





[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-06-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037599#comment-16037599
 ] 

Benjamin Mahler commented on MESOS-7566:


[~xujyan] can you file a ticket for the race you described? It isn't the issue 
in this ticket AFAICT, but we should capture it and fix it as well.

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale value of {{26}} for 
> revocable CPU while the new value had been updated to 25, so the check failed.
> So far, both verified occurrences of this bug were observed near an 
> {{UNRESERVE}} operation (see lines above in the log).





[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-06-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037576#comment-16037576
 ] 

Benjamin Mahler commented on MESOS-7566:


For posterity, line 773 in [~zhitao]'s version corresponds to:
https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/mesos/hierarchical.cpp#L749

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale value of {{26}} for 
> revocable CPU while the new value had been updated to 25, so the check failed.
> So far, both verified occurrences of this bug were observed near an 
> {{UNRESERVE}} operation (see lines above in the log).





[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-06-05 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037565#comment-16037565
 ] 

Zhitao Li commented on MESOS-7566:
--

Here is the stack trace, as reported by gdb, of a similar but possibly not identical crash:

https://gist.github.com/zhitaoli/180f7aa3c619dab44db19af92fd7d3a1

This is a slightly patched version of `1.1.2`.

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale value of {{26}} for 
> revocable CPU while the new value had been updated to 25, so the check failed.
> So far, both verified occurrences of this bug were observed near an 
> {{UNRESERVE}} operation (see lines above in the log).





[jira] [Comment Edited] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-06-05 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037565#comment-16037565
 ] 

Zhitao Li edited comment on MESOS-7566 at 6/5/17 8:50 PM:
--

Here is the stack trace, as reported by gdb, of a similar but possibly not identical crash:

https://gist.github.com/zhitaoli/180f7aa3c619dab44db19af92fd7d3a1

This is a slightly patched version of {{1.1.2}}.


was (Author: zhitao):
A similar but maybe not identical crash's stack trace reported by gdb:

https://gist.github.com/zhitaoli/180f7aa3c619dab44db19af92fd7d3a1

This is a slightly patched version of `1.1.2`.

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale value of {{26}} for 
> revocable CPU while the new value had been updated to 25, so the check failed.
> So far, both verified occurrences of this bug were observed near an 
> {{UNRESERVE}} operation (see lines above in the log).





[jira] [Comment Edited] (MESOS-7423) Add s390x builds to Mesos CI

2017-06-05 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037444#comment-16037444
 ] 

Vinod Kone edited comment on MESOS-7423 at 6/5/17 7:34 PM:
---

[~Nayana] Can you get `pip` and `docker` installed on these VMs? They're needed 
for our CI jobs.


was (Author: vinodkone):
[~Nayana] Can you get `pip` installed on these VMs? It's needed for our CI jobs 
(in addition to `docker` which I'm assuming is already installed)

> Add s390x builds to Mesos CI
> 
>
> Key: MESOS-7423
> URL: https://issues.apache.org/jira/browse/MESOS-7423
> Project: Mesos
>  Issue Type: Task
>Reporter: Nayana Thorat
>
> Hi Vinod,
> We raised an issue to add s390x support for Mesos, which has been fixed and 
> resolved.
> https://issues.apache.org/jira/browse/MESOS-6742
> We would also like to know about Mesos CI. 
> We need the following details about the current Mesos CI:
> 1. What is the current Mesos CI infrastructure? Travis/Jenkins?
> 2. Can Mesos CI be extended to support s390x systems?
> We are not sure if this is the right channel to discuss this topic. 
> Please let us know if you want to start this discussion on some other channel.
> Thanks,





[jira] [Commented] (MESOS-7423) Add s390x builds to Mesos CI

2017-06-05 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037444#comment-16037444
 ] 

Vinod Kone commented on MESOS-7423:
---

[~Nayana] Can you get `pip` installed on these VMs? It's needed for our CI jobs 
(in addition to `docker` which I'm assuming is already installed)

> Add s390x builds to Mesos CI
> 
>
> Key: MESOS-7423
> URL: https://issues.apache.org/jira/browse/MESOS-7423
> Project: Mesos
>  Issue Type: Task
>Reporter: Nayana Thorat
>
> Hi Vinod,
> We raised an issue to add s390x support for Mesos, which has been fixed and 
> resolved.
> https://issues.apache.org/jira/browse/MESOS-6742
> We would also like to know about Mesos CI. 
> We need the following details about the current Mesos CI:
> 1. What is the current Mesos CI infrastructure? Travis/Jenkins?
> 2. Can Mesos CI be extended to support s390x systems?
> We are not sure if this is the right channel to discuss this topic. 
> Please let us know if you want to start this discussion on some other channel.
> Thanks,





[jira] [Updated] (MESOS-7423) Add s390x builds to Mesos CI

2017-06-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7423:
--
Summary: Add s390x builds to Mesos CI  (was: Information on Mesos CI)

I'm updating the title of this ticket to capture the work needed to enable 
s390x builds in CI.

> Add s390x builds to Mesos CI
> 
>
> Key: MESOS-7423
> URL: https://issues.apache.org/jira/browse/MESOS-7423
> Project: Mesos
>  Issue Type: Task
>Reporter: Nayana Thorat
>
> Hi Vinod,
> We raised an issue to add s390x support for Mesos, which has been fixed and 
> resolved.
> https://issues.apache.org/jira/browse/MESOS-6742
> We would also like to know about Mesos CI. 
> We need the following details about the current Mesos CI:
> 1. What is the current Mesos CI infrastructure? Travis/Jenkins?
> 2. Can Mesos CI be extended to support s390x systems?
> We are not sure if this is the right channel to discuss this topic. 
> Please let us know if you want to start this discussion on some other channel.
> Thanks,





[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2017-06-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7622:
--
Target Version/s: 1.2.2, 1.3.1

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Aaron Wood
>Assignee: Anand Mazumdar
>Priority: Blocker
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> --runtime_dir="/var/r

[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2017-06-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7622:
--
Description: 
It is possible that a running executor might retry its subscribe request. This 
can lead to a crash if it previously had any launched tasks. Note that the 
executor would still be able to subscribe again when the agent process restarts 
and is recovering.
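
The scenario described above can be modeled with a small state machine. This is only an illustrative sketch, not the agent's actual code: `ExecutorState`, `Executor`, and `handle_subscribe` are hypothetical names, and the assumption is that a subscribe request from an executor that is already running should be treated as a retry that leaves its launched tasks untouched, rather than as a violated invariant.

```python
from enum import Enum, auto


class ExecutorState(Enum):
    REGISTERING = auto()
    RUNNING = auto()


class Executor:
    """Hypothetical stand-in for the agent's bookkeeping of one executor."""

    def __init__(self):
        self.state = ExecutorState.REGISTERING
        self.launched_tasks = set()


def handle_subscribe(executor):
    """Handle a subscribe request without assuming it is the first one."""
    if executor.state is ExecutorState.REGISTERING:
        # First subscription: move the executor to RUNNING.
        executor.state = ExecutorState.RUNNING
        return "subscribed"
    # Already RUNNING: treat the request as a retry and keep existing tasks.
    return "resubscribed"


executor = Executor()
print(handle_subscribe(executor))  # subscribed
executor.launched_tasks.add("task-1")
print(handle_subscribe(executor))  # resubscribed
```

The second call is the interesting one: the executor already has a launched task, and the handler returns normally instead of failing.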
{code}
sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
--isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
 --image_providers=docker --image_provisioner_backend=overlay 
--containerizers=mesos --launcher_dir=$(pwd) 
--executor_environment_variables='{"LD_LIBRARY_PATH": 
"/home/aaron/Code/src/mesos/build/src/.libs"}'
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
aaron
I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
`mesos_executors.slice`
I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
failed; this is the output:
sh: 1: hadoop: not found
I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
execute 'hadoop version 2>&1'; the command was either not found or exited with 
a non-zero exit status: 127
I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 'overlay'
I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
(1)@127.0.1.1:5051
I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
--docker="docker" --docker_kill_orphans="true" 
--docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" 
--executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
 --executor_registration_timeout="1mins" 
--executor_reregistration_timeout="2secs" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname_lookup="true" --http_command_executor="false" 
--http_heartbeat_interval="30secs" --image_providers="docker" 
--image_provisioner_backend="overlay" --initialize_driver_logging="true" 
--isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
 --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
--logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
--max_completed_executors_per_framework="150" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" 
--strict="true" --switch_user="true" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/tmp/slave"
I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6; 
mem(*):6956; disk(*):41113; ports(*):[31000-32000]
I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [  ]
I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64
I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending 
status updates
I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from 
'/tmp/slave/

[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2017-06-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7622:
--
Description: 
It is possible that a running executor might retry its subscribe request. This 
can lead to a crash if it previously had any launched tasks.
{code}
sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
--isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
 --image_providers=docker --image_provisioner_backend=overlay 
--containerizers=mesos --launcher_dir=$(pwd) 
--executor_environment_variables='{"LD_LIBRARY_PATH": 
"/home/aaron/Code/src/mesos/build/src/.libs"}'
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
aaron
I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
`mesos_executors.slice`
I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
failed; this is the output:
sh: 1: hadoop: not found
I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
execute 'hadoop version 2>&1'; the command was either not found or exited with 
a non-zero exit status: 127
I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 'overlay'
I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
(1)@127.0.1.1:5051
I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
--docker="docker" --docker_kill_orphans="true" 
--docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" 
--executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
 --executor_registration_timeout="1mins" 
--executor_reregistration_timeout="2secs" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname_lookup="true" --http_command_executor="false" 
--http_heartbeat_interval="30secs" --image_providers="docker" 
--image_provisioner_backend="overlay" --initialize_driver_logging="true" 
--isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
 --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
--logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
--max_completed_executors_per_framework="150" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" 
--strict="true" --switch_user="true" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/tmp/slave"
I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6; 
mem(*):6956; disk(*):41113; ports(*):[31000-32000]
I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [  ]
I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64
I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending 
status updates
I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from 
'/tmp/slave/meta'
I0605 14:58:23.787698 10713 status_update_manager.cpp:203] Recovering status 
update manager
I0605 14:58:23.7

[jira] [Created] (MESOS-7622) Agent crashes if the default executor launches a custom executor which then tries to subscribe

2017-06-05 Thread Aaron Wood (JIRA)
Aaron Wood created MESOS-7622:
-

 Summary: Agent crashes if the default executor launches a custom 
executor which then tries to subscribe
 Key: MESOS-7622
 URL: https://issues.apache.org/jira/browse/MESOS-7622
 Project: Mesos
  Issue Type: Bug
  Components: agent, executor
Reporter: Aaron Wood
Assignee: Anand Mazumdar
Priority: Blocker


{code}
sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
--isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
 --image_providers=docker --image_provisioner_backend=overlay 
--containerizers=mesos --launcher_dir=$(pwd) 
--executor_environment_variables='{"LD_LIBRARY_PATH": 
"/home/aaron/Code/src/mesos/build/src/.libs"}'
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
aaron
I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
`mesos_executors.slice`
I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
failed; this is the output:
sh: 1: hadoop: not found
I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
execute 'hadoop version 2>&1'; the command was either not found or exited with 
a non-zero exit status: 127
I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 'overlay'
I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
(1)@127.0.1.1:5051
I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
--docker="docker" --docker_kill_orphans="true" 
--docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" 
--executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
 --executor_registration_timeout="1mins" 
--executor_reregistration_timeout="2secs" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname_lookup="true" --http_command_executor="false" 
--http_heartbeat_interval="30secs" --image_providers="docker" 
--image_provisioner_backend="overlay" --initialize_driver_logging="true" 
--isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
 --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
--logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
--max_completed_executors_per_framework="150" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" 
--strict="true" --switch_user="true" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/tmp/slave"
I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6; 
mem(*):6956; disk(*):41113; ports(*):[31000-32000]
I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [  ]
I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64
I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending 
status updates
I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from 
'/

[jira] [Created] (MESOS-7621) Fetcher does not handle content length in redirects

2017-06-05 Thread Charles Allen (JIRA)
Charles Allen created MESOS-7621:


 Summary: Fetcher does not handle content length in redirects
 Key: MESOS-7621
 URL: https://issues.apache.org/jira/browse/MESOS-7621
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2.0
Reporter: Charles Allen


{code}
$ curl -L -v -O -s http://HOSTNAME_REDACTED/PATH_REDACTED.tar.gz
*   Trying 172.17.4.10...
* Connected to HOSTNAME_REDACTED (172.17.4.10) port 80 (#0)
> GET /PATH_REDACTED.tar.gz HTTP/1.1
> Host: HOSTNAME_REDACTED
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 302 FOUND
< Server: nginx/1.4.6 (Ubuntu)
< Date: Mon, 05 Jun 2017 17:58:04 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 1947
< Connection: keep-alive
< Location: 
https://BUCKET_REDACTED.s3.amazonaws.com:443/PATH_REDACTED?Signature=REDACTED%3D&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D
<
* Ignoring the response-body
{ [309 bytes data]
* Connection #0 to host HOSTNAME_REDACTED left intact
* Issue another request to this URL: 
'https://BUCKET_REDACTED.s3.amazonaws.com:443/PATH_REDACTED.tar.gz?Signature=SIGNATURE_REDACTED%3D&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D'
*   Trying 54.231.40.75...
* Connected to BUCKET_REDACTED.s3.amazonaws.com (54.231.40.75) port 443 (#1)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.s3.amazonaws.com
* Server certificate: DigiCert Baltimore CA-2 G2
* Server certificate: Baltimore CyberTrust Root
> GET 
> /PATH_REDACTED.tar.gz?Signature=REDACTED&Expires=1496689084&AWSAccessKeyId=KEY_REDACTED&x-amz-security-token=TOKEN_REDACTED%3D
>  HTTP/1.1
> Host: BUCKET_REDACTED.s3.amazonaws.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< x-amz-id-2: ID_REDACTED=
< x-amz-request-id: REQUEST_ID_REDACTED
< Date: Mon, 05 Jun 2017 17:58:07 GMT
< Last-Modified: Thu, 01 Jun 2017 03:04:49 GMT
< ETag: "ETAG_REDACTED"
< Accept-Ranges: bytes
< Content-Type: application/x-tar
< Content-Length: 208245664
< Server: AmazonS3
<
{ [16360 bytes data]
{code}

We have a microservice that signs temporary URLs for services that can't 
speak natively with S3. The above is an example download using {{curl}}. 
When using the Mesos fetcher, however, the agent logs contain the following:

{code}
fetcher.cpp:479] Reverting to fetching directly into the sandbox for 
'http://HOST_REDACTED/PATH_REDACTED.tar.gz', due to failure to fetch through 
the cache, with error: Could not determine size of cache file for 
'USER_REDACTED@http://HOST_REDACTED/PATH_REDACTED.tar.gz' with error: No URL 
content-length available
{code}

Any idea why this error would occur?
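
For reference, the curl trace above shows two different Content-Length values: 1947 on the 302 response and 208245664 on the final 200 response. The sketch below shows one way to pick the size from the end of a redirect chain. It is a minimal illustration and not the Mesos fetcher's implementation; `final_content_length` and the `(status, headers)` tuple representation are hypothetical.

```python
from typing import Mapping, Optional, Sequence, Tuple

REDIRECT_STATUSES = {301, 302, 303, 307, 308}

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def final_content_length(chain: Sequence[Response]) -> Optional[int]:
    """Return the Content-Length of the first non-redirect response, if any."""
    for status, headers in chain:
        if status in REDIRECT_STATUSES:
            # The body of a redirect is not the file being fetched.
            continue
        # HTTP header names are case-insensitive.
        for name, value in headers.items():
            if name.lower() == "content-length":
                return int(value)
        return None
    return None


# Status codes and sizes taken from the curl trace; the URL is a placeholder.
chain = [
    (302, {"Content-Length": "1947", "Location": "https://example.invalid/object"}),
    (200, {"Content-Length": "208245664"}),
]
print(final_content_length(chain))  # 208245664
```

A chain that ends on a redirect, or whose final response has no Content-Length header, yields `None`, which corresponds to the "No URL content-length available" case in the log message.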





[jira] [Updated] (MESOS-7619) Framework Upgrade Resulting in Jan 1, 1970 Date

2017-06-05 Thread Andy Cook (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Cook updated MESOS-7619:
-
Attachment: state.json

> Framework Upgrade Resulting in Jan 1, 1970 Date
> ---
>
> Key: MESOS-7619
> URL: https://issues.apache.org/jira/browse/MESOS-7619
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ken Sipe
> Attachments: Pasted image at 2017_05_31 09_30 AM.png, state.json
>
>
> In the process of upgrading Apache Mesos and Marathon (in HA mode), Marathon 
> ended up with a new framework ID, and the older framework ID is listed as 
> being from Jan 1, 1970 (47 years ago).
> The issue with Marathon getting a new framework ID is understood and was 
> worked out with Mesosphere's Marathon team. Most of the detail is in the 
> #marathon channel of the Apache Mesos Slack.





[jira] [Updated] (MESOS-7619) Framework Upgrade Resulting in Jan 1, 1970 Date

2017-06-05 Thread Andy Cook (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Cook updated MESOS-7619:
-
Attachment: (was: state.json)

> Framework Upgrade Resulting in Jan 1, 1970 Date
> ---
>
> Key: MESOS-7619
> URL: https://issues.apache.org/jira/browse/MESOS-7619
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ken Sipe
> Attachments: Pasted image at 2017_05_31 09_30 AM.png, state.json
>
>
> In the process of upgrading Apache Mesos and Marathon (in HA mode), Marathon 
> ended up with a new framework ID, and the older framework ID is listed as 
> being from Jan 1, 1970 (47 years ago).
> The issue with Marathon getting a new framework ID is understood and was 
> worked out with Mesosphere's Marathon team. Most of the detail is in the 
> #marathon channel of the Apache Mesos Slack.





[jira] [Commented] (MESOS-7619) Framework Upgrade Resulting in Jan 1, 1970 Date

2017-06-05 Thread Andy Cook (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037280#comment-16037280
 ] 

Andy Cook commented on MESOS-7619:
--

Hello,

Please find the {{state.json}} info attached.  I've removed all of the info 
about running and completed tasks.  The old framework (with all of the running 
tasks) is {{80f08ece-91c7-43cb-bae7-6a2b41e25ec0-0001}}.  You'll see in the 
screenshot that Mesos believes it was registered 47 years ago.  Similarly, the 
state.json shows a registered_time of {{0}}.

{noformat}
  "failover_timeout":604800.0,
  "checkpoint":true,
  "registered_time":0.0,
  "unregistered_time":0.0,
{noformat}
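
The "47 years ago" display is consistent with a {{registered_time}} of 0 being read as a timestamp. A quick check, assuming the field holds seconds since the Unix epoch (an assumption on my part; the excerpt does not state the unit):

```python
from datetime import datetime, timezone

registered_time = 0.0  # value from the state.json excerpt above

# Assumption: the field is seconds since the Unix epoch, interpreted in UTC.
registered_at = datetime.fromtimestamp(registered_time, tz=timezone.utc)
reported_on = datetime(2017, 6, 5, tzinfo=timezone.utc)  # date of this thread

print(registered_at.isoformat())  # 1970-01-01T00:00:00+00:00
print(reported_on.year - registered_at.year)  # 47
```

Under that assumption, the date shown in the UI is simply what an unset (zero) timestamp looks like rather than a real registration time.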

> Framework Upgrade Resulting in Jan 1, 1970 Date
> ---
>
> Key: MESOS-7619
> URL: https://issues.apache.org/jira/browse/MESOS-7619
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ken Sipe
> Attachments: Pasted image at 2017_05_31 09_30 AM.png, state.json
>
>
> In the process of upgrading Apache Mesos and Marathon (in HA mode), Marathon 
> ended up with a new framework ID, and the older framework ID is listed as 
> being from Jan 1, 1970 (47 years ago).
> The issue with Marathon getting a new framework ID is understood and was 
> worked out with Mesosphere's Marathon team. Most of the detail is in the 
> #marathon channel of the Apache Mesos Slack.





[jira] [Updated] (MESOS-7619) Framework Upgrade Resulting in Jan 1, 1970 Date

2017-06-05 Thread Andy Cook (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Cook updated MESOS-7619:
-
Attachment: state.json

> Framework Upgrade Resulting in Jan 1, 1970 Date
> ---
>
> Key: MESOS-7619
> URL: https://issues.apache.org/jira/browse/MESOS-7619
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ken Sipe
> Attachments: Pasted image at 2017_05_31 09_30 AM.png, state.json
>
>
> In the process of upgrading Apache Mesos and Marathon (in HA mode), Marathon 
> ended up with a new framework ID, and the older framework ID is listed as 
> being from Jan 1, 1970 (47 years ago).
> The issue with Marathon getting a new framework ID is understood and was 
> worked out with Mesosphere's Marathon team. Most of the detail is in the 
> #marathon channel of the Apache Mesos Slack.





[jira] [Created] (MESOS-7620) GET_VOLUMES call referenced in API docs, but the call doesn't exist

2017-06-05 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7620:
-

 Summary: GET_VOLUMES call referenced in API docs, but the call 
doesn't exist
 Key: MESOS-7620
 URL: https://issues.apache.org/jira/browse/MESOS-7620
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


https://github.com/apache/mesos/blob/d624255394b864ed477838e32f9712d7e63fc86f/include/mesos/v1/master/master.proto#L150

{code}
  // Create persistent volumes on reserved resources. The request is forwarded
  // asynchronously to the Mesos agent where the reserved resources are located.
  // That asynchronous message may not be delivered or creating the volumes at
  // the agent might fail. Volume creation can be verified by sending a
  // `GET_VOLUMES` call.
{code}

It's either a documentation bug, or a missing/overlooked feature.

/cc [~vinodkone] [~jieyu]





[jira] [Updated] (MESOS-7619) Framework Upgrade Resulting in Jan 1, 1970 Date

2017-06-05 Thread Ken Sipe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Sipe updated MESOS-7619:

Attachment: Pasted image at 2017_05_31 09_30 AM.png

Screenshot attached.

> Framework Upgrade Resulting in Jan 1, 1970 Date
> ---
>
> Key: MESOS-7619
> URL: https://issues.apache.org/jira/browse/MESOS-7619
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ken Sipe
> Attachments: Pasted image at 2017_05_31 09_30 AM.png
>
>
> In the process of upgrading Apache Mesos and Marathon (in HA mode), Marathon 
> ended up with a new framework ID, and the older framework ID is listed as 
> being from Jan 1, 1970 (47 years ago).
> The issue with Marathon getting a new framework ID is understood and was 
> worked out with Mesosphere's Marathon team. Most of the detail is in the 
> #marathon channel of the Apache Mesos Slack.





[jira] [Created] (MESOS-7619) Framework Upgrade Resulting in Jan 1, 1970 Date

2017-06-05 Thread Ken Sipe (JIRA)
Ken Sipe created MESOS-7619:
---

 Summary: Framework Upgrade Resulting in Jan 1, 1970 Date
 Key: MESOS-7619
 URL: https://issues.apache.org/jira/browse/MESOS-7619
 Project: Mesos
  Issue Type: Bug
Reporter: Ken Sipe


In the process of upgrading Apache Mesos and Marathon (in HA mode), Marathon 
ended up with a new framework ID, and the older framework ID is listed as being 
from Jan 1, 1970 (47 years ago).

The issue with Marathon getting a new framework ID is understood and was worked 
out with Mesosphere's Marathon team. Most of the detail is in the #marathon 
channel of the Apache Mesos Slack.



