[jira] [Commented] (SLIDER-1259) Slider does not work in multi homed environments

2018-03-14 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399013#comment-16399013
 ] 

Billie Rinaldi commented on SLIDER-1259:


Looks fine to me as well.

> Slider does not work in multi homed environments
> 
>
> Key: SLIDER-1259
> URL: https://issues.apache.org/jira/browse/SLIDER-1259
> Project: Slider
> Issue Type: Bug
> Components: appmaster
> Affects Versions: Slider 0.92
> Reporter: Lev Bronshtein
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: SLIDER-1259-001.patch
>
>
> In an environment where Hadoop worker nodes bind the NodeManager to an 
> interface whose hostname differs from the one returned by socket.getfqdn(), 
> we are unable to introspect running jobs. In our test environment, for 
> example, the two names are f-bcpc-vm3 and bcpc-vm3; the latter is the hostname 
> bound to the management interface, not the interface for hadoop/production 
> traffic.
>  
> For example, running *slider registry --name slider_poc --listexp* results in 
> the following output in the ResourceManager logs:
> {quote}2018-01-26 17:30:32,147 INFO 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: ubuntu is 
> accessing unchecked 
> [http://bcpc-vm3.bcpc.example.com:46391/ws/v1/slider/publisher/exports] which 
> is the app master GUI of application_1516910361403_0094 owned by ubuntu 
>  2018-01-26 17:31:13,639 WARN org.mortbay.log: 
> /proxy/application_1516910361403_0094/ws/v1/slider/publisher/exports: 
> java.net.ConnectException: Connection timed out (Connection timed out) 
> {quote}
>  
> Note how the redirect is to 
> [http://bcpc-vm3.bcpc.example.com:46391/ws/v1/slider/publisher/exports], 
> whereas it should have been to 
> [http://f-bcpc-vm3.bcpc.example.com:46391/ws/v1/slider/publisher/exports]. 
> Renaming the host to f-bcpc-vm3 results in the appropriate behavior.
>  
> Perhaps *hostname.py* can be instructed to look at one of the following 
> properties before registering:
>  *yarn.nodemanager.address*
>  *yarn.nodemanager.bind-host*
>  *yarn.nodemanager.hostname*
>  
> It is called in Register.py as follows:
> register = {'responseId': int(id),
>   'timestamp': timestamp,
>   'label': self.config.getLabel(),
>   *'publicHostname': hostname.public_hostname(),*
>   'agentVersion': version,
>   'actualState': actualState,
>   'expectedState': expectedState,
>   'allocatedPorts': allocated_ports,
>   'logFolders': log_folders,
>   'tags': tags
>  }
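
The suggestion in the report could look roughly like the sketch below. It is not Slider's hostname.py: the function signature, the `configured_hostname` parameter, and the way a NodeManager property value would reach the agent are all assumptions made for illustration. The idea is only that an explicitly configured name wins over whatever socket.getfqdn() returns, and that a wildcard bind address is ignored.

```python
import socket

# Wildcard bind addresses say nothing about which hostname clients should use.
_WILDCARD_ADDRESSES = ("0.0.0.0", "::")


def public_hostname(configured_hostname=None, resolver=socket.getfqdn):
    """Return the hostname to register (hypothetical helper, not Slider code).

    configured_hostname is assumed to hold a value taken from the NodeManager
    configuration, e.g. yarn.nodemanager.hostname; resolver is the fallback.
    """
    if configured_hostname and configured_hostname not in _WILDCARD_ADDRESSES:
        return configured_hostname
    return resolver()
```

With such a helper, the registration message would carry f-bcpc-vm3 whenever that name is configured, and would fall back to the current behaviour otherwise.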



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SLIDER-1262) Slider functests are failing in Kerberized environment

2018-03-05 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1262.

   Resolution: Fixed
Fix Version/s: Slider 1.0.0

> Slider functests are failing in Kerberized environment
> --
>
> Key: SLIDER-1262
> URL: https://issues.apache.org/jira/browse/SLIDER-1262
> Project: Slider
> Issue Type: Bug
> Components: test
> Affects Versions: Slider 0.92
> Reporter: Gyula Komlossi
> Assignee: Gyula Komlossi
> Priority: Major
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1262.patch
>
>
> After a change in Hadoop's Configuration class (which introduced a parser 
> restriction), the majority of the integration tests started failing with the 
> error message:
> *AssertionError: Auth User is not Kerberized  (auth:SIMPLE) -security 
> has already been set up with the wrong authentication method. This can occur 
> if a file system has already been created prior to the loading of the 
> security configuration*
>  
> This is most likely caused by the early initialisation of the Hadoop 
> configuration. The base class of the Slider integration tests loads the 
> Slider client configuration (slider-client.xml) using Hadoop’s Configuration 
> class, which contains a static initialiser block. When this block is 
> executed, the settings from core-default.xml are loaded with SIMPLE 
> authentication, but the core-site.xml that specifies Kerberos authentication 
> is not available on the classpath. While loading the configuration files 
> (because of the change mentioned above), Hadoop implicitly logs in the test 
> user with SIMPLE authentication; when the tests later try to authenticate 
> the same user with the Kerberos keytab, an error is thrown.
>  
> The tests could be fixed by adding the actual *-site.xml files to the 
> classpath in the pom.xml of slider-funtest (using 
>  for the maven-failsafe-plugin).





[jira] [Commented] (SLIDER-1262) Slider functests are failing in Kerberized environment

2018-03-05 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386335#comment-16386335
 ] 

Billie Rinaldi commented on SLIDER-1262:


Thanks for the patch, [~gkomlossi]! This change looks fine to me. I'll commit 
it to develop.



[jira] [Resolved] (SLIDER-1233) Lost nodes should not contribute to container failures

2017-10-11 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1233.

Resolution: Fixed

> Lost nodes should not contribute to container failures
> --
>
> Key: SLIDER-1233
> URL: https://issues.apache.org/jira/browse/SLIDER-1233
> Project: Slider
> Issue Type: Bug
> Components: core
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1233.001.patch, SLIDER-1233.002.patch
>
>
> If a container completes due to an NM being lost, we should not count this 
> towards container failures that may eventually cause the AM to fail the 
> application. We are already using a ContainerOutcome of Completed (rather 
> than Failed) for this type of container exit, so we just need to change the 
> failure counting in that case. Other failure types associated with Completed 
> are killed by the AM, killed by the RM, and killed after app completion, none 
> of which need to contribute to container failures.
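
A minimal sketch of the counting rule described above, under stated assumptions: the enum lists only the two outcomes named in the issue text (the real ContainerOutcome may have more), and `updated_failure_count` is a hypothetical helper, not a method from the Slider code base.

```python
from enum import Enum


class ContainerOutcome(Enum):
    # Only the outcomes mentioned in the issue; assumed for illustration.
    COMPLETED = "Completed"
    FAILED = "Failed"


def updated_failure_count(current_failures, outcome):
    """Count a container exit as a failure only when its outcome is FAILED.

    Exits classified as COMPLETED (lost NM, killed by the AM or RM, killed
    after app completion) leave the failure count unchanged.
    """
    if outcome is ContainerOutcome.FAILED:
        return current_failures + 1
    return current_failures
```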



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

2017-09-29 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186187#comment-16186187
 ] 

Billie Rinaldi commented on SLIDER-1246:


bq. it seems like the health threshold check will not be effective unless we 
consider the age of the containers
After further contemplation of the feature, it seems the effective failure 
condition for apps under this implementation is (# of non-blacklisted nodes < 
health fraction * desired containers) for an amount of time greater than the 
health window. IMO this is not ideal, as the condition would never be true 
without blacklisting and "less than 80% of containers healthy" is a much more 
understandable criterion. This could be solved by placing a condition on how 
long a container must be running before it would be counted as healthy.
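
As a rough illustration of that idea, the sketch below counts a container as healthy only once it has been running for a minimum time. The function name, its parameters, and the use of plain start timestamps are assumptions; nothing here is taken from the patch.

```python
def healthy_fraction(container_start_times, desired_containers, now, min_age_seconds):
    """Fraction of desired containers that have run for at least min_age_seconds.

    container_start_times holds the start time (in seconds) of every container
    that is currently running; containers younger than min_age_seconds are
    ignored, so an app that restarts constantly never looks healthy.
    """
    if desired_containers <= 0:
        return 1.0
    healthy = sum(
        1 for started in container_start_times if now - started >= min_age_seconds
    )
    return healthy / desired_containers
```

With four desired containers, two of which restarted a few seconds ago, the fraction is 0.5, which a threshold of 80% would treat as unhealthy.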

However, this implementation does meet my basic requirement that it will 
eventually kill an app whose containers are constantly failing. I am okay with 
us committing the feature without a health condition for containers (once the 
other issues are addressed).

> Application health should not be affected by faulty nodes
> -
>
> Key: SLIDER-1246
> URL: https://issues.apache.org/jira/browse/SLIDER-1246
> Project: Slider
> Issue Type: Bug
> Components: appmaster, core
> Affects Versions: Slider 0.92
> Reporter: Prasanth Jayachandran
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch, 
> SLIDER-1246.03.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an 
> application failure. 
> Observed this in HIVE-16927, where container failures in certain nodes brings 
> down entire application. Slider has to provide a way to not mark application 
> as unhealthy if certain threshold of containers are running. Tuning failure 
> threshold is not optimal as setting the correct default on large cluster is 
> not trivial. Beyond certain failures, slider should mark the node as 
> unhealthy and report that back to client/AM. Application could continue to 
> run as long as container request is satisfied partially (example: 80% 
> containers are running).





[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

2017-09-29 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186065#comment-16186065
 ] 

Billie Rinaldi commented on SLIDER-1246:


[~gsaha], thanks for the new patch! I think we can still clean up the 
global/final config handling in scheduleHealthThresholdMonitor (this is not 
needed because global properties have already been propagated to the component 
properties … so if the property exists in global and not in component, it will 
have been copied to the component).

Secondly, I realized this morning that there will be an issue if unique 
component names are enabled. When unique component names are enabled, there is a 
separate ProviderRole and RoleStatus for each instance (solr1, solr2, etc.) and 
the desired count for each is 1 (or 0), so the desired count for the role group 
can’t be obtained from the RoleStatus.
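
One way to picture the problem: with unique names, the group's desired count has to be summed over its per-instance entries. The sketch below assumes each role status can be reduced to a dict with `name`, `group`, and `desired` keys; that shape is invented for illustration and does not reflect the actual RoleStatus class.

```python
from collections import defaultdict


def desired_count_by_group(role_statuses):
    """Sum the per-instance desired counts for each role group."""
    totals = defaultdict(int)
    for status in role_statuses:
        totals[status["group"]] += status["desired"]
    return dict(totals)


# Hypothetical per-instance entries when unique component names are enabled.
statuses = [
    {"name": "solr1", "group": "solr", "desired": 1},
    {"name": "solr2", "group": "solr", "desired": 1},
    {"name": "solr3", "group": "solr", "desired": 0},
]
group_totals = desired_count_by_group(statuses)
```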

If you have an app or unit test that you are using for testing, I would 
recommend running the same test with and without unique component names 
enabled. I would expect there to be the same behavior for both.



[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

2017-09-29 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185999#comment-16185999
 ] 

Billie Rinaldi commented on SLIDER-1246:


bq. there is an advantage for the containers such that if one fails and a new 
one is allocated by Yarn (within the poll frequency) then it might not dip the 
health percent at all
This is not an advantage for an unhealthy app where the containers always fail. 
I think what you're saying is that nodes will eventually be blacklisted and 
this will cause the health threshold to dip, once enough nodes are blacklisted 
that the container requests can't be satisfied. Let's say we have 100 nodes and 
an app with 10 containers with a health threshold of 80%. We would need 93 
nodes to be blacklisted to fall below the health threshold, which would mean at 
least 93*(3 + 1) = 372 container failures before the health threshold would be 
invoked. Seems like a lot, but this is better than I thought it would be 
because I had forgotten to consider the blacklisting feature.
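
The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and reading "3 + 1" as a node failure threshold of 3 plus the one failure that exceeds it is an assumption about the comment, not something the thread states.

```python
def blacklisted_nodes_needed(total_nodes, desired_containers, health_percent):
    """Nodes that must be blacklisted before the health threshold is missed."""
    # Smallest whole number of containers that meets the threshold
    # (integer ceiling division avoids floating-point rounding).
    min_healthy = -((-health_percent * desired_containers) // 100)
    # The threshold is missed once fewer than min_healthy nodes remain usable.
    max_usable_while_unhealthy = min_healthy - 1
    return total_nodes - max_usable_while_unhealthy


def failures_before_threshold(blacklisted_nodes, node_failure_threshold):
    """Minimum container failures needed to blacklist that many nodes."""
    return blacklisted_nodes * (node_failure_threshold + 1)


nodes = blacklisted_nodes_needed(total_nodes=100, desired_containers=10, health_percent=80)
failures = failures_before_threshold(nodes, node_failure_threshold=3)
```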



[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

2017-09-28 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184698#comment-16184698
 ] 

Billie Rinaldi commented on SLIDER-1246:


Hi [~gsaha], thanks for taking on this patch. Comments follow:
* rename CONTAINER_HEALTH_THRESHOLD_DISABLED_PERCENT to 
CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED for clarity / to match existing 
convention
* create a DEFAULT_CONTAINER_HEALTH_THRESHOLD_PERCENT property which is set to 
CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED and use it (some method or 
constructor calls that take a default are using the disabled percent and some 
are hardcoded to -1)
* in scheduleHealthThresholdMonitor the global vs. component option handling is 
unnecessary and should be removed. Slider handles this automatically, so you 
only need to retrieve the component options
* based on the javadocs, it looks like appMaster.queue should be used instead 
of appMaster.signalAMComplete to queue the stop action
* in MonitorHealthThreshold, I would set currentTimestamp = now() and then 
optionally set firstOccurrenceTimestamp = currentTimestamp
* the separation between AppMaster#scheduleHealthThresholdMonitor, 
MonitorHealthThreshold, and AppState is a bit muddy. RoleStatus and 
ProviderRole do not need to be used outside of AppState. 
AppMaster#scheduleHealthThresholdMonitor can iterate over the resource 
components instead of the role status map. In MonitorHealthThreshold you can 
store the name instead of the role status, and you would just need to add a 
couple of methods like appState.isHealthThresholdMet(name) and 
appState.setHealthThresholdMonitorEnabled(name)
* as discussed previously offline, I don't think the failure threshold should 
be automatically disabled when the health percent is enabled. But since we 
disagree on this, I am okay with having the automatic disable until someone 
expresses interest in using both features
* it seems like the health threshold check will not be effective unless we 
consider the age of the containers. I can imagine that an app that is 
restarting containers constantly would by chance be able to meet the health 
threshold. Have you tested this scenario?
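
The timestamp handling suggested for MonitorHealthThreshold might be sketched as follows. This is an illustrative Python stand-in, not the Java class from the patch: the class name, constructor parameters, and the `check` method are assumptions, and the caller passes in the current time once per poll.

```python
class HealthThresholdMonitor:
    """Tracks how long the healthy percentage has stayed below a threshold."""

    def __init__(self, threshold_percent, window_seconds):
        self.threshold_percent = threshold_percent
        self.window_seconds = window_seconds
        self.first_occurrence = None  # time of the first poll below threshold

    def check(self, healthy_percent, now):
        """Return True once health has been below threshold for longer than the window."""
        if healthy_percent >= self.threshold_percent:
            # Threshold met: forget any earlier dip.
            self.first_occurrence = None
            return False
        if self.first_occurrence is None:
            # Take the current timestamp once and reuse it for the first occurrence.
            self.first_occurrence = now
        return now - self.first_occurrence > self.window_seconds
```

A recovery above the threshold resets the window, so only a sustained dip would trigger the stop action.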



[jira] [Resolved] (SLIDER-1242) Review uses of double-checked locking

2017-09-06 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1242.

Resolution: Fixed

> Review uses of double-checked locking
> -
>
> Key: SLIDER-1242
> URL: https://issues.apache.org/jira/browse/SLIDER-1242
> Project: Slider
> Issue Type: Bug
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1242.1.patch
>
>
> There are several places where we perform double-checked locking. Even though 
> the practice is discouraged, I believe it is technically correct when the 
> check is performed on the presence of a ConcurrentHashMap key, which is how 
> we are using it.
> However, in two places, AgentProviderService#getCurrentExports and 
> AgentProviderService#getAllocatedPorts, containsKey is used instead of get to 
> perform the check. I am seeing some indication that containsKey is not 
> sufficient, and that get must be used for double-checked locking to be 
> correct. There is a comment in the ConcurrentHashMap#containsKey method that 
> says "same as get() except no need for volatile value read" -- and I think 
> that volatile value read is what we need for correctness.
> Also, in the [ConcurrentHashMap api 
> doc|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html],
>  it specifically mentions get and does not mention containsKey: "Any non-null 
> result returned from get(key) and related access methods bears a 
> happens-before relation with the associated insertion or update" and "an 
> update operation for a given key bears a happens-before relation with any 
> (non-null) retrieval for that key reporting the updated value."
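
The control flow under discussion can be sketched as below. The issue itself concerns Java's ConcurrentHashMap and its memory-model guarantees, which a Python dict does not share, so this only mirrors the shape of the pattern: both checks use a single get-style read and test the returned value, rather than testing for the key and reading separately. The class and method names are hypothetical.

```python
import threading


class LazyCache:
    """Double-checked lazy initialisation keyed by name (illustrative only)."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def get_or_create(self, key, factory):
        # First check, without the lock: read the value itself.
        value = self._values.get(key)
        if value is None:
            with self._lock:
                # Second check, under the lock, again via get.
                value = self._values.get(key)
                if value is None:
                    value = factory()  # must not return None
                    self._values[key] = value
        return value
```

As with ConcurrentHashMap, the sketch relies on stored values never being None, so that a None result always means "absent".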





[jira] [Commented] (SLIDER-1242) Review uses of double-checked locking

2017-09-06 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155454#comment-16155454
 ] 

Billie Rinaldi commented on SLIDER-1242:


I missed one use of containsKey in ComponentTagProvider.



[jira] [Reopened] (SLIDER-1242) Review uses of double-checked locking

2017-09-06 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi reopened SLIDER-1242:




[jira] [Resolved] (SLIDER-1242) Review uses of double-checked locking

2017-09-05 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1242.

Resolution: Fixed



[jira] [Commented] (SLIDER-1242) Review uses of double-checked locking

2017-09-05 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154245#comment-16154245
 ] 

Billie Rinaldi commented on SLIDER-1242:


Thanks for the review, [~gsaha]!



[jira] [Resolved] (SLIDER-1244) Stop logging openssl commands on exception

2017-08-31 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1244.

Resolution: Fixed

> Stop logging openssl commands on exception
> --
>
> Key: SLIDER-1244
> URL: https://issues.apache.org/jira/browse/SLIDER-1244
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1244.1.patch
>
>
> In CertificateManager#runCommand, if the openssl command fails, the command 
> string is passed to SliderException as a message, which could result in it 
> being logged.
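
The general remedy can be sketched as follows, assuming the goal is simply to keep the command line out of the exception text. This is a Python stand-in with hypothetical names (`run_command`, `CommandFailedError`), not the CertificateManager code or the attached patch.

```python
import subprocess


class CommandFailedError(Exception):
    """Raised when an external command exits with a non-zero status."""


def run_command(args):
    """Run an external command without echoing its arguments on failure."""
    completed = subprocess.run(
        args, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    if completed.returncode != 0:
        # Report only the exit code: the argument list may contain secrets
        # such as key passphrases, and exception messages tend to be logged.
        raise CommandFailedError(
            "external command failed with exit code %d" % completed.returncode
        )
```

Callers that need more detail for debugging could log a redacted form of the command separately, at a level that is off by default.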





[jira] [Updated] (SLIDER-1245) Clean up AgentResource

2017-08-31 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1245:
---
Attachment: SLIDER-1245.1.patch

> Clean up AgentResource
> --
>
> Key: SLIDER-1245
> URL: https://issues.apache.org/jira/browse/SLIDER-1245
> Project: Slider
> Issue Type: Bug
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1245.1.patch
>
>
> Remove the unused variable agent_name.





[jira] [Created] (SLIDER-1245) Clean up AgentResource

2017-08-31 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1245:
--

 Summary: Clean up AgentResource
 Key: SLIDER-1245
 URL: https://issues.apache.org/jira/browse/SLIDER-1245
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


Remove the unused variable agent_name.





[jira] [Commented] (SLIDER-1242) Review uses of double-checked locking

2017-08-30 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148114#comment-16148114
 ] 

Billie Rinaldi commented on SLIDER-1242:


It looks like the apidoc says that ConcurrentHashMap does not allow null 
values. In fact, I happened to change my SDK to Java 1.8, and the containsKey 
method no longer has the comment about not doing a volatile value read; it just 
uses return get(key) != null. However, it still might be useful to apply this 
patch since other versions of Java might be used.

> Review uses of double-checked locking
> -
>
> Key: SLIDER-1242
> URL: https://issues.apache.org/jira/browse/SLIDER-1242
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1242.1.patch
>
>
> There are several places where we perform double-checked locking. Even though 
> the practice is discouraged, I believe it is technically correct when the 
> check is performed on the presence of a ConcurrentHashMap key, which is how 
> we are using it.
> However, in two places, AgentProviderService#getCurrentExports and 
> AgentProviderService#getAllocatedPorts, containsKey is used instead of get to 
> perform the check. I am seeing some indication that containsKey is not 
> sufficient, and that get must be used for double-checked locking to be 
> correct. There is a comment in the ConcurrentHashMap#containsKey method that 
> says "same as get() except no need for volatile value read" -- and I think 
> that volatile value read is what we need for correctness.
> Also, in the [ConcurrentHashMap api 
> doc|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html],
>  it specifically mentions get and does not mention containsKey: "Any non-null 
> result returned from get(key) and related access methods bears a 
> happens-before relation with the associated insertion or update" and "an 
> update operation for a given key bears a happens-before relation with any 
> (non-null) retrieval for that key reporting the updated value."





[jira] [Updated] (SLIDER-1244) Stop logging openssl commands on exception

2017-08-30 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1244:
---
Attachment: SLIDER-1244.1.patch

> Stop logging openssl commands on exception
> --
>
> Key: SLIDER-1244
> URL: https://issues.apache.org/jira/browse/SLIDER-1244
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1244.1.patch
>
>
> In CertificateManager#runCommand, if the openssl command fails, the command 
> string is passed to SliderException as a message, which could result in it 
> being logged.





[jira] [Created] (SLIDER-1244) Stop logging openssl commands on exception

2017-08-30 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1244:
--

 Summary: Stop logging openssl commands on exception
 Key: SLIDER-1244
 URL: https://issues.apache.org/jira/browse/SLIDER-1244
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


In CertificateManager#runCommand, if the openssl command fails, the command 
string is passed to SliderException as a message, which could result in it 
being logged.
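
A minimal sketch of the idea, assuming a hypothetical runner class rather than the real CertificateManager#runCommand: on failure, the exception message carries only the program name and the exit code, so arguments (which may include passwords) never reach a log. CommandRunner, CommandFailedException, and failureMessage are illustrative names, not Slider APIs.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical runner; not the actual CertificateManager code.
class CommandRunner {
  // Hypothetical stand-in for SliderException.
  static class CommandFailedException extends Exception {
    CommandFailedException(String message) {
      super(message);
    }
  }

  // Builds a message from the program name and exit code only.
  // The arguments are deliberately left out because they may hold secrets.
  static String failureMessage(List<String> command, int exitCode) {
    String program = command.isEmpty() ? "<empty>" : command.get(0);
    return "Command " + program + " failed with exit code " + exitCode;
  }

  static void run(List<String> command)
      throws CommandFailedException, IOException, InterruptedException {
    // inheritIO() passes the child's output through instead of buffering it.
    Process process = new ProcessBuilder(command).inheritIO().start();
    int exitCode = process.waitFor();
    if (exitCode != 0) {
      throw new CommandFailedException(failureMessage(command, exitCode));
    }
  }
}
```

With this shape, a caller that logs the exception sees which program failed and how, but not the full command line.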





[jira] [Commented] (SLIDER-1243) Enable XML validation in ConfigHelper

2017-08-29 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146179#comment-16146179
 ] 

Billie Rinaldi commented on SLIDER-1243:


+1, looks fine to me.

> Enable XML validation in ConfigHelper
> -
>
> Key: SLIDER-1243
> URL: https://issues.apache.org/jira/browse/SLIDER-1243
> Project: Slider
>  Issue Type: Bug
>  Components: core
>Affects Versions: Slider 0.92
>Reporter: Gour Saha
>Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1243.01.patch
>
>
> The parseConfigXML() method in ConfigHelper.java performs no validation 
> before parsing XML from the input stream.
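
The ticket text does not say which checks the patch adds. One possible reading, sketched here purely as an assumption, is hardening the parser before it touches the stream: turn on secure processing and reject documents that carry a DOCTYPE. HardenedXmlParser and accepts are hypothetical names; the feature URI is the Xerces-style one that the JDK's built-in parser recognizes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper; not the ConfigHelper code from the patch.
class HardenedXmlParser {
  static Document parse(InputStream in)
      throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // Ask the JAXP implementation to apply its secure-processing limits.
    factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
    // Reject any document that carries a DOCTYPE declaration.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    factory.setXIncludeAware(false);
    factory.setExpandEntityReferences(false);
    return factory.newDocumentBuilder().parse(in);
  }

  // Convenience wrapper: true if the text parses under the settings above.
  static boolean accepts(String xml) {
    try {
      parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return true;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      return false;
    }
  }
}
```

A plain Hadoop-style configuration document passes, while one that declares entities in a DOCTYPE is refused before any entity is expanded.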





[jira] [Updated] (SLIDER-1242) Review uses of double-checked locking

2017-08-25 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1242:
---
Attachment: SLIDER-1242.1.patch






[jira] [Created] (SLIDER-1242) Review uses of double-checked locking

2017-08-25 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1242:
--

 Summary: Review uses of double-checked locking
 Key: SLIDER-1242
 URL: https://issues.apache.org/jira/browse/SLIDER-1242
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


There are several places where we perform double-checked locking. Even though 
the practice is discouraged, I believe it is technically correct when the check 
is performed on the presence of a ConcurrentHashMap key, which is how we are 
using it.

However, in two places, AgentProviderService#getCurrentExports and 
AgentProviderService#getAllocatedPorts, containsKey is used instead of get to 
perform the check. I am seeing some indication that containsKey is not 
sufficient, and that get must be used for double-checked locking to be correct. 
There is a comment in the ConcurrentHashMap#containsKey method that says "same 
as get() except no need for volatile value read" -- and I think that volatile 
value read is what we need for correctness.

Also, in the [ConcurrentHashMap api 
doc|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html],
 it specifically mentions get and does not mention containsKey: "Any non-null 
result returned from get(key) and related access methods bears a happens-before 
relation with the associated insertion or update" and "an update operation for 
a given key bears a happens-before relation with any (non-null) retrieval for 
that key reporting the updated value."
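
A minimal sketch of the get-based form of the check, assuming a hypothetical cache class rather than the real AgentProviderService methods. Both the unlocked check and the check under the lock use get, so a non-null result falls under the happens-before guarantee quoted above. ExportCache, getExport, and loadCount are illustrative names only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache; not the actual AgentProviderService code.
class ExportCache {
  private final Map<String, String> exports = new ConcurrentHashMap<>();
  private int loads = 0; // how often the slow path ran; guarded by "this"

  String getExport(String key) {
    // First check without the lock, using get() rather than containsKey().
    String value = exports.get(key);
    if (value == null) {
      synchronized (this) {
        // Second check under the lock, again with get().
        value = exports.get(key);
        if (value == null) {
          value = "export-for-" + key; // placeholder for the expensive computation
          loads++;
          exports.put(key, value);
        }
      }
    }
    return value;
  }

  synchronized int loadCount() {
    return loads;
  }
}
```

Because ConcurrentHashMap never stores null values, a null from get means "absent", so the null test can stand in for containsKey in both checks.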





[jira] [Resolved] (SLIDER-1237) Remove usages of printStackTrace

2017-08-22 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1237.

Resolution: Fixed

> Remove usages of printStackTrace
> 
>
> Key: SLIDER-1237
> URL: https://issues.apache.org/jira/browse/SLIDER-1237
> Project: Slider
>  Issue Type: Bug
>Affects Versions: Slider 0.92
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Minor
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1237.1.patch, SLIDER-1237.2.patch
>
>
> We should clean up usages of printStackTrace in favor of better log messages.





[jira] [Commented] (SLIDER-1238) Remove unused method in AMWebClient

2017-08-22 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136851#comment-16136851
 ] 

Billie Rinaldi commented on SLIDER-1238:


+1, the patch looks good to me.

> Remove unused method in AMWebClient
> ---
>
> Key: SLIDER-1238
> URL: https://issues.apache.org/jira/browse/SLIDER-1238
> Project: Slider
>  Issue Type: Bug
>  Components: core, registry
>Affects Versions: Slider 0.92
>Reporter: Gour Saha
>Assignee: Gour Saha
> Attachments: SLIDER-1238.01.patch
>
>
> Remove the unused method getUrlConnectionClientHandler from AMWebClient





[jira] [Updated] (SLIDER-1237) Remove usages of printStackTrace

2017-08-22 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1237:
---
Attachment: SLIDER-1237.2.patch

Thanks Gour, I'll commit this patch including your suggestions after it 
completes a unit test run.






[jira] [Updated] (SLIDER-1237) Remove usages of printStackTrace

2017-08-18 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1237:
---
Attachment: SLIDER-1237.1.patch

> Remove usages of printStackTrace
> 
>
> Key: SLIDER-1237
> URL: https://issues.apache.org/jira/browse/SLIDER-1237
> Project: Slider
>  Issue Type: Bug
>Affects Versions: Slider 0.92
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Minor
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1237.1.patch
>
>
> We should clean up usages of printStackTrace in favor of better log messages.





[jira] [Created] (SLIDER-1237) Remove usages of printStackTrace

2017-08-18 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1237:
--

 Summary: Remove usages of printStackTrace
 Key: SLIDER-1237
 URL: https://issues.apache.org/jira/browse/SLIDER-1237
 Project: Slider
  Issue Type: Bug
Affects Versions: Slider 0.92
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
Priority: Minor
 Fix For: Slider 1.0.0


We should clean up usages of printStackTrace in favor of better log messages.
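
As an illustration of the kind of change meant here, the sketch below replaces a printStackTrace call with a log message that names the bad input and attaches the exception. It uses java.util.logging only to stay self-contained; the logging framework in Slider itself may differ, and StackTraceCleanup and parsePort are hypothetical names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class; not taken from the Slider code base.
class StackTraceCleanup {
  private static final Logger LOG =
      Logger.getLogger(StackTraceCleanup.class.getName());

  static int parsePort(String raw, int fallback) {
    try {
      return Integer.parseInt(raw);
    } catch (NumberFormatException e) {
      // Before: e.printStackTrace();
      // After: a message with context, with the exception passed to the logger.
      LOG.log(Level.WARNING,
          "Invalid port value '" + raw + "', using " + fallback, e);
      return fallback;
    }
  }
}
```

The logged message says what went wrong and what was done about it, and the stack trace still goes through the configured log handlers instead of straight to stderr.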





[jira] [Commented] (SLIDER-1236) Unnecessary 10 second sleep before installation

2017-08-18 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133172#comment-16133172
 ] 

Billie Rinaldi commented on SLIDER-1236:


I think it would be okay to leave in the change of HEARTBEAT_IDDLE_INTERVAL_SEC 
from 10 to 1 here, but I agree we should open an additional ticket to evaluate 
failure scenarios further.

> Unnecessary 10 second sleep before installation
> ---
>
> Key: SLIDER-1236
> URL: https://issues.apache.org/jira/browse/SLIDER-1236
> Project: Slider
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Gour Saha
>
> Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
> {noformat}
> 2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> ...
> 2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> {noformat}
> Then nothing useful goes on for a while, until:
> {noformat}
> 2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Installing LLAP on 
> container_1495490227300_0002_01_02.
> {noformat}
> If you look at the corresponding logs from both agents, you can see that they 
> both have a gap that's pretty much exactly 10sec.
> After the gap, they talk back to AM; after ~30ms for each container 
> (corresponding to the end of its gap), presumably after hearing from it, the 
> AM starts installing LLAP.
> {noformat}
> INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server 
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: 
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with 
> response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. 
> Components mapped: False
> INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server 
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: 
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with 
> response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. 
> Components mapped: False
> {noformat}
> I've observed the same on multiple different clusters.





[jira] [Commented] (SLIDER-1236) Unnecessary 10 second sleep before installation

2017-08-17 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130893#comment-16130893
 ] 

Billie Rinaldi commented on SLIDER-1236:


It might be easiest to make it configurable through the agent.ini file (though 
configuring that is relatively inconvenient). The number of apps that are 
trying to reconnect to their AM is less important than the number of 
containers. Is it still okay when one AM fails in an app that has a large 
number of containers? Will the AM easily support 10x as many heartbeats when 
there are a large number of containers? What happens if the AM hasn't failed, 
and it just wasn't responding for 3 seconds? (I would imagine the AM is robust 
to unnecessary re-registration, but am not familiar with that part of the code.)

It also appears that there are random sleeps between 0 and 30 seconds if an 
Exception is thrown from a registration or heartbeat (presumably the sleep is 
randomized so that subsequent attempts are staggered when there are a lot of 
agents). I don't know what could cause the Exception, but perhaps the 30 should 
be lowered a bit as well?
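
A small sketch of a randomized retry delay with a configurable upper bound, assuming the agent simply draws a whole number of seconds between 0 and the maximum. The agent itself is Python; this Java version only illustrates the staggering idea, and RetryJitter and nextDelaySec are hypothetical names.

```java
import java.util.Random;

// Hypothetical helper; illustrates the staggering idea only.
class RetryJitter {
  // Returns a delay in seconds drawn uniformly from 0..maxDelaySec inclusive.
  static int nextDelaySec(Random random, int maxDelaySec) {
    // nextInt(bound) excludes the bound, so add 1 to allow maxDelaySec itself.
    return random.nextInt(maxDelaySec + 1);
  }
}
```

Lowering the maximum shortens the worst-case wait after an exception, at the cost of spreading retries from many agents over a narrower window.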






[jira] [Commented] (SLIDER-1236) Unnecessary 10 second sleep before installation

2017-08-17 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130370#comment-16130370
 ] 

Billie Rinaldi commented on SLIDER-1236:


Hey [~gsaha], there could be a problem with decreasing the 
HEARTBEAT_IDDLE_INTERVAL_SEC so much. The Controller uses a count * 
HEARTBEAT_IDDLE_INTERVAL_SEC to determine when it should re-read the AM 
location from ZK and re-register with the AM. So now this will be triggered 
after 3 seconds instead of after 30 seconds.
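
The arithmetic can be written out as below. The count of 3 is inferred from the figures in this comment (30 seconds at a 10 second interval), not read from the Controller source, and HeartbeatTiming and reRegisterAfterSec are hypothetical names.

```java
// Hypothetical illustration of the threshold arithmetic described above.
class HeartbeatTiming {
  // Assumed value: 30 s at a 10 s interval implies a count of 3.
  static final int MISSED_HEARTBEAT_COUNT = 3;

  // Seconds of silence before the agent re-reads the AM location and re-registers.
  static int reRegisterAfterSec(int idleIntervalSec) {
    return MISSED_HEARTBEAT_COUNT * idleIntervalSec;
  }
}
```

Under that assumption, reRegisterAfterSec(10) gives 30 and reRegisterAfterSec(1) gives 3, which is the drop being pointed out.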






[jira] [Resolved] (SLIDER-1234) Slider JsonSerDeser can use readFully instead of read to avoid " Read finished prematurely"

2017-08-01 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1234.

Resolution: Fixed

> Slider JsonSerDeser can use readFully instead of read to avoid " Read 
> finished prematurely"
> ---
>
> Key: SLIDER-1234
> URL: https://issues.apache.org/jira/browse/SLIDER-1234
> Project: Slider
>  Issue Type: Bug
>  Components: core
>Affects Versions: Slider 0.92
>Reporter: Prabhu Joseph
>Assignee: Billie Rinaldi
> Attachments: SLIDER-1234.1.patch
>
>
> Slider JsonSerDeser uses FSDataInputStream#read() inside its load method, 
> which sometimes fails with "Read finished prematurely" when data sent over 
> the socket is lost.
> It would be better to use readFully(), which can avoid this.
> {code}
> Exception: Read finished prematurely 
> 2017-05-16 12:34:33,329 [main] ERROR main.ServiceLauncher - Exception: Read 
> finished prematurely 
> java.io.EOFException: Read finished prematurely 
> at org.apache.slider.core.persist.JsonSerDeser.load(JsonSerDeser.java:204) 
> at 
> org.apache.slider.core.persist.ConfPersister.loadConf(ConfPersister.java:230) 
> at org.apache.slider.core.persist.ConfPersister.load(ConfPersister.java:277) 
> at 
> org.apache.slider.core.build.InstanceIO.loadInstanceDefinitionUnresolved(InstanceIO.java:54)
>  
> at 
> org.apache.slider.client.SliderClient.loadInstanceDefinitionUnresolved(SliderClient.java:1913)
>  
> at org.apache.slider.client.SliderClient.actionCreate(SliderClient.java:703) 
> at org.apache.slider.client.SliderClient.exec(SliderClient.java:388) 
> at org.apache.slider.client.SliderClient.runService(SliderClient.java:349) 
> at 
> org.apache.slider.core.main.ServiceLauncher.launchService(ServiceLauncher.java:188)
>  
> at 
> org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(ServiceLauncher.java:475)
>  
> at 
> org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(ServiceLauncher.java:403)
>  
> at 
> org.apache.slider.core.main.ServiceLauncher.serviceMain(ServiceLauncher.java:630)
>  
> at org.apache.slider.Slider.main(Slider.java:49) 
> {code}
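
The difference between the two calls can be shown with plain java.io classes. In this sketch a DataInputStream stands in for FSDataInputStream (which extends it), and TrickleStream is a made-up stream that hands back at most one byte per call to imitate a short read. None of this is the JsonSerDeser code itself.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo; not the JsonSerDeser implementation.
class ReadFullyDemo {
  // Returns at most one byte per bulk read, imitating a short read.
  static class TrickleStream extends InputStream {
    private final InputStream delegate;

    TrickleStream(byte[] data) {
      delegate = new ByteArrayInputStream(data);
    }

    @Override
    public int read() throws IOException {
      return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      return delegate.read(b, off, Math.min(len, 1));
    }
  }

  // A single read() call may return fewer bytes than the buffer holds.
  static int bytesFromSingleRead(byte[] data) {
    byte[] buffer = new byte[data.length];
    try (InputStream in = new TrickleStream(data)) {
      return in.read(buffer, 0, buffer.length);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // readFully() keeps reading until the buffer is full or the stream ends.
  static byte[] readAll(byte[] data) {
    byte[] buffer = new byte[data.length];
    try (DataInputStream in = new DataInputStream(new TrickleStream(data))) {
      in.readFully(buffer);
      return buffer;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Code that treats a short count from read() as an error will fail on such a stream even though all the data is there, whereas readFully() only throws EOFException when the stream really ends early.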





[jira] [Commented] (SLIDER-1234) Slider JsonSerDeser can use readFully instead of read to avoid " Read finished prematurely"

2017-08-01 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109287#comment-16109287
 ] 

Billie Rinaldi commented on SLIDER-1234:


Thanks, Gour. UTs pass for me as well, so I am committing.






[jira] [Updated] (SLIDER-1234) Slider JsonSerDeser can use readFully instead of read to avoid " Read finished prematurely"

2017-08-01 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1234:
---
Attachment: SLIDER-1234.1.patch






[jira] [Assigned] (SLIDER-1234) Slider JsonSerDeser can use readFully instead of read to avoid " Read finished prematurely"

2017-08-01 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi reassigned SLIDER-1234:
--

Assignee: Billie Rinaldi

> Slider JsonSerDeser can use readFully instead of read to avoid " Read 
> finished prematurely"
> ---
>
> Key: SLIDER-1234
> URL: https://issues.apache.org/jira/browse/SLIDER-1234
> Project: Slider
>  Issue Type: Bug
>  Components: core
>Affects Versions: Slider 0.92
>Reporter: Prabhu Joseph
>Assignee: Billie Rinaldi
>
> Slider JsonSerDeser uses FSDataInputStream#read() inside its load method, 
> which sometimes fails with "Read finished prematurely" when data sent over 
> the socket is lost.
> It would be better to use readFully(), which can avoid this.
> {code}
> Exception: Read finished prematurely 
> 2017-05-16 12:34:33,329 [main] ERROR main.ServiceLauncher - Exception: Read 
> finished prematurely 
> java.io.EOFException: Read finished prematurely 
> at org.apache.slider.core.persist.JsonSerDeser.load(JsonSerDeser.java:204) 
> at 
> org.apache.slider.core.persist.ConfPersister.loadConf(ConfPersister.java:230) 
> at org.apache.slider.core.persist.ConfPersister.load(ConfPersister.java:277) 
> at 
> org.apache.slider.core.build.InstanceIO.loadInstanceDefinitionUnresolved(InstanceIO.java:54)
>  
> at 
> org.apache.slider.client.SliderClient.loadInstanceDefinitionUnresolved(SliderClient.java:1913)
>  
> at org.apache.slider.client.SliderClient.actionCreate(SliderClient.java:703) 
> at org.apache.slider.client.SliderClient.exec(SliderClient.java:388) 
> at org.apache.slider.client.SliderClient.runService(SliderClient.java:349) 
> at 
> org.apache.slider.core.main.ServiceLauncher.launchService(ServiceLauncher.java:188)
>  
> at 
> org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(ServiceLauncher.java:475)
>  
> at 
> org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(ServiceLauncher.java:403)
>  
> at 
> org.apache.slider.core.main.ServiceLauncher.serviceMain(ServiceLauncher.java:630)
>  
> at org.apache.slider.Slider.main(Slider.java:49) 
> {code}





[jira] [Updated] (SLIDER-1233) Lost nodes should not contribute to container failures

2017-07-25 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1233:
---
Attachment: SLIDER-1233.002.patch

> Lost nodes should not contribute to container failures
> --
>
> Key: SLIDER-1233
> URL: https://issues.apache.org/jira/browse/SLIDER-1233
> Project: Slider
>  Issue Type: Bug
>  Components: core
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1233.001.patch, SLIDER-1233.002.patch
>
>
> If a container completes due to an NM being lost, we should not count this 
> towards container failures that may eventually cause the AM to fail the 
> application. We are already using a ContainerOutcome of Completed (rather 
> than Failed) for this type of container exit, so we just need to change the 
> failure counting in that case. Other failure types associated with Completed 
> are killed by the AM, killed by the RM, and killed after app completion, none 
> of which need to contribute to container failures.
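
The counting rule described above can be sketched as follows. This is an assumption-laden illustration, not the patch: Outcome is a two-value stand-in for Slider's ContainerOutcome, and FailureAccounting is a hypothetical class.

```java
// Hypothetical illustration of the counting rule; not the patch itself.
class FailureAccounting {
  // Two-value stand-in for ContainerOutcome.
  enum Outcome { COMPLETED, FAILED }

  private int failures = 0;

  // Only FAILED exits count. COMPLETED covers lost nodes and containers
  // killed by the AM, by the RM, or after app completion.
  void onContainerExit(Outcome outcome) {
    if (outcome == Outcome.FAILED) {
      failures++;
    }
  }

  int failures() {
    return failures;
  }
}
```

A container that exits because its NM was lost would be reported as COMPLETED here and so would leave the failure count unchanged.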



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SLIDER-1233) Lost nodes should not contribute to container failures

2017-07-25 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100281#comment-16100281
 ] 

Billie Rinaldi commented on SLIDER-1233:


Sounds good, I will adjust the comment and then commit. Thanks, [~gsaha]!

> Lost nodes should not contribute to container failures
> --
>
> Key: SLIDER-1233
> URL: https://issues.apache.org/jira/browse/SLIDER-1233
> Project: Slider
>  Issue Type: Bug
>  Components: core
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1233.001.patch
>
>
> If a container completes due to an NM being lost, we should not count this 
> towards container failures that may eventually cause the AM to fail the 
> application. We are already using a ContainerOutcome of Completed (rather 
> than Failed) for this type of container exit, so we just need to change the 
> failure counting in that case. Other failure types associated with Completed 
> are killed by the AM, killed by the RM, and killed after app completion, none 
> of which need to contribute to container failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SLIDER-1227) Component name with 3 "_" gives NPE in 0.92, is working in 0.80

2017-05-15 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011081#comment-16011081
 ] 

Billie Rinaldi commented on SLIDER-1227:


[~manojsamel], I have not thought of a good way to determine the group from the 
component name, since numbers and separators are allowed characters. This is 
why the group and name are both explicitly included in the label.

In my opinion, the current design is cleaner because the user does not need to 
know about Slider's internal concept of a group (except in this case where a 
bug surfaced). The components are declared the same way as they were before, 
and there is only the addition of a boolean configuration property to give the 
component instances unique IDs.
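
To illustrate why the group cannot be recovered from the name alone, here is a small sketch. It assumes a naive heuristic that strips a trailing run of digits and separators; `guessGroup` and the sample names are hypothetical and are not taken from Slider's code.

```java
class GroupGuessDemo {
    // Hypothetical heuristic: drop a trailing run of digits, '-' and '_'
    // in the hope that what remains is the group name.
    static String guessGroup(String componentName) {
        int end = componentName.length();
        while (end > 0) {
            char c = componentName.charAt(end - 1);
            if (Character.isDigit(c) || c == '-' || c == '_') {
                end--;
            } else {
                break;
            }
        }
        return componentName.substring(0, end);
    }

    public static void main(String[] args) {
        // Works when the suffix really is an instance number.
        System.out.println(guessGroup("worker-3"));  // prints "worker"
        // Fails when the declared name itself ends in digits.
        System.out.println(guessGroup("ten85"));     // prints "ten"
    }
}
```

Because digits and separators are legal in a declared name, no suffix-stripping rule of this kind can tell the two cases apart, which is the reason for carrying both values in the label.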

> Component name with 3 "_" gives NPE in 0.92, is working in 0.80
> ---
>
> Key: SLIDER-1227
> URL: https://issues.apache.org/jira/browse/SLIDER-1227
> Project: Slider
> Issue Type: Bug
> Components: core
> Affects Versions: Slider 0.92
> Reporter: Manoj Samel
>
> Running Slider 0.92 on CDH 5.5.1 (which is Hadoop 2.6), with Kerberos.
> I am deploying an application with multiple components. The components start
> but fail to heartbeat to the Slider AM. The Slider AM log shows an NPE at the
> container heartbeat URLs, as below.
> 2017-04-12 00:44:05,741 [2011871076@qtp-814377348-5] INFO  
> agent.AgentProviderService - Handling registration: responseId=-1
> timestamp=1491957845550
> label=container_e95_1476898378926_91401_01_03___solo___super
> hostname=node1078
> expectedState=INIT
> actualState=INIT
> appVersion=null
> 2017-04-12 00:44:05,741 [2011871076@qtp-814377348-5] INFO  
> agent.AgentProviderService - label: 
> container_e95_1476898378926_91401_01_03___solo___super pkg: null
> 2017-04-12 00:44:05,741 [2011871076@qtp-814377348-5] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> 2017-04-12 00:44:05,871 [Socket Reader #1 for port 32120] INFO  ipc.Server - 
> Auth successful for slideradmin@BIGDATA (auth:SIMPLE)
> 2017-04-12 00:44:05,873 [Socket Reader #1 for port 32120] INFO  
> authorize.ServiceAuthorizationManager - Authorization successful for 
> slideradmin@BIGDATA (auth:TOKEN) for protocol=interface 
> org.apache.slider.server.appmaster.rpc.SliderClusterProtocolPB
> 2017-04-12 00:44:15,749 [100585@qtp-814377348-7] ERROR mortbay.log - 
> /ws/v1/slider/agents/container_e95_1476898378926_91401_01_02___pdx__svt___ten85/heartbeat
> java.lang.NullPointerException
> at 
> org.apache.slider.providers.agent.AgentProviderService.handleHeartBeat(AgentProviderService.java:1090)
> at 
> org.apache.slider.server.appmaster.web.rest.agent.AgentResource.heartbeat(AgentResource.java:98)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
> at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
> at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
> at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at 
> com.sun.jersey.server.impl.uri.rules.SubLocatorRule.accept(SubLocatorRule.java:134)
> at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
> at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
> at 
> 

[jira] [Resolved] (SLIDER-1228) flooding Slider-AM log for "app state clusterNodes"

2017-05-12 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1228.

   Resolution: Fixed
Fix Version/s: Slider 1.0.0

> flooding Slider-AM log for "app state clusterNodes"
> ---
>
> Key: SLIDER-1228
> URL: https://issues.apache.org/jira/browse/SLIDER-1228
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1228.001.patch
>
>
> The following message is printed to the Slider AM log continuously.
> As a result, the Slider AM log file rotates quickly, and more important log
> entries can be lost.
> It also makes the other log entries difficult to find.
> {code}
> 2017-05-11 22:03:13,706 [61820230@qtp-579805167-115] INFO  state.AppState - 
> app state clusterNodes 
> {HBASE_REGIONSERVER={container_e14_1490326153792_0075_01_05=container_e14_1490326153792_0075_01_05:
>  3
> state: 3
> role: HBASE_REGIONSERVER
> host: test1304.test.com
> hostURL: http://test1304.test.com:8042
> command: python2.6 ./infra/agent/slider-agent/agent/main.py --label 
> container_e14_1490326153792_0075_01_05___HBASE_REGIONSERVER --zk-quorum 
> test1300.test.com:2181,test1301.test.com:2181,test1302.test.com:2181 
> --zk-reg-path /registry/users/yarn/services/org-apache-slider/hbase1 > 
> /slider-agent.out 2>&1 ;
> logLink: 
> http://test1304.test.com:8042/node/containerlogs/container_e14_1490326153792_0075_01_05/yarn
> , 
> container_e14_1490326153792_0075_01_03=container_e14_1490326153792_0075_01_03:
>  3
> state: 3
> role: HBASE_REGIONSERVER
> host: test1304.test.com
> hostURL: http://test1304.test.com:8042
> command: python2.6 ./infra/agent/slider-agent/agent/main.py --label 
> container_e14_1490326153792_0075_01_03___HBASE_REGIONSERVER --zk-quorum 
> test1300.test.com:2181,test1301.test.com:2181,test1302.test.com:2181 
> --zk-reg-path /registry/users/yarn/services/org-apache-slider/hbase1 > 
> /slider-agent.out 2>&1 ;
> logLink: 
> http://test1304.test.com:8042/node/containerlogs/container_e14_1490326153792_0075_01_03/yarn
> }, 
> HBASE_MASTER={container_e14_1490326153792_0075_01_02=container_e14_1490326153792_0075_01_02:
>  3
> state: 3
> role: HBASE_MASTER
> host: test1307.test.com
> hostURL: http://test1307.test.com:8042
> command: python2.6 ./infra/agent/slider-agent/agent/main.py --label 
> container_e14_1490326153792_0075_01_02___HBASE_MASTER --zk-quorum 
> test1300.test.com:2181,test1301.test.com:2181,test1302.test.com:2181 
> --zk-reg-path /registry/users/yarn/services/org-apache-slider/hbase1 > 
> /slider-agent.out 2>&1 ;
> logLink: 
> http://test1307.test.com:8042/node/containerlogs/container_e14_1490326153792_0075_01_02/yarn
> }, 
> slider-appmaster={container_e14_1490326153792_0075_01_01=container_e14_1490326153792_0075_01_01:
>  3
> state: 3
> role: slider-appmaster
> host: test1309.test.com
> hostURL: http://test1309.test.com:42842
> }}
> 2017-05-11 22:03:13,707 [1942998487@qtp-579805167-113] INFO  state.AppState - 
> app state clusterNodes 
> {HBASE_REGIONSERVER={container_e14_1490326153792_0075_01_05=container_e14_1490326153792_0075_01_05:
>  3
> state: 3
> role: HBASE_REGIONSERVER
> host: test1304.test.com
> hostURL: http://test1304.test.com:8042
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (SLIDER-1224) the outstanding request that has been escalated is failed to cleanup

2017-05-12 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1224.

   Resolution: Fixed
Fix Version/s: Slider 1.0.0

> the outstanding request that has been escalated is failed to cleanup
> 
>
> Key: SLIDER-1224
> URL: https://issues.apache.org/jira/browse/SLIDER-1224
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1224.001.patch, SLIDER-1224.002.patch
>
>
> There is a Slider app whose placement policy is normal (0).
> If an outstanding request is escalated, the container can be allocated to
> a node other than the desired one.
> But when the container is allocated to another node, the outstanding request
> is kept rather than removed.
> I think this is the same issue as SLIDER-1104.
> It can still happen with slider-0.92.





[jira] [Commented] (SLIDER-1228) flooding Slider-AM log for "app state clusterNodes"

2017-05-12 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008303#comment-16008303
 ] 

Billie Rinaldi commented on SLIDER-1228:


Thanks for the patch, [~kyungwan nam]!

> flooding Slider-AM log for "app state clusterNodes"
> ---
>
> Key: SLIDER-1228
> URL: https://issues.apache.org/jira/browse/SLIDER-1228
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Attachments: SLIDER-1228.001.patch
>
>


[jira] [Assigned] (SLIDER-1228) flooding Slider-AM log for "app state clusterNodes"

2017-05-12 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi reassigned SLIDER-1228:
--

Assignee: kyungwan nam

> flooding Slider-AM log for "app state clusterNodes"
> ---
>
> Key: SLIDER-1228
> URL: https://issues.apache.org/jira/browse/SLIDER-1228
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Attachments: SLIDER-1228.001.patch
>
>


[jira] [Resolved] (SLIDER-1104) failed to track a outstanding request which was escalated

2017-05-11 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1104.

Resolution: Duplicate

> failed to track a outstanding request which was escalated
> -
>
> Key: SLIDER-1104
> URL: https://issues.apache.org/jira/browse/SLIDER-1104
> Project: Slider
> Issue Type: Bug
> Components: appmaster
> Affects Versions: Slider 0.81
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Attachments: SLIDER-1104.patch
>
>
> When an outstanding request is escalated, a new container request with
> relaxed placement and a changed priority (for the same node as the original
> one) is sent.
> Because of the relaxed placement, the RM can allocate a container on a node
> that was not specified in the container request.
> However, an outstanding request is considered allocated only when the node of
> the allocated container is the same as the node in the outstanding request.
> As a result, the AM fails to track the outstanding request, and the Slider
> AM keeps a request that has already been allocated.
> Here is the Slider AM log from when I hit this problem.
> {code}
> 2016-03-27 11:14:21,225 [AMRM Callback Handler Thread] INFO  
> appmaster.SliderAppMaster - onContainersAllocated(1)
> 2016-03-27 11:14:21,225 [AMRM Callback Handler Thread] DEBUG state.AppState - 
> onContainersAllocated(): Total containers allocated = 1
> 2016-03-27 11:14:21,226 [AMRM Callback Handler Thread] DEBUG 
> state.OutstandingRequestTracker - Processing allocation for role 1  on 
> ContainerID=container_e14_1458884021812_0006_01_04 
> nodeID=n1.mycompany.com:45454 http=n1.mycompany.com:8042 priority=1073741825 
> resource=
> 2016-03-27 11:14:21,226 [AMRM Callback Handler Thread] WARN  
> state.OutstandingRequestTracker - No open request found for container 
> ContainerID=container_e14_1458884021812_0006_01_04 
> nodeID=n1.mycompany.com:45454 http=n1.mycompany.com:8042 priority=1073741825 
> resource=, outstanding queue has 0 entries 
> 2016-03-27 11:14:21,226 [AMRM Callback Handler Thread] INFO  
> state.RoleHistory - Adding 1 hosts for role 1
> 2016-03-27 11:14:21,227 [AMRM Callback Handler Thread] WARN  state.AppState - 
> Unexpected allocation of container 
> ContainerID=container_e14_1458884021812_0006_01_04 
> nodeID=n1.mycompany.com:45454 http=n1.mycompany.com:8042 priority=1073741825 
> resource=
> 2016-03-27 11:14:21,227 [AMRM Callback Handler Thread] INFO  state.AppState - 
> Assigning role HBASE_MASTER to container 
> container_e14_1458884021812_0006_01_04, on n1.mycompany.com:45454,
> {code}





[jira] [Commented] (SLIDER-1224) the outstanding request that has been escalated is failed to cleanup

2017-05-11 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006856#comment-16006856
 ] 

Billie Rinaldi commented on SLIDER-1224:


[~kyungwan nam], thanks for the patch! This looks great, so I will close 
SLIDER-1104 as a duplicate of this. I took your test from SLIDER-1104 and 
expanded it a bit to test this patch, and I found one more thing that needs to 
be changed:
https://github.com/apache/incubator-slider/blob/develop/slider-core/src/main/java/org/apache/slider/server/appmaster/state/OutstandingRequestTracker.java#L179

The outcome is set to open when a request is found in openRequests. We should 
add a check here that says if the request mayEscalate and isEscalated, the 
outcome should be set to escalated. This is similar to what is done when a 
request is found in placedRequests, but the mayEscalate property needs to be 
checked as well.
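
A minimal sketch of the suggested check, for illustration only: `AllocationOutcome`, `TrackedRequest`, and `outcomeForOpenRequest` are hypothetical stand-ins written for this note and do not reproduce the actual code in OutstandingRequestTracker.

```java
// Hypothetical stand-in for the outcome values discussed above.
enum AllocationOutcome { Open, Placed, Escalated }

// Hypothetical stand-in for an outstanding request.
class TrackedRequest {
    private final boolean mayEscalate;
    private final boolean escalated;

    TrackedRequest(boolean mayEscalate, boolean escalated) {
        this.mayEscalate = mayEscalate;
        this.escalated = escalated;
    }

    boolean mayEscalate() { return mayEscalate; }
    boolean isEscalated() { return escalated; }
}

class OutcomeSketch {
    // Decide the outcome for a request that was found among the open requests.
    static AllocationOutcome outcomeForOpenRequest(TrackedRequest request) {
        if (request.mayEscalate() && request.isEscalated()) {
            // The request started out placed and was relaxed later.
            return AllocationOutcome.Escalated;
        }
        return AllocationOutcome.Open;
    }

    public static void main(String[] args) {
        // prints "Escalated"
        System.out.println(outcomeForOpenRequest(new TrackedRequest(true, true)));
        // prints "Open"
        System.out.println(outcomeForOpenRequest(new TrackedRequest(false, false)));
    }
}
```

Checking both flags keeps a request that was open from the start reporting as open, while one that reached the open list through escalation reports as escalated.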

Please also include the test in your patch:
{noformat}
diff --git a/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/history/TestRoleHistoryOutstandingRequestTracker.groovy b/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/history/TestRoleHistoryOutstandingRequestTracker.groovy
index 7be01ad..574ded5 100644
--- a/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/history/TestRoleHistoryOutstandingRequestTracker.groovy
+++ b/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/history/TestRoleHistoryOutstandingRequestTracker.groovy
@@ -143,6 +143,52 @@ class TestRoleHistoryOutstandingRequestTracker extends BaseMockAppStateTest  {
   }
 
   @Test
+  public void testIssuedEscalatedRequest() throws Throwable {
+    def req1 = tracker.newRequest(host1, 0)
+    def resource = factory.newResource()
+    resource.virtualCores = 1
+    resource.memory = 48;
+    def yarnRequest = req1.buildContainerRequest(resource, role0Status, 0)
+    assert tracker.listPlacedRequests().size() == 1
+    assert tracker.listOpenRequests().size() == 0
+
+    tracker.escalateOutstandingRequests(role0Status.placementTimeoutSeconds * 1000)
+    assert !req1.isEscalated()
+    assert tracker.listPlacedRequests().size() == 1
+    assert tracker.listOpenRequests().size() == 0
+
+    tracker.escalateOutstandingRequests(role0Status.placementTimeoutSeconds * 1000 + 1)
+    assert req1.isEscalated()
+    assert tracker.listPlacedRequests().size() == 0
+    assert tracker.listOpenRequests().size() == 1
+
+    def c1 = factory.newContainer()
+
+    def nodeId = factory.newNodeId()
+    c1.nodeId = nodeId
+    // if request was escalated, container can be allocated to another host
+    // by relaxed placement.
+    nodeId.host = "host9"
+
+    def pri = ContainerPriority.buildPriority(0, false)
+    assert pri > 0
+    c1.setPriority(new MockPriority(pri))
+
+    c1.setResource(resource)
+
+    def issued = req1.issuedRequest
+    assert issued.capability == resource
+    assert issued.priority.priority == c1.getPriority().getPriority()
+    assert req1.resourceRequirementsMatch(resource)
+
+    def allocation = tracker.onContainerAllocated(0, nodeId.host, c1)
+    assert tracker.listPlacedRequests().size() == 0
+    assert tracker.listOpenRequests().size() == 0
+    assert allocation.outcome == ContainerAllocationOutcome.Escalated;
+    assert allocation.origin.is(req1)
+  }
+
+  @Test
   public void testResetEntries() throws Throwable {
     tracker.newRequest(host1, 0)
     tracker.newRequest(host2, 0)
{noformat}

> the outstanding request that has been escalated is failed to cleanup
> 
>
> Key: SLIDER-1224
> URL: https://issues.apache.org/jira/browse/SLIDER-1224
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Attachments: SLIDER-1224.001.patch
>
>


[jira] [Assigned] (SLIDER-1224) the outstanding request that has been escalated is failed to cleanup

2017-05-11 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi reassigned SLIDER-1224:
--

Assignee: kyungwan nam

> the outstanding request that has been escalated is failed to cleanup
> 
>
> Key: SLIDER-1224
> URL: https://issues.apache.org/jira/browse/SLIDER-1224
> Project: Slider
> Issue Type: Bug
> Affects Versions: Slider 0.92
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Attachments: SLIDER-1224.001.patch
>
>


[jira] [Resolved] (SLIDER-1223) failed to connect between slider-agent and sliderAM

2017-04-24 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1223.

Resolution: Duplicate

I believe this is a duplicate of SLIDER-942, which has been resolved in 0.92.

> failed to connect between slider-agent and sliderAM
> ---
>
> Key: SLIDER-1223
> URL: https://issues.apache.org/jira/browse/SLIDER-1223
> Project: Slider
> Issue Type: Bug
> Reporter: kyungwan nam
>
> python-2.7.10 has been installed in my cluster.
> When I submit an application, the slider-agent does not seem to be able to
> connect to the Slider AM.
> The slider-agent logs are as follows.
> {code}
> INFO 2017-04-24 11:41:26,761 connection.py:573 - Connecting to zk01.com:2181
> INFO 2017-04-24 11:41:26,763 client.py:439 - Zookeeper connection 
> established, state: CONNECTED
> INFO 2017-04-24 11:41:26,766 connection.py:540 - Closing connection to 
> zk01.com:2181
> INFO 2017-04-24 11:41:26,766 client.py:443 - Zookeeper session lost, state: 
> CLOSED
> INFO 2017-04-24 11:41:26,767 Registry.py:69 - AM Host = my01.com, AM Secured 
> Port = 46081, ping port = 37738
> INFO 2017-04-24 11:41:26,767 main.py:291 - Connecting to the server at: 
> https://my01.com:37738/ws/v1/slider/agents/
> INFO 2017-04-24 11:41:26,767 NetUtil.py:67 - DEBUG: Trying to connect to the 
> server at https://my01.com:37738/ws/v1/slider/agents/
> INFO 2017-04-24 11:41:26,767 NetUtil.py:38 - Connecting to the following url 
> https://my01.com:37738/ws/v1/slider/agents/
> ERROR 2017-04-24 11:41:26,925 NetUtil.py:52 - [SSL: 
> CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
> ERROR 2017-04-24 11:41:26,925 NetUtil.py:54 - SSLError: Failed to connect. 
> Please check openssl library versions. 
> Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more 
> details.
> {code}





[jira] [Resolved] (SLIDER-1220) Some funtests fail when sasl security is configured for registry

2017-03-21 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1220.

Resolution: Fixed

> Some funtests fail when sasl security is configured for registry
> 
>
> Key: SLIDER-1220
> URL: https://issues.apache.org/jira/browse/SLIDER-1220
> Project: Slider
> Issue Type: Bug
> Components: test
> Affects Versions: Slider 0.92
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1220.1.patch
>
>
> The tests AppsUpgradeIT, AMConfigPublishingIT, and ExternalComponentIT are 
> not adding AM keytab and principal properties to all slider create / upgrade 
> commands when Kerberos is enabled.





[jira] [Updated] (SLIDER-1220) Some funtests fail when sasl security is configured for registry

2017-03-21 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1220:
---
Attachment: SLIDER-1220.1.patch

I am testing out the attached patch to address this issue.

> Some funtests fail when sasl security is configured for registry
> 
>
> Key: SLIDER-1220
> URL: https://issues.apache.org/jira/browse/SLIDER-1220
> Project: Slider
> Issue Type: Bug
> Components: test
> Affects Versions: Slider 0.92
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1220.1.patch
>
>


[jira] [Created] (SLIDER-1220) Some funtests fail when sasl security is configured for registry

2017-03-21 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1220:
--

Summary: Some funtests fail when sasl security is configured for registry
Key: SLIDER-1220
URL: https://issues.apache.org/jira/browse/SLIDER-1220
Project: Slider
Issue Type: Bug
Components: test
Affects Versions: Slider 0.92
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
Fix For: Slider 1.0.0


The tests AppsUpgradeIT, AMConfigPublishingIT, and ExternalComponentIT are not 
adding AM keytab and principal properties to all slider create / upgrade 
commands when Kerberos is enabled.





[jira] [Commented] (SLIDER-1209) Provide information on whether a slider app was killed / stopped via a request

2017-03-03 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894964#comment-15894964
 ] 

Billie Rinaldi commented on SLIDER-1209:


bq. Also, I ran all the current unit tests successfully with this patch.

So did I. :) +1

> Provide information on whether a slider app was killed / stopped via a request
> --
>
> Key: SLIDER-1209
> URL: https://issues.apache.org/jira/browse/SLIDER-1209
> Project: Slider
> Issue Type: Sub-task
> Components: appmaster, client
> Reporter: Siddharth Seth
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1209.01.patch
>
>
> I am adding a new enum SliderExitReason with the high level reason for an 
> application failure.
> For most of the cases it is difficult to decipher if the Slider app failed 
> due to an application error. This gap can be bridged a little better when we 
> get to SLIDER-1208.





[jira] [Commented] (SLIDER-1209) Provide information on whether a slider app was killed / stopped via a request

2017-03-03 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894861#comment-15894861
 ] 

Billie Rinaldi commented on SLIDER-1209:


Visual inspection of this code looks okay. I haven't had a chance to test it. 
Did you try it out, [~gsaha]?

> Provide information on whether a slider app was killed / stopped via a request
> --
>
> Key: SLIDER-1209
> URL: https://issues.apache.org/jira/browse/SLIDER-1209
> Project: Slider
> Issue Type: Sub-task
> Components: appmaster, client
> Reporter: Siddharth Seth
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1209.01.patch
>
>


[jira] [Commented] (SLIDER-1197) Provide information on pending allocations, and last allocation time, yarn mem available

2017-02-16 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870019#comment-15870019
 ] 

Billie Rinaldi commented on SLIDER-1197:


+1, this looks good to me.

> Provide information on pending allocations, and last allocation time, yarn 
> mem available
> 
>
> Key: SLIDER-1197
> URL: https://issues.apache.org/jira/browse/SLIDER-1197
> Project: Slider
> Issue Type: Sub-task
> Components: appmaster, client
> Reporter: Siddharth Seth
> Assignee: Gour Saha
> Priority: Critical
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1197.01.patch, SLIDER-1197.02.patch
>
>
> This helps with debugging misconfigured apps - where the cluster does not 
> have adequate capacity to launch the app





[jira] [Commented] (SLIDER-1197) Provide information on pending allocations, and last allocation time, yarn mem available

2017-02-15 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868557#comment-15868557
 ] 

Billie Rinaldi commented on SLIDER-1197:


[~gsaha], instead of modifying ProviderAppState and storing an instance of the 
RM client there, I would suggest creating a wrapper class called something like 
ResourceInformationGetter or RMClientAccessForAppState. This class would have a 
private variable which is set to the RM client and would have one getter method 
that calls amRmClientAsync.getAvailableResources and returns 
ResourceInformation. An instance of the wrapper class can be constructed in 
SliderAppMaster and passed in AppState's constructor, and then AppState can 
call the getter method in AppState.getApplicationLivenessInformation.
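
A minimal sketch of the wrapper described above, written without the Hadoop dependencies so it stands alone. The `Supplier` stands in for the call to amRmClientAsync.getAvailableResources, and the fields of `ResourceInformation`, the method name `getResourceInformation`, and the class `AppStateSketch` are assumptions made for illustration, not code from the patch.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the resource data reported by the RM. */
final class ResourceInformation {
    final long availableMemoryMb;
    final int availableVcores;

    ResourceInformation(long availableMemoryMb, int availableVcores) {
        this.availableMemoryMb = availableMemoryMb;
        this.availableVcores = availableVcores;
    }
}

/** Wrapper that hides the RM client behind a single getter. */
final class ResourceInformationGetter {
    // In Slider this field would hold the RM client; here a Supplier stands in
    // for the call to amRmClientAsync.getAvailableResources.
    private final Supplier<ResourceInformation> rmClient;

    ResourceInformationGetter(Supplier<ResourceInformation> rmClient) {
        this.rmClient = rmClient;
    }

    ResourceInformation getResourceInformation() {
        return rmClient.get();
    }
}

/** Simplified AppState that receives the wrapper through its constructor. */
final class AppStateSketch {
    private final ResourceInformationGetter resources;

    AppStateSketch(ResourceInformationGetter resources) {
        this.resources = resources;
    }

    /** Rough analogue of getApplicationLivenessInformation: reports RM headroom. */
    String livenessSummary() {
        ResourceInformation info = resources.getResourceInformation();
        return "availableMemoryMb=" + info.availableMemoryMb
            + ", availableVcores=" + info.availableVcores;
    }
}
```

With this shape AppState never sees the RM client itself, only the one value it needs, and a test can pass in a fixed supplier instead of a live client.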

> Provide information on pending allocations, and last allocation time, yarn 
> mem available
> 
>
> Key: SLIDER-1197
> URL: https://issues.apache.org/jira/browse/SLIDER-1197
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster, client
>Reporter: Siddharth Seth
>Priority: Critical
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1197.01.patch
>
>
> This helps with debugging misconfigured apps - where the cluster does not 
> have adequate capacity to launch the app





[jira] [Resolved] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-08 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1199.

Resolution: Fixed

> Blacklist nodes that exceed the node failure threshold for a role
> -
>
> Key: SLIDER-1199
> URL: https://issues.apache.org/jira/browse/SLIDER-1199
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1199.1.patch, SLIDER-1199.2.patch, 
> SLIDER-1199.3.patch, SLIDER-1199.4.patch, SLIDER-1199.5.patch
>
>
> From the code, it seems like when the node failure threshold for a role is 
> exceeded, that node is no longer suggested for placement. But there is 
> nothing preventing the RM from selecting the node again. If the node were 
> blacklisted, perhaps that would prevent new allocations on problem nodes.





[jira] [Updated] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-08 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1199:
---
Attachment: SLIDER-1199.5.patch

When doing a last round of testing I noticed a typo on one line of the previous 
patch (patch 2 was working, but I introduced this error in patch 3 after some 
failed experimentation with another approach). Patch 5 fixes this error, and I 
will commit this one now. 

> Blacklist nodes that exceed the node failure threshold for a role
> -
>
> Key: SLIDER-1199
> URL: https://issues.apache.org/jira/browse/SLIDER-1199
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1199.1.patch, SLIDER-1199.2.patch, 
> SLIDER-1199.3.patch, SLIDER-1199.4.patch, SLIDER-1199.5.patch
>
>
> From the code, it seems like when the node failure threshold for a role is 
> exceeded, that node is no longer suggested for placement. But there is 
> nothing preventing the RM from selecting the node again. If the node were 
> blacklisted, perhaps that would prevent new allocations on problem nodes.





[jira] [Resolved] (SLIDER-1205) Fix issues in uber apps and AM config generation

2017-02-07 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1205.

Resolution: Fixed

> Fix issues in uber apps and AM config generation
> 
>
> Key: SLIDER-1205
> URL: https://issues.apache.org/jira/browse/SLIDER-1205
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1205.1.patch, SLIDER-1205.2.patch
>
>
> Some portions of YARN-5701 apply to Slider and should be ported.





[jira] [Resolved] (SLIDER-1206) AgentFailuresIT and AgentFailures2IT failing due to increase in heartbeat loss interval

2017-02-07 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1206.

Resolution: Fixed

> AgentFailuresIT and AgentFailures2IT failing due to increase in heartbeat 
> loss interval
> ---
>
> Key: SLIDER-1206
> URL: https://issues.apache.org/jira/browse/SLIDER-1206
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1206.1.patch
>
>
> The increase in heartbeat lost interval is making some container failure 
> tests fail due to containers not failing as soon as expected.





[jira] [Resolved] (SLIDER-1204) Slider handles "per.component" for multiple components incorrectly

2017-02-07 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1204.

Resolution: Fixed

> Slider handles "per.component" for multiple components incorrectly
> --
>
> Key: SLIDER-1204
> URL: https://issues.apache.org/jira/browse/SLIDER-1204
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1204.1.patch
>
>
> Port YARN-5941 to Slider.





[jira] [Updated] (SLIDER-1206) AgentFailuresIT and AgentFailures2IT failing due to increase in heartbeat loss interval

2017-02-07 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1206:
---
Summary: AgentFailuresIT and AgentFailures2IT failing due to increase in 
heartbeat loss interval  (was: AgentFailuresIT and AgentFailures2IT due to 
increase in heartbeat loss interval)

> AgentFailuresIT and AgentFailures2IT failing due to increase in heartbeat 
> loss interval
> ---
>
> Key: SLIDER-1206
> URL: https://issues.apache.org/jira/browse/SLIDER-1206
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1206.1.patch
>
>
> The increase in heartbeat lost interval is making some container failure 
> tests fail due to containers not failing as soon as expected.





[jira] [Updated] (SLIDER-1205) Fix issues in uber apps

2017-02-07 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1205:
---
Attachment: SLIDER-1205.2.patch

Noticed I missed quoting the token in replaceAll.
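
For context, `String.replaceAll` treats its first argument as a regular expression and gives `$` and `\` special meaning in the replacement, so a literal token has to be quoted on both sides. The sketch below shows the general technique; the token `${USER}` and the helper name `replaceToken` are hypothetical and not taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenReplace {
    /**
     * Replaces every literal occurrence of token with value.
     * Pattern.quote stops the token from being parsed as a regex, and
     * Matcher.quoteReplacement stops '$' and '\' in the value from being
     * read as group references or escapes.
     */
    static String replaceToken(String text, String token, String value) {
        return text.replaceAll(Pattern.quote(token), Matcher.quoteReplacement(value));
    }
}
```

When no regex behaviour is wanted at all, `String.replace(CharSequence, CharSequence)` is an alternative that performs a literal replacement of every occurrence.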

> Fix issues in uber apps
> ---
>
> Key: SLIDER-1205
> URL: https://issues.apache.org/jira/browse/SLIDER-1205
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1205.1.patch, SLIDER-1205.2.patch
>
>
> Some portions of YARN-5701 apply to Slider and should be ported.





[jira] [Resolved] (SLIDER-1181) Keep Slider AM running during RM failure

2017-02-07 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1181.

Resolution: Fixed

> Keep Slider AM running during RM failure
> 
>
> Key: SLIDER-1181
> URL: https://issues.apache.org/jira/browse/SLIDER-1181
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1181.1.patch
>
>
> YARN-5944 and YARN-5996 made the native services AM more robust to temporary 
> RM failures. We should apply these to the Slider AM as well. YARN-5996 
> requires YARN change YARN-5999.





[jira] [Commented] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-07 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856345#comment-15856345
 ] 

Billie Rinaldi commented on SLIDER-1199:


[~gsaha], thanks for the review. I was thinking that this block should be 
synchronized with 
[SliderAppMaster#executeNodeReview|https://github.com/apache/incubator-slider/blob/develop/slider-core/src/main/java/org/apache/slider/server/appmaster/SliderAppMaster.java#L1948-L1966].
 That method calls appState.reviewRequestAndReleaseNodes (which is calling 
appState.updateBlacklist), and then executes all of the operations returned, 
which may include a blacklist operation. I think "creating a blacklist 
operation and executing it" should be synchronized.

Blacklist is a YARN concept. updateBlacklist, blacklistAdditions, and 
blacklistRemovals are taken directly from the YARN API. I think we should 
continue to use the same terminology that YARN does.
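
To illustrate the synchronization point, here is a minimal sketch in which creating a blacklist operation and executing it happen under one lock, so a concurrent caller cannot slip in between the two steps. The class `BlacklistCoordinator`, its methods, and the string-based "operation" are hypothetical simplifications; a real AM would call the YARN client's updateBlacklist where the sketch appends to a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class BlacklistCoordinator {
    private final Object lock = new Object();
    private final Set<String> failedNodes = new TreeSet<>();
    // Record of operations "sent to the RM", kept only so the sketch can be inspected.
    private final List<String> sentToRm = new ArrayList<>();

    void markFailed(String node) {
        synchronized (lock) {
            failedNodes.add(node);
        }
    }

    /** Creates the blacklist operation and executes it as one atomic step. */
    void reviewAndExecute() {
        synchronized (lock) {
            // Step 1: create the operation from the current state.
            String operation = "blacklist " + failedNodes;
            // Step 2: execute it while still holding the lock, so the state
            // cannot change between creation and execution.
            sentToRm.add(operation);
        }
    }

    List<String> sentOperations() {
        synchronized (lock) {
            return new ArrayList<>(sentToRm);
        }
    }
}
```

If the two steps were guarded separately, two threads could each build an operation from the same state and then execute them in the opposite order, leaving the RM with a stale blacklist.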

> Blacklist nodes that exceed the node failure threshold for a role
> -
>
> Key: SLIDER-1199
> URL: https://issues.apache.org/jira/browse/SLIDER-1199
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1199.1.patch, SLIDER-1199.2.patch, 
> SLIDER-1199.3.patch, SLIDER-1199.4.patch
>
>
> From the code, it seems like when the node failure threshold for a role is 
> exceeded, that node is no longer suggested for placement. But there is 
> nothing preventing the RM from selecting the node again. If the node were 
> blacklisted, perhaps that would prevent new allocations on problem nodes.





[jira] [Commented] (SLIDER-1205) Fix issues in uber apps

2017-02-06 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855113#comment-15855113
 ] 

Billie Rinaldi commented on SLIDER-1205:


Yes, that's right. Console is a better solution for password input than 
BufferedReader, but it doesn't work with the Slider python CLI script. That's 
why we had to use BufferedReader for Slider, but I was able to switch to 
Console for yarn-native-services.
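
A sketch of the trade-off, assuming the usual JDK behaviour that `System.console()` returns null when the JVM has no interactive terminal attached (for example when launched by a wrapper script with redirected streams; the exact behaviour varies between JDK versions). The class `PasswordPrompt` and its method names are hypothetical and not part of Slider.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

final class PasswordPrompt {
    /** Fallback path: reads one line from a reader. The input is echoed on a terminal. */
    static char[] readFromReader(Reader in) {
        try {
            String line = new BufferedReader(in).readLine();
            return line == null ? new char[0] : line.toCharArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Prefers Console (no echo, char[] result) and falls back to stdin. */
    static char[] readPassword(String prompt) {
        Console console = System.console();
        if (console != null) {
            char[] password = console.readPassword("%s", prompt);
            return password == null ? new char[0] : password;
        }
        System.err.print(prompt);
        return readFromReader(new InputStreamReader(System.in));
    }
}
```

Console is preferable where it is available because it suppresses echo and returns a `char[]` that can be cleared after use, whereas the BufferedReader path produces an immutable String.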

> Fix issues in uber apps
> ---
>
> Key: SLIDER-1205
> URL: https://issues.apache.org/jira/browse/SLIDER-1205
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1205.1.patch
>
>
> Some portions of YARN-5701 apply to Slider and should be ported.





[jira] [Commented] (SLIDER-1206) AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss interval

2017-02-06 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854828#comment-15854828
 ] 

Billie Rinaldi commented on SLIDER-1206:


Previously the value was hardcoded to 2 x the heartbeat monitor interval, so we 
can use that value for the tests.
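
As a small worked example of that relationship, the sketch below derives the old threshold from the monitor interval and checks whether a container would count as lost. The class and method names are hypothetical, and the strict "greater than" comparison is an assumption.

```java
final class HeartbeatLoss {
    /** Old behaviour described above: threshold fixed at twice the monitor interval. */
    static long hardcodedThresholdMs(long monitorIntervalMs) {
        return 2 * monitorIntervalMs;
    }

    /** A container counts as lost once its silence exceeds the threshold. */
    static boolean isLost(long lastHeartbeatMs, long nowMs, long thresholdMs) {
        return nowMs - lastHeartbeatMs > thresholdMs;
    }
}
```

A test that sets the configurable loss interval to `hardcodedThresholdMs(monitorInterval)` would see containers fail on the same schedule as before the default was raised.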

> AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss 
> interval
> ---
>
> Key: SLIDER-1206
> URL: https://issues.apache.org/jira/browse/SLIDER-1206
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1206.1.patch
>
>
> The increase in heartbeat lost interval is making some container failure 
> tests fail due to containers not failing as soon as expected.





[jira] [Updated] (SLIDER-1206) AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss interval

2017-02-06 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1206:
---
Attachment: SLIDER-1206.1.patch

> AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss 
> interval
> ---
>
> Key: SLIDER-1206
> URL: https://issues.apache.org/jira/browse/SLIDER-1206
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1206.1.patch
>
>
> The increase in heartbeat lost interval is making some container failure 
> tests fail due to containers not failing as soon as expected.





[jira] [Created] (SLIDER-1206) AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss interval

2017-02-06 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1206:
--

 Summary: AgentFailuresIT and AgentFailures2IT due to increase in 
heartbeat loss interval
 Key: SLIDER-1206
 URL: https://issues.apache.org/jira/browse/SLIDER-1206
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi


The increase in heartbeat lost interval is making some container failure tests 
fail due to containers not failing as soon as expected.





[jira] [Updated] (SLIDER-1206) AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss interval

2017-02-06 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1206:
---
Fix Version/s: Slider 1.0.0

> AgentFailuresIT and AgentFailures2IT due to increase in heartbeat loss 
> interval
> ---
>
> Key: SLIDER-1206
> URL: https://issues.apache.org/jira/browse/SLIDER-1206
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
>
> The increase in heartbeat lost interval is making some container failure 
> tests fail due to containers not failing as soon as expected.





[jira] [Updated] (SLIDER-1204) Slider handles "per.component" for multiple components incorrectly

2017-02-06 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1204:
---
Attachment: SLIDER-1204.1.patch

> Slider handles "per.component" for multiple components incorrectly
> --
>
> Key: SLIDER-1204
> URL: https://issues.apache.org/jira/browse/SLIDER-1204
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1204.1.patch
>
>
> Port YARN-5941 to Slider.





[jira] [Updated] (SLIDER-1205) Fix issues in uber apps

2017-02-06 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1205:
---
Attachment: SLIDER-1205.1.patch

> Fix issues in uber apps
> ---
>
> Key: SLIDER-1205
> URL: https://issues.apache.org/jira/browse/SLIDER-1205
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1205.1.patch
>
>
> Some portions of YARN-5701 apply to Slider and should be ported.





[jira] [Created] (SLIDER-1205) Fix issues in uber apps

2017-02-06 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1205:
--

 Summary: Fix issues in uber apps
 Key: SLIDER-1205
 URL: https://issues.apache.org/jira/browse/SLIDER-1205
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


Some portions of YARN-5701 apply to Slider and should be ported.





[jira] [Created] (SLIDER-1204) Slider handles "per.component" for multiple components incorrectly

2017-02-06 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1204:
--

 Summary: Slider handles "per.component" for multiple components 
incorrectly
 Key: SLIDER-1204
 URL: https://issues.apache.org/jira/browse/SLIDER-1204
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


Port YARN-5941 to Slider.





[jira] [Resolved] (SLIDER-1183) Slider AM should not kill application if onError is called

2017-02-03 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1183.

Resolution: Fixed

Resolved and opened SLIDER-1202.

> Slider AM should not kill application if onError is called
> --
>
> Key: SLIDER-1183
> URL: https://issues.apache.org/jira/browse/SLIDER-1183
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1183.1.patch, SLIDER-1183.2.patch, 
> SLIDER-1183.3.patch
>
>
> Slider AM should not kill the application if the onError callback occurs. 
> Once YARN-5999 is applied, the Slider AM can ignore onError as in YARN-5996.





[jira] [Created] (SLIDER-1202) Document SLIDER-1183 Hadoop version requirements

2017-02-03 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1202:
--

 Summary: Document SLIDER-1183 Hadoop version requirements
 Key: SLIDER-1202
 URL: https://issues.apache.org/jira/browse/SLIDER-1202
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Priority: Blocker
 Fix For: Slider 1.0.0


See SLIDER-1183.





[jira] [Updated] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-03 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1199:
---
Attachment: SLIDER-1199.4.patch

Fixed one mistake after comparing with patch 2.

> Blacklist nodes that exceed the node failure threshold for a role
> -
>
> Key: SLIDER-1199
> URL: https://issues.apache.org/jira/browse/SLIDER-1199
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1199.1.patch, SLIDER-1199.2.patch, 
> SLIDER-1199.3.patch, SLIDER-1199.4.patch
>
>
> From the code, it seems like when the node failure threshold for a role is 
> exceeded, that node is no longer suggested for placement. But there is 
> nothing preventing the RM from selecting the node again. If the node were 
> blacklisted, perhaps that would prevent new allocations on problem nodes.





[jira] [Commented] (SLIDER-1183) Slider AM should not kill application if onError is called

2017-02-03 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851612#comment-15851612
 ] 

Billie Rinaldi commented on SLIDER-1183:


Absolutely. I meant to do that.

> Slider AM should not kill application if onError is called
> --
>
> Key: SLIDER-1183
> URL: https://issues.apache.org/jira/browse/SLIDER-1183
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1183.1.patch, SLIDER-1183.2.patch
>
>
> Slider AM should not kill the application if the onError callback occurs. 
> Once YARN-5999 is applied, the Slider AM can ignore onError as in YARN-5996.





[jira] [Updated] (SLIDER-1183) Slider AM should not kill application if onError is called

2017-02-02 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1183:
---
Attachment: SLIDER-1183.2.patch

> Slider AM should not kill application if onError is called
> --
>
> Key: SLIDER-1183
> URL: https://issues.apache.org/jira/browse/SLIDER-1183
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1183.1.patch, SLIDER-1183.2.patch
>
>
> Slider AM should not kill the application if the onError callback occurs. 
> Once YARN-5999 is applied, the Slider AM can ignore onError as in YARN-5996.





[jira] [Commented] (SLIDER-1183) Slider AM should not kill application if onError is called

2017-02-02 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850563#comment-15850563
 ] 

Billie Rinaldi commented on SLIDER-1183:


[~gsaha], Thanks for the heads up! I did not realize there were some valid 
exceptions thrown there.

> Slider AM should not kill application if onError is called
> --
>
> Key: SLIDER-1183
> URL: https://issues.apache.org/jira/browse/SLIDER-1183
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1183.1.patch
>
>
> Slider AM should not kill the application if the onError callback occurs. 
> Once YARN-5999 is applied, the Slider AM can ignore onError as in YARN-5996.





[jira] [Updated] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-01 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1199:
---
Attachment: SLIDER-1199.2.patch

Here's one that updates the blacklist after the failure counts have been reset, 
keeping nodes from remaining blacklisted forever.
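
The idea can be sketched as follows, assuming the blacklist is recomputed from per-node failure counts and only the difference is reported. The class `NodeFailureTracker`, its methods, and the rule "count strictly greater than threshold" are hypothetical; the terms additions and removals follow the YARN blacklist vocabulary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NodeFailureTracker {
    /** Difference between the previous and the newly computed blacklist. */
    static final class BlacklistDelta {
        final Set<String> additions;
        final Set<String> removals;

        BlacklistDelta(Set<String> additions, Set<String> removals) {
            this.additions = additions;
            this.removals = removals;
        }
    }

    private final int threshold;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklisted = new TreeSet<>();

    NodeFailureTracker(int threshold) {
        this.threshold = threshold;
    }

    void recordFailure(String node) {
        failures.merge(node, 1, Integer::sum);
    }

    void resetFailureCounts() {
        failures.clear();
    }

    /** Recomputes the blacklist from the current counts and returns the change. */
    BlacklistDelta updateBlacklist() {
        Set<String> desired = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : failures.entrySet()) {
            if (entry.getValue() > threshold) {
                desired.add(entry.getKey());
            }
        }
        Set<String> additions = new TreeSet<>(desired);
        additions.removeAll(blacklisted);
        Set<String> removals = new TreeSet<>(blacklisted);
        removals.removeAll(desired);
        blacklisted.clear();
        blacklisted.addAll(desired);
        return new BlacklistDelta(additions, removals);
    }
}
```

Calling `updateBlacklist()` right after `resetFailureCounts()` is what produces the removals; without that second call a node that crossed the threshold once would stay blacklisted indefinitely.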

> Blacklist nodes that exceed the node failure threshold for a role
> -
>
> Key: SLIDER-1199
> URL: https://issues.apache.org/jira/browse/SLIDER-1199
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1199.1.patch, SLIDER-1199.2.patch
>
>
> From the code, it seems like when the node failure threshold for a role is 
> exceeded, that node is no longer suggested for placement. But there is 
> nothing preventing the RM from selecting the node again. If the node were 
> blacklisted, perhaps that would prevent new allocations on problem nodes.





[jira] [Updated] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-01 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1199:
---
Attachment: SLIDER-1199.1.patch

Submitting a preview patch for discussion. I am not very familiar with this 
area (neither Slider role placement nor RM blacklisting), so any ideas / 
comments would be appreciated. Still trying to figure out how to test this, as 
well. cc [~gsaha], [~ste...@apache.org]

> Blacklist nodes that exceed the node failure threshold for a role
> -
>
> Key: SLIDER-1199
> URL: https://issues.apache.org/jira/browse/SLIDER-1199
> Project: Slider
>  Issue Type: Bug
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1199.1.patch
>
>
> From the code, it seems like when the node failure threshold for a role is 
> exceeded, that node is no longer suggested for placement. But there is 
> nothing preventing the RM from selecting the node again. If the node were 
> blacklisted, perhaps that would prevent new allocations on problem nodes.





[jira] [Created] (SLIDER-1199) Blacklist nodes that exceed the node failure threshold for a role

2017-02-01 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1199:
--

 Summary: Blacklist nodes that exceed the node failure threshold 
for a role
 Key: SLIDER-1199
 URL: https://issues.apache.org/jira/browse/SLIDER-1199
 Project: Slider
  Issue Type: Bug
  Components: appmaster
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


From the code, it seems like when the node failure threshold for a role is 
exceeded, that node is no longer suggested for placement. But there is nothing 
preventing the RM from selecting the node again. If the node were blacklisted, 
perhaps that would prevent new allocations on problem nodes.





[jira] [Resolved] (SLIDER-1189) Agent never connects to new AM if AM restart takes too long

2017-01-25 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1189.

Resolution: Fixed

Thanks, [~gsaha].

> Agent never connects to new AM if AM restart takes too long
> ---
>
> Key: SLIDER-1189
> URL: https://issues.apache.org/jira/browse/SLIDER-1189
> Project: Slider
>  Issue Type: Bug
>  Components: agent
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1189.1.patch, SLIDER-1189.2.patch, 
> SLIDER-1189.3.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
> for a bit, then restarted the RM. The AM is restarted, but running agents 
> never connect to the new AM. The AM data is re-read from the ZK registry once 
> if the heartbeat retry threshold is reached, at which point the agent tries 
> re-registering with the AM. However, if the AM data is stale at that point, 
> it never re-reads the data from the ZK registry, and retries registering with 
> the nonexistent AM forever (until it is timed out due to heartbeat loss and 
> killed by the new AM).
> Note this happens when AM restart is delayed more than about a minute, which 
> can occur if the RM is down or the RM is up but busy.
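
One way to avoid the failure mode described above is to re-read the AM address every time the retry threshold is reached, rather than only once. The following is a sketch of that idea only, not the change made in the attached patches (the Slider agent itself is not written in Java); `AgentReconnect`, `registryLookup`, and `tryRegister` are hypothetical names.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

final class AgentReconnect {
    /**
     * Tries to register with the AM, re-reading the AM address from the
     * registry each time retryThreshold consecutive attempts have failed.
     * Returns the address that accepted the registration, or null if
     * maxAttempts is exhausted first.
     */
    static String register(Supplier<String> registryLookup,
                           Predicate<String> tryRegister,
                           int retryThreshold,
                           int maxAttempts) {
        String amAddress = registryLookup.get();
        int failuresSinceLookup = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryRegister.test(amAddress)) {
                return amAddress;
            }
            failuresSinceLookup++;
            if (failuresSinceLookup >= retryThreshold) {
                // The cached address may be stale, so look it up again
                // instead of retrying the same address forever.
                amAddress = registryLookup.get();
                failuresSinceLookup = 0;
            }
        }
        return null;
    }
}
```

Because the lookup repeats, an agent that first reads a stale registry entry still picks up the new AM once the entry is refreshed.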



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SLIDER-1189) Agent never connects to new AM

2017-01-25 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1189:
---
Attachment: SLIDER-1189.3.patch

> Agent never connects to new AM
> --
>
> Key: SLIDER-1189
> URL: https://issues.apache.org/jira/browse/SLIDER-1189
> Project: Slider
>  Issue Type: Bug
>  Components: agent
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1189.1.patch, SLIDER-1189.2.patch, 
> SLIDER-1189.3.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
> for a bit, then restarted the RM. The AM is restarted, but running agents 
> never connect to the new AM. The AM data is re-read from the ZK registry once 
> if the heartbeat retry threshold is reached, at which point the agent tries 
> re-registering with the AM. However, if the AM data is stale at that point, 
> it never re-reads the data from the ZK registry, and retries registering with 
> the nonexistent AM forever (until it is timed out due to heartbeat loss and 
> killed by the new AM).
> Note this happens when AM restart is delayed more than about a minute, which 
> can occur if the RM is down or the RM is up but busy.





[jira] [Resolved] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase default

2017-01-24 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1188.

Resolution: Fixed

> Make AM agent heartbeat loss configurable / increase default
> 
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1188.1.patch, SLIDER-1188.2.patch, 
> SLIDER-1188.3.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Commented] (SLIDER-1187) Create app diagnostics resource with placeholder for containers (live/dead)

2017-01-24 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836763#comment-15836763
 ] 

Billie Rinaldi commented on SLIDER-1187:


+1, patch 4 lgtm.

> Create app diagnostics resource with placeholder for containers (live/dead)
> ---
>
> Key: SLIDER-1187
> URL: https://issues.apache.org/jira/browse/SLIDER-1187
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster, client
>Affects Versions: Slider 0.91
>Reporter: Gour Saha
>Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1187.001.patch, SLIDER-1187.002.patch, 
> SLIDER-1187.003.patch, SLIDER-1187.004.patch
>
>
> This is a sample JSON structure of the proposed diagnostics resource -
> {code}
> {
>   "finalStatus": "SUCCEEDED",
>   "finalMessage": "stop command issued",
>   "containers": [
>     {
>       "containerId": "container_e3374_1485226679409_0016_01_04",
>       "component": "COMMAND_LOGGER",
>       "appVersion": "1.0.0",
>       "state": 3,
>       "exitCode": -1000,
>       "diagnostics": "",
>       "createTime": 1485285533968,
>       "startTime": 1485285533989,
>       "host": "cn008.l42scl.hortonworks.com",
>       "hostURL": "http://cn008.l42scl.hortonworks.com:8042",
>       "logLink": "http://cn007.l42scl.hortonworks.com:19888/jobhistory/logs/cn008.l42scl.hortonworks.com:45454/container_e3374_1485226679409_0016_01_04/ctx/root"
>     },
>     {
>       "containerId": "container_e3374_1485226679409_0016_01_03",
>       "component": "COMMAND_LOGGER",
>       "appVersion": "1.0.0",
>       "state": 3,
>       "exitCode": -1000,
>       "diagnostics": "",
>       "createTime": 1485285120456,
>       "startTime": 1485285120723,
>       "host": "cn005.l42scl.hortonworks.com",
>       "hostURL": "http://cn005.l42scl.hortonworks.com:8042",
>       "logLink": "http://cn007.l42scl.hortonworks.com:19888/jobhistory/logs/cn005.l42scl.hortonworks.com:45454/container_e3374_1485226679409_0016_01_03/ctx/root"
>     },
>     {
>       "containerId": "container_e3374_1485226679409_0016_01_02",
>       "component": "COMMAND_LOGGER",
>       "appVersion": "1.0.0",
>       "state": 4,
>       "exitCode": -100,
>       "diagnostics": "Container released by application",
>       "createTime": 1485285120464,
>       "startTime": 1485285120522,
>       "host": "cn008.l42scl.hortonworks.com",
>       "hostURL": "http://cn008.l42scl.hortonworks.com:8042",
>       "logLink": "http://cn007.l42scl.hortonworks.com:19888/jobhistory/logs/cn008.l42scl.hortonworks.com:45454/container_e3374_1485226679409_0016_01_02/ctx/root"
>     }
>   ]
> }
> {code}
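
As a rough illustration of how a client might consume a resource shaped like the sample above, the following Python sketch parses an abbreviated copy of that JSON and groups container IDs by exit code. The helper name summarize_containers is hypothetical and is not part of Slider. The issue does not say what the numeric "state" values mean, so the sketch does not interpret them.

```python
import json
from collections import defaultdict

# Abbreviated copy of the sample document from the issue description
# (only the fields used below are kept).
SAMPLE = """
{
  "finalStatus": "SUCCEEDED",
  "finalMessage": "stop command issued",
  "containers": [
    {"containerId": "container_e3374_1485226679409_0016_01_04",
     "state": 3, "exitCode": -1000, "diagnostics": ""},
    {"containerId": "container_e3374_1485226679409_0016_01_03",
     "state": 3, "exitCode": -1000, "diagnostics": ""},
    {"containerId": "container_e3374_1485226679409_0016_01_02",
     "state": 4, "exitCode": -100,
     "diagnostics": "Container released by application"}
  ]
}
"""


def summarize_containers(document):
    """Group container IDs by exit code (hypothetical helper, not Slider API)."""
    by_exit_code = defaultdict(list)
    for container in document.get("containers", []):
        by_exit_code[container["exitCode"]].append(container["containerId"])
    return dict(by_exit_code)


diagnostics = json.loads(SAMPLE)
summary = summarize_containers(diagnostics)
for exit_code, container_ids in sorted(summary.items()):
    print(exit_code, container_ids)
```

Grouping by exit code is only one possible view; the same loop could key on "component" or "host" instead.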





[jira] [Commented] (SLIDER-1187) Create app diagnostics resource with placeholder for containers (live/dead)

2017-01-24 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836593#comment-15836593
 ] 

Billie Rinaldi commented on SLIDER-1187:


SliderAppMaster#storeContainerDiagnostics and AppState#getLiveLogLink are 
unused. Otherwise looks good; I am running a test now.

> Create app diagnostics resource with placeholder for containers (live/dead)
> ---
>
> Key: SLIDER-1187
> URL: https://issues.apache.org/jira/browse/SLIDER-1187
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster, client
>Affects Versions: Slider 0.91
>Reporter: Gour Saha
>Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1187.001.patch, SLIDER-1187.002.patch, 
> SLIDER-1187.003.patch
>
>





[jira] [Updated] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase default

2017-01-19 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1188:
---
Summary: Make AM agent heartbeat loss configurable / increase default  
(was: Make AM agent heartbeat loss configurable / increase defaults)

> Make AM agent heartbeat loss configurable / increase default
> 
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1188.1.patch, SLIDER-1188.2.patch, 
> SLIDER-1188.3.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Updated] (SLIDER-1189) Agent never connects to new AM

2017-01-19 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1189:
---
Description: 
In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
for a bit, then restarted the RM. The AM is restarted, but running agents never 
connect to the new AM. The AM data is re-read from the ZK registry once if the 
heartbeat retry threshold is reached, at which point the agent tries 
re-registering with the AM. However, if the AM data is stale at that point, the 
agent never re-reads the data from the ZK registry and retries registering with 
the nonexistent AM forever (until it is timed out due to heartbeat loss and 
killed by the new AM).

Note this happens when AM restart is delayed more than about a minute, which 
can occur if the RM is down or the RM is up but busy.
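
The failure mode described above can be sketched as follows. This is a simplified Python model, not the agent's actual code: lookup_am, send_heartbeat, and retry_threshold are hypothetical stand-ins for reading the AM address from the ZK registry, sending one heartbeat, and the heartbeat retry threshold. The refresh_every_time flag contrasts re-reading the registry only once with re-reading it each time the threshold is reached.

```python
def heartbeat_until_connected(lookup_am, send_heartbeat, retry_threshold,
                              max_attempts, refresh_every_time):
    """Return the attempt number on which a heartbeat succeeded, or None.

    lookup_am() -> current AM address as published in the registry.
    send_heartbeat(address) -> True if that AM answered.
    All names here are illustrative assumptions, not Slider identifiers.
    """
    am_address = lookup_am()
    failures = 0
    refreshed = False
    for attempt in range(1, max_attempts + 1):
        if send_heartbeat(am_address):
            return attempt
        failures += 1
        if failures >= retry_threshold and (refresh_every_time or not refreshed):
            # Threshold reached: re-read AM data and try the new address.
            am_address = lookup_am()
            refreshed = True
            failures = 0
    return None


def make_registry():
    """Simulate an AM restart that is published only from the third lookup on."""
    lookups = {"count": 0}

    def lookup_am():
        lookups["count"] += 1
        return "new-am:1234" if lookups["count"] >= 3 else "old-am:1234"

    return lookup_am


def only_new_am_answers(address):
    return address == "new-am:1234"


# Re-reading once leaves the agent pointed at the stale address.
stuck = heartbeat_until_connected(make_registry(), only_new_am_answers,
                                  retry_threshold=2, max_attempts=20,
                                  refresh_every_time=False)
# Re-reading on every threshold hit eventually finds the new AM.
recovered = heartbeat_until_connected(make_registry(), only_new_am_answers,
                                      retry_threshold=2, max_attempts=20,
                                      refresh_every_time=True)
print(stuck, recovered)
```

In this model the delayed publication of the new address plays the role of the delayed AM restart mentioned in the note above.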

  was:In testing RM and AM failure scenarios, I killed my RM, killed the AM, 
waited for a bit, then restarted the RM. The AM is restarted, but running 
agents never connect to the new AM. The AM data is re-read from the ZK registry 
once if the heartbeat retry threshold is reached, at which point the agent 
tries re-registering with the AM. However, if the AM data is stale at that 
point, it never re-reads the data from the ZK registry, and retries registering 
with the nonexistent AM forever (until it is timed out due to heartbeat loss 
and killed by the new AM).


> Agent never connects to new AM
> --
>
> Key: SLIDER-1189
> URL: https://issues.apache.org/jira/browse/SLIDER-1189
> Project: Slider
>  Issue Type: Bug
>  Components: agent
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1189.1.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
> for a bit, then restarted the RM. The AM is restarted, but running agents 
> never connect to the new AM. The AM data is re-read from the ZK registry once 
> if the heartbeat retry threshold is reached, at which point the agent tries 
> re-registering with the AM. However, if the AM data is stale at that point, 
> it never re-reads the data from the ZK registry, and retries registering with 
> the nonexistent AM forever (until it is timed out due to heartbeat loss and 
> killed by the new AM).
> Note this happens when AM restart is delayed more than about a minute, which 
> can occur if the RM is down or the RM is up but busy.





[jira] [Updated] (SLIDER-1189) Agent never connects to new AM

2017-01-19 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1189:
---
Priority: Critical  (was: Major)

> Agent never connects to new AM
> --
>
> Key: SLIDER-1189
> URL: https://issues.apache.org/jira/browse/SLIDER-1189
> Project: Slider
>  Issue Type: Bug
>  Components: agent
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1189.1.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
> for a bit, then restarted the RM. The AM is restarted, but running agents 
> never connect to the new AM. The AM data is re-read from the ZK registry once 
> if the heartbeat retry threshold is reached, at which point the agent tries 
> re-registering with the AM. However, if the AM data is stale at that point, 
> it never re-reads the data from the ZK registry, and retries registering with 
> the nonexistent AM forever (until it is timed out due to heartbeat loss and 
> killed by the new AM).





[jira] [Updated] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase defaults

2017-01-19 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1188:
---
Attachment: SLIDER-1188.3.patch

Patch with AM changes only.

> Make AM agent heartbeat loss configurable / increase defaults
> -
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1188.1.patch, SLIDER-1188.2.patch, 
> SLIDER-1188.3.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Updated] (SLIDER-1189) Agent never connects to new AM

2017-01-19 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1189:
---
Attachment: SLIDER-1189.1.patch

> Agent never connects to new AM
> --
>
> Key: SLIDER-1189
> URL: https://issues.apache.org/jira/browse/SLIDER-1189
> Project: Slider
>  Issue Type: Bug
>  Components: agent
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1189.1.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
> for a bit, then restarted the RM. The AM is restarted, but running agents 
> never connect to the new AM. The AM data is re-read from the ZK registry once 
> if the heartbeat retry threshold is reached, at which point the agent tries 
> re-registering with the AM. However, if the AM data is stale at that point, 
> it never re-reads the data from the ZK registry, and retries registering with 
> the nonexistent AM forever (until it is timed out due to heartbeat loss and 
> killed by the new AM).





[jira] [Created] (SLIDER-1189) Agent never connects to new AM

2017-01-19 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1189:
--

 Summary: Agent never connects to new AM
 Key: SLIDER-1189
 URL: https://issues.apache.org/jira/browse/SLIDER-1189
 Project: Slider
  Issue Type: Bug
  Components: agent
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
for a bit, then restarted the RM. The AM is restarted, but running agents never 
connect to the new AM. The AM data is re-read from the ZK registry once if the 
heartbeat retry threshold is reached, at which point the agent tries 
re-registering with the AM. However, if the AM data is stale at that point, the 
agent never re-reads the data from the ZK registry and retries registering with 
the nonexistent AM forever (until it is timed out due to heartbeat loss and 
killed by the new AM).





[jira] [Commented] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase defaults

2017-01-19 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830275#comment-15830275
 ] 

Billie Rinaldi commented on SLIDER-1188:


Upon further testing, this retry threshold is not doing what I thought it was. 
It does not cap the total number of retries; it sets the number of retries to 
perform before rereading AM data from the ZK registry and re-registering with 
the AM. I think I'll have to do more testing to figure out how to improve 
things on the agent side. We can go ahead with the AM improvements here.

> Make AM agent heartbeat loss configurable / increase defaults
> -
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: SLIDER-1188.1.patch, SLIDER-1188.2.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Updated] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase defaults

2017-01-19 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1188:
---
Attachment: SLIDER-1188.2.patch

The new patch reads the agent retry threshold from the agent config.

> Make AM agent heartbeat loss configurable / increase defaults
> -
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: SLIDER-1188.1.patch, SLIDER-1188.2.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Commented] (SLIDER-1187) Create app diagnostics resource with placeholder for containers (live/dead)

2017-01-18 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828717#comment-15828717
 ] 

Billie Rinaldi commented on SLIDER-1187:


Nice patch, [~gsaha]. I think this diagnostic information will be very helpful. 
Comments below.

Since the other object in ClusterDescription is called 
"ApplicationLivenessInformation," perhaps the new class should be named 
"ApplicationDiagnostics." Not a big deal if you want to keep it as 
AppDiagnostics, though.

Is the newline needed in SliderClient printDiagnosticContainers? I don't think 
\n will work for Windows.

I think that storeContainerDiagnostics should be a method of AppState, and 
AppState can handle setting the logLink internally rather than having the AM 
pass it as a parameter. Most of the new methods in SliderAppMaster can be 
handled in AppState.

In onContainerStopped, the AM has asked the container to stop, not the RM. I 
don't think the container diagnostics should be updated for the NMClientAsync 
callbacks onContainerStopped, onGetContainerStatusError, or 
onStopContainerError. (The AM never calls nmClientAsync.stopContainer, so I 
don't think it will ever see onContainerStopped or onStopContainerError. In any 
case, it should receive container completion info from the RM. For 
onGetContainerStatusError, I don't think that callback necessarily means 
anything about the state of the container.) It makes sense to update 
diagnostics for onStartContainerError and onContainerStatusReceived, as well as 
the RM callback onContainersCompleted. I don't think the live log link needs to 
be updated in onContainerStatusReceived, because AppState is setting that when 
the ContainerInformation is created.

Do you think this will cause memory issues for long-lived AMs?
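
To make the memory question concrete, here is a minimal Python sketch of one way stored diagnostics could be bounded, assuming it is acceptable to drop the oldest entries. The class BoundedDiagnosticsStore and its max_entries parameter are hypothetical; nothing here is taken from the patch.

```python
from collections import OrderedDict


class BoundedDiagnosticsStore:
    """Keep diagnostics for at most max_entries containers (illustrative only)."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def store(self, container_id, diagnostics):
        # Re-storing an existing ID moves it to the newest position.
        self._entries.pop(container_id, None)
        self._entries[container_id] = diagnostics
        while len(self._entries) > self._max_entries:
            # Evict the oldest entry so memory use stays bounded.
            self._entries.popitem(last=False)

    def container_ids(self):
        return list(self._entries)


store = BoundedDiagnosticsStore(max_entries=2)
store.store("container_01", "Container released by application")
store.store("container_02", "")
store.store("container_03", "")
kept = store.container_ids()
print(kept)
```

Whether a count-based cap like this is appropriate depends on how much history the diagnostics resource is expected to expose.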

> Create app diagnostics resource with placeholder for containers (live/dead)
> ---
>
> Key: SLIDER-1187
> URL: https://issues.apache.org/jira/browse/SLIDER-1187
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster, client
>Affects Versions: Slider 0.91
>Reporter: Gour Saha
>Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1187.001.patch
>
>






[jira] [Commented] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase defaults

2017-01-17 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827178#comment-15827178
 ] 

Billie Rinaldi commented on SLIDER-1188:


[~gsaha], I am still testing this patch, but could you take a look and let me 
know what you think?

> Make AM agent heartbeat loss configurable / increase defaults
> -
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: SLIDER-1188.1.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Updated] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase defaults

2017-01-17 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1188:
---
Attachment: SLIDER-1188.1.patch

> Make AM agent heartbeat loss configurable / increase defaults
> -
>
> Key: SLIDER-1188
> URL: https://issues.apache.org/jira/browse/SLIDER-1188
> Project: Slider
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: SLIDER-1188.1.patch
>
>
> Currently containers are marked as lost after a couple of minutes, which is 
> too sensitive for a busy cluster. We should increase the defaults and make 
> the container timeout configurable. We may also want to increase the number 
> of times the agent will retry heartbeating to the AM.





[jira] [Created] (SLIDER-1188) Make AM agent heartbeat loss configurable / increase defaults

2017-01-17 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1188:
--

 Summary: Make AM agent heartbeat loss configurable / increase 
defaults
 Key: SLIDER-1188
 URL: https://issues.apache.org/jira/browse/SLIDER-1188
 Project: Slider
  Issue Type: Bug
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi


Currently containers are marked as lost after a couple of minutes, which is too 
sensitive for a busy cluster. We should increase the defaults and make the 
container timeout configurable. We may also want to increase the number of 
times the agent will retry heartbeating to the AM.
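
A configurable container timeout of the kind proposed above might look roughly like this. The Python sketch is illustrative only: find_lost_containers, the timeout_seconds parameter, and the default value are assumptions chosen for the example, not Slider's actual option names or defaults.

```python
# Illustrative default only; not the value used by Slider.
DEFAULT_HEARTBEAT_LOST_SECONDS = 600


def find_lost_containers(last_heartbeat, now,
                         timeout_seconds=DEFAULT_HEARTBEAT_LOST_SECONDS):
    """Return IDs of containers whose last heartbeat is older than the timeout.

    last_heartbeat maps container ID -> time of the last heartbeat (seconds).
    """
    return sorted(container_id
                  for container_id, seen in last_heartbeat.items()
                  if now - seen > timeout_seconds)


now = 1000.0
heartbeats = {
    "container_a": now - 30,
    "container_b": now - 200,
    "container_c": now - 900,
}
# A short timeout flags a container that is merely slow on a busy cluster.
strict = find_lost_containers(heartbeats, now, timeout_seconds=120)
# A larger default tolerates longer gaps between heartbeats.
relaxed = find_lost_containers(heartbeats, now)
print(strict, relaxed)
```

Passing the timeout as a parameter is what makes the threshold tunable per cluster rather than fixed in code.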





[jira] [Resolved] (SLIDER-1182) Slider AM should wait forever for RM

2016-12-22 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi resolved SLIDER-1182.

Resolution: Fixed

> Slider AM should wait forever for RM
> 
>
> Key: SLIDER-1182
> URL: https://issues.apache.org/jira/browse/SLIDER-1182
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1182.1.patch
>
>
> This ticket is for applying YARN-5944 to the Slider AM.





[jira] [Updated] (SLIDER-1183) Slider AM should not kill application if onError is called

2016-12-22 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1183:
---
Attachment: SLIDER-1183.1.patch

> Slider AM should not kill application if onError is called
> --
>
> Key: SLIDER-1183
> URL: https://issues.apache.org/jira/browse/SLIDER-1183
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1183.1.patch
>
>
> Slider AM should not kill the application if the onError callback occurs. 
> Once YARN-5999 is applied, the Slider AM can ignore onError as in YARN-5996.





[jira] [Updated] (SLIDER-1182) Slider AM should wait forever for RM

2016-12-22 Thread Billie Rinaldi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated SLIDER-1182:
---
Attachment: SLIDER-1182.1.patch

> Slider AM should wait forever for RM
> 
>
> Key: SLIDER-1182
> URL: https://issues.apache.org/jira/browse/SLIDER-1182
> Project: Slider
>  Issue Type: Sub-task
>  Components: appmaster
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1182.1.patch
>
>
> This ticket is for applying YARN-5944 to the Slider AM.





[jira] [Created] (SLIDER-1183) Slider AM should not kill application if onError is called

2016-12-22 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1183:
--

 Summary: Slider AM should not kill application if onError is called
 Key: SLIDER-1183
 URL: https://issues.apache.org/jira/browse/SLIDER-1183
 Project: Slider
  Issue Type: Sub-task
  Components: appmaster
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


Slider AM should not kill the application if the onError callback occurs. Once 
YARN-5999 is applied, the Slider AM can ignore onError as in YARN-5996.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

