[jira] [Commented] (AMBARI-24123) Grafana System-Servers Dashboard Disk IO/IOPS shows negative values
[ https://issues.apache.org/jira/browse/AMBARI-24123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515718#comment-16515718 ] David F. Quiroga commented on AMBARI-24123: --- Possibly related to AMBARI-23008, completed in AMBARI-23932 > Grafana System-Servers Dashboard Disk IO/IOPS shows negative values > --- > > Key: AMBARI-24123 > URL: https://issues.apache.org/jira/browse/AMBARI-24123 > Project: Ambari > Issue Type: Bug > Components: ambari-metrics >Affects Versions: 2.6.2 > Environment: HDP 2.6.4.0-91 > Ambari 2.6.2.0 > SLES 12 > > Side note: we did not observe this in HDP 2.5.3, Ambari 2.5.2 >Reporter: David F. Quiroga >Priority: Trivial > Attachments: disk_stat_2hr.jpg, disk_stat_6hr.jpg > > > Grafana > System-Servers Dashboard > Charts > Disk IO - Read Bytes, Write Bytes, Disk IOPS - Read Count, Write > Count > All display negative values for one or more hosts when viewed over a large > time period (6+ hrs). > Attached screenshots are for a single host. The 6hr view shows negative values, but when > "zoomed in" on the same time period (2hr) there are no negative values. > We therefore suspect it relates to aggregation of metrics rather than > collection itself. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AMBARI-24123) Grafana System-Servers Dashboard Disk IO/IOPS shows negative values
David F. Quiroga created AMBARI-24123: - Summary: Grafana System-Servers Dashboard Disk IO/IOPS shows negative values Key: AMBARI-24123 URL: https://issues.apache.org/jira/browse/AMBARI-24123 Project: Ambari Issue Type: Bug Components: ambari-metrics Affects Versions: 2.6.2 Environment: HDP 2.6.4.0-91 Ambari 2.6.2.0 SLES 12 Side note: we did not observe this in HDP 2.5.3, Ambari 2.5.2 Reporter: David F. Quiroga Attachments: disk_stat_2hr.jpg, disk_stat_6hr.jpg Grafana > System-Servers Dashboard Charts > Disk IO - Read Bytes, Write Bytes, Disk IOPS - Read Count, Write Count All display negative values for one or more hosts when viewed over a large time period (6+ hrs). Attached screenshots are for a single host. The 6hr view shows negative values, but when "zoomed in" on the same time period (2hr) there are no negative values. We therefore suspect it relates to aggregation of metrics rather than collection itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AMBARI-24079) Configuring Storm for Supervision
[ https://issues.apache.org/jira/browse/AMBARI-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-24079: -- Description: [Configuring Storm for Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html] Currently under [{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts] there are {{supervisor.py}} and {{supervisor_prod.py}} (similar for Nimbus); when configuring Storm for supervision you update the {{metainfo.xml}} to reference the {{_prod.py}} files. During a recent cluster upgrade ({{metainfo.xml}} changes were lost) we looked at combining the two files, so the scripts check for supervision support and use it when available. The "decision" to be supervised then occurs at the node level, and therefore can be managed at the node level rather than at the service/whole-cluster level. Currently we perform a basic check (shown below) for support before each action (start, stop, status). A better way might be to do a conditional import. 
{code}
def component_supported(component_name):
    return_code, output = shell.call(("supervisorctl", "status", format("storm-{component_name}")))
    if return_code == 0 and 'ERROR' not in output:
        # return code of 0 if program installed and component configured
        return True
    else:
        # return code of non 0 if program not installed or component not configured
        return False
{code}

was: [Configuring Storm for Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html] Currently under [{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts] there are {{supervisor.py}} and {{supervisor_prod.py}} (similar for Nimbus) when configuring Storm for supervision you update the {{metainfo.xml}} to reference the {{_prod.py}} files. During a recent cluster upgrade ({{metainfo.xml}} changes lost) we looked at combining the two files, so the scripts check for supervision support and use it when available. The "decision" to be supervised then occurs at the node level, and therefore can be managed at the node-level rather than at the service/whole-cluster level. Currently we perform a basic check (shown below) for support before each action (start, stop, status). A better way might be to do a conditional import.

{code}
def component_supported(component_name):
    return_code, output = shell.call(("supervisorctl", "status", format("storm-{component_name}")))
    if return_code == 0 and 'ERROR' not in output:
        # return code of 0 if program installed and component configured
        return True
    else:
        # return code of non 0 if program not installed or component not configured
        return False
{code}

> Configuring Storm for Supervision > -- > > Key: AMBARI-24079 > URL: https://issues.apache.org/jira/browse/AMBARI-24079 > Project: Ambari > Issue Type: Improvement > Components: ambari-server >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Trivial > > [Configuring Storm for > Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html] > Currently under > > [{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts] > there are {{supervisor.py}} and {{supervisor_prod.py}} (similar for Nimbus) > when configuring Storm for supervision you update the {{metainfo.xml}} to > reference the {{_prod.py}} files. > During a recent cluster upgrade ({{metainfo.xml}} changes lost) we looked at > combining the two files, so the scripts check for supervision support and use > it when available. > The "decision" to be supervised then occurs at the node level, and therefore > can be managed at the node-level rather than at the service/whole-cluster > level. > > Currently we perform a basic check (shown below) for support before each > action (start, stop, status). A better way might be to do a conditional > import.
> {code}
> def component_supported(component_name):
>     return_code, output = shell.call(("supervisorctl", "status", format("storm-{component_name}")))
>     if return_code == 0 and 'ERROR' not in output:
>         # return code of 0 if program installed and component configured
>         return True
>     else:
>         # return code of non 0 if program not installed or component not configured
>         return False
> {code}
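The conditional import suggested above might look roughly like the sketch below. It uses plain `subprocess` in place of Ambari's `shell.call`, and the commented import lines assume module names matching the existing `supervisor.py`/`supervisor_prod.py` scripts; treat all of it as illustrative rather than the actual patch.

```python
import subprocess

def supervision_available(component_name):
    """Best-effort probe: does supervisord manage storm-<component_name>?"""
    try:
        result = subprocess.run(
            ["supervisorctl", "status", "storm-%s" % component_name],
            capture_output=True, text=True)
    except FileNotFoundError:
        # supervisorctl is not installed on this node at all
        return False
    # exit code 0 with no ERROR in the output means the component is configured
    return result.returncode == 0 and "ERROR" not in result.stdout

# The conditional import itself would then run at module load time
# (module names are assumptions mirroring the existing scripts):
# if supervision_available("supervisor"):
#     from supervisor_prod import Supervisor
# else:
#     from supervisor import Supervisor
```

With a probe like this, the start/stop/status actions pick the supervised code path per node, which is the node-level "decision" the description talks about.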
[jira] [Created] (AMBARI-24079) Configuring Storm for Supervision
David F. Quiroga created AMBARI-24079: - Summary: Configuring Storm for Supervision Key: AMBARI-24079 URL: https://issues.apache.org/jira/browse/AMBARI-24079 Project: Ambari Issue Type: Improvement Components: ambari-server Reporter: David F. Quiroga Assignee: David F. Quiroga [Configuring Storm for Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html] Currently under [{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts] there are {{supervisor.py}} and {{supervisor_prod.py}} (similar for Nimbus); when configuring Storm for supervision you update the {{metainfo.xml}} to reference the {{_prod.py}} files. During a recent cluster upgrade ({{metainfo.xml}} changes were lost) we looked at combining the two files, so the scripts check for supervision support and use it when available. The "decision" to be supervised then occurs at the node level, and therefore can be managed at the node level rather than at the service/whole-cluster level. Currently we perform a basic check (shown below) for support before each action (start, stop, status). A better way might be to do a conditional import.
{code}
def component_supported(component_name):
    return_code, output = shell.call(("supervisorctl", "status", format("storm-{component_name}")))
    if return_code == 0 and 'ERROR' not in output:
        # return code of 0 if program installed and component configured
        return True
    else:
        # return code of non 0 if program not installed or component not configured
        return False
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga resolved AMBARI-22642. --- Resolution: Fixed PR #1398 merged > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: easyfix, patch, pull-request-available > Fix For: 2.7.0 > > Attachments: ambari-22642.patch > > Original Estimate: 24h > Time Spent: 3h 10m > Remaining Estimate: 20h 50m > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, the LDAP connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system > (50+), which pushed us outside the "safe zone". > Fix: > Enable SSL connection pooling by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
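Applied concretely, the fix above amounts to appending the property to the Ambari server's JVM arguments. The snippet below is a hedged sketch: on a real server `ENV_FILE` would point at something like `/var/lib/ambari-server/ambari-env.sh` and the `AMBARI_JVM_ARGS` variable name is an assumption about the install layout; the demo writes to a scratch file instead.

```shell
# Demo on a scratch file; on a real server point ENV_FILE at the actual
# ambari-env.sh (path and variable name are assumptions, verify locally).
ENV_FILE="./ambari-env.demo.sh"
touch "$ENV_FILE"
OPT="-Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'"
# Append the pooling option to the JVM args exactly once (idempotent).
if ! grep -q 'connect.pool.protocol' "$ENV_FILE"; then
  printf 'export AMBARI_JVM_ARGS="$AMBARI_JVM_ARGS %s"\n' "$OPT" >> "$ENV_FILE"
fi
```

Re-running the snippet leaves the file unchanged, so it is safe in provisioning scripts; a restart of the server is needed for the JVM to pick up the new argument.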
[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493645#comment-16493645 ] David F. Quiroga commented on AMBARI-22642: --- New [PR #1398|https://github.com/apache/ambari/pull/1398] > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: easyfix, patch, pull-request-available > Fix For: 2.7.0 > > Attachments: ambari-22642.patch > > Original Estimate: 24h > Time Spent: 2h 40m > Remaining Estimate: 21h 20m > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, the LDAP connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system > (50+), which pushed us outside the "safe zone". > Fix: > Enable SSL connection pooling by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489314#comment-16489314 ] David F. Quiroga commented on AMBARI-22642: --- Fiddlesticks... space-delimited options... [~rlevas] Next steps here: convert to escaped \", re-test, and open a new PR? > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: easyfix, patch, pull-request-available > Fix For: 2.7.0 > > Attachments: ambari-22642.patch > > Original Estimate: 24h > Time Spent: 2.5h > Remaining Estimate: 21.5h > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, the LDAP connection only uses pooled connections when connecting to > a directory server over LDAP. 
Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. > * Around mid-October we increased the number of groups defined on the system > (50+), which pushed us outside the "safe zone". > Fix: > Enable SSL connection pooling by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga resolved AMBARI-23866. --- Resolution: Implemented Fix Version/s: 2.7.0 Pull request merged into trunk. Worth noting Robert's comment: {quote}Maybe a future enhancement will be to add properties so a user can adjust the number of retries and the timeout value between retries. {quote} > Kerberos Service Check failure due to kinit failure on random node > -- > > Key: AMBARI-23866 > URL: https://issues.apache.org/jira/browse/AMBARI-23866 > Project: Ambari > Issue Type: Improvement >Affects Versions: 2.5.2 > Environment: Multiple Kerberos Domain Controllers across multiple > data centers for single realm. >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: pull-request-available > Fix For: 2.7.0 > > Time Spent: 1h > Remaining Estimate: 0h > > We were seeing Kerberos Service check failures in Ambari. Specifically, it > would fail during the first run of the day, succeed on the second, then fail > on the next but succeed if run again, and so forth. > Reviewing the operation log, it showed kinit failures from random node(s): > {{kinit: Client not found in Kerberos database while getting initial > credentials}} > Since AMBARI-9852: > {quote}The service check must perform the following steps: > 1. Create a unique principal in the relevant KDC (server) > 2. Test that the principal can be used to authenticate via kinit (agent) > 3. Destroy the principal (server) > {quote} > Which is a very good check of services. > So what is happening... > In our environment we have multiple Kerberos Domain Controllers across > multiple data centers, all providing the same realm. > The creation of a unique principal occurs at a single KDC and is propagated > to the others. > The agents were testing the principal at a different KDC, i.e. before it had a > chance to propagate. This is why the second service check would succeed. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005)
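The merged change adds a retry around the kinit test; as a rough, hypothetical sketch of that idea (the function name, parameters, and defaults below are illustrative, not Ambari's actual code), the logic looks like:

```python
import time

def kinit_with_retry(run_kinit, tries=5, try_sleep=30):
    """Retry the kinit check while KDC replication catches up.

    run_kinit is a caller-supplied callable returning True on success;
    in Ambari it would wrap the real kinit invocation, here it is a
    stand-in. The defaults give roughly a 2.5 minute worst case, in line
    with the ~150 second bound discussed in the comments.
    """
    for attempt in range(1, tries + 1):
        if run_kinit():
            return True            # principal has replicated; check passes
        if attempt < tries:
            time.sleep(try_sleep)  # give replication time before retrying
    return False                   # still failing after all tries: real problem
```

A principal that replicates within the first minute succeeds on an early attempt without waiting out the full timeout, which is why most users never notice the retry.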
[jira] [Resolved] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga resolved AMBARI-22642. --- Resolution: Fixed Pull request merged into trunk > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: easyfix, patch, pull-request-available > Fix For: 2.7.0 > > Attachments: ambari-22642.patch > > Original Estimate: 24h > Time Spent: 2h 10m > Remaining Estimate: 21h 50m > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, the LDAP connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system > (50+), which pushed us outside the "safe zone". > Fix: > Enable SSL connection pooling by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481110#comment-16481110 ] David F. Quiroga commented on AMBARI-23866: --- We have about 10-20 KDC servers at 3-4 locations across the US. Analysis determined that it took about 1-2 minutes for a new principal to reach all KDCs in our environment. Basically, we started the service check and then LDAP-searched each host (in a code loop) for the new principal. I selected values based on that but would be open to changing them, in either direction. If the replication is taking more than 150 seconds, I think feedback to the users (AKA failure) is fair, as that seems like an unhealthy system. > Kerberos Service Check failure due to kinit failure on random node > -- > > Key: AMBARI-23866 > URL: https://issues.apache.org/jira/browse/AMBARI-23866 > Project: Ambari > Issue Type: Improvement >Affects Versions: 2.5.2 > Environment: Multiple Kerberos Domain Controllers across multiple > data centers for single realm. >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We were seeing Kerberos Service check failures in Ambari. Specifically, it > would fail during the first run of the day, succeed on the second, then fail > on the next but succeed if run again, and so forth. > Reviewing the operation log, it showed kinit failures from random node(s): > {{kinit: Client not found in Kerberos database while getting initial > credentials}} > Since AMBARI-9852: > {quote}The service check must perform the following steps: > 1. Create a unique principal in the relevant KDC (server) > 2. Test that the principal can be used to authenticate via kinit (agent) > 3. Destroy the principal (server) > {quote} > Which is a very good check of services. > So what is happening... 
> In our environment we have multiple Kerberos Domain Controllers across > multiple data centers, all providing the same realm. > The creation of a unique principal occurs at a single KDC and is propagated > to the others. > The agents were testing the principal at a different KDC, i.e. before it had a > chance to propagate. This is why the second service check would succeed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
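The propagation measurement described in the comment (looping a search against each KDC until the new principal appears) could be sketched as below. Here `principal_exists` is a hypothetical stand-in for the real ldapsearch call, and the timeout/interval values are illustrative, matching the ~150 second bound mentioned above.

```python
import time

def wait_for_replication(kdcs, principal_exists, timeout=150, interval=5):
    """Poll every KDC until the new principal is visible on all of them.

    principal_exists(kdc) is caller-supplied (e.g. it would wrap an
    ldapsearch against that host). Returns True once all KDCs see the
    principal, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    pending = set(kdcs)                      # KDCs that have not seen it yet
    while pending:
        pending = {kdc for kdc in pending if not principal_exists(kdc)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False                     # replication unhealthily slow
        time.sleep(interval)
    return True
```

Timing how long this loop runs per environment is one way to pick sensible retry counts and sleep values for the service check.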
[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481055#comment-16481055 ] David F. Quiroga commented on AMBARI-23866: --- [~rlevas], that's a good point on the process design. What concerns remain with including a retry? > Kerberos Service Check failure due to kinit failure on random node > -- > > Key: AMBARI-23866 > URL: https://issues.apache.org/jira/browse/AMBARI-23866 > Project: Ambari > Issue Type: Improvement >Affects Versions: 2.5.2 > Environment: Multiple Kerberos Domain Controllers across multiple > data centers for single realm. >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We were seeing Kerberos Service check failures in Ambari. Specifically, it > would fail during the first run of the day, succeed on the second, then fail > on the next but succeed if run again, and so forth. > Reviewing the operation log, it showed kinit failures from random node(s): > {{kinit: Client not found in Kerberos database while getting initial > credentials}} > Since AMBARI-9852: > {quote}The service check must perform the following steps: > 1. Create a unique principal in the relevant KDC (server) > 2. Test that the principal can be used to authenticate via kinit (agent) > 3. Destroy the principal (server) > {quote} > Which is a very good check of services. > So what is happening... > In our environment we have multiple Kerberos Domain Controllers across > multiple data centers, all providing the same realm. > The creation of a unique principal occurs at a single KDC and is propagated > to the others. > The agents were testing the principal at a different KDC, i.e. before it had a > chance to propagate. This is why the second service check would succeed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479304#comment-16479304 ] David F. Quiroga edited comment on AMBARI-23866 at 5/18/18 6:51 PM: [~rlevas] thanks for the feedback. An invalid password should result in {{Preauthentication failed while getting initial credentials}}; in this case we are seeing {{Client not found in Kerberos database}}, which would indicate the principal provided does not exist. So I am not sure if the failure would trigger the use of the {{master-kdc}}. Over the last year here at work they deployed new Active Directory Domain Controllers and retired the old ones. With that we learned that {{kerberos-env\ldap_url}} had been set to a single AD server rather than the DNS name. From that point on we really try to avoid referencing a single AD server. RE: latency of the replication process. I like the retry because if the latency is small the service check will not have to wait the maximum time, i.e. most users are not affected by the addition of the retry. And true, we can't guarantee that we are waiting long enough for every environment, but if it is taking more than 2 minutes it should be fair to alert on that. Another thing we noticed is that if the test via kinit fails, the clean-up (Destroy the principal) does not happen. Meaning the principals are still out in AD and the keytabs are on the clients. Re-running the service check on the same day will succeed and clean those up. -but that is not an ideal process.- was (Author: quirogadf): [~rlevas] thanks for the feedback. An invalid password should result in {{Preauthentication failed while getting initial credentials}}; in this case we are seeing {{Client not found in Kerberos database}}, which would indicate the principal provided does not exist. So I am not sure if the failure would trigger the use of the {{master-kdc}}. 
Over the last year here at work they deployed new Active Directory Domain Controllers and retired the old ones. With that we learned that {{kerberos-env\ldap_url}} had been set to a single AD server rather than the DNS name. From that point on we really try to avoid referencing a single AD server. RE: latency of the replication process. I like the retry because if the latency is small the service check will not have to wait the maximum time, i.e. most users are not affected by the addition of the retry. And true, we can't guarantee that we are waiting long enough for every environment, but if it is taking more than 2 minutes it should be fair to alert on that. Another thing we noticed is that if the test via kinit fails, the clean-up (Destroy the principal) does not happen. Meaning the principals are still out in AD and the keytabs are on the clients. Re-running the service check on the same day will succeed and clean those up, but that is not an ideal process. > Kerberos Service Check failure due to kinit failure on random node > -- > > Key: AMBARI-23866 > URL: https://issues.apache.org/jira/browse/AMBARI-23866 > Project: Ambari > Issue Type: Improvement >Affects Versions: 2.5.2 > Environment: Multiple Kerberos Domain Controllers across multiple > data centers for single realm. >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We were seeing Kerberos Service check failures in Ambari. Specifically, it > would fail during the first run of the day, succeed on the second, then fail > on the next but succeed if run again, and so forth. 
> Reviewing the operation log, it showed kinit failures from random node(s): > {{kinit: Client not found in Kerberos database while getting initial > credentials}} > Since AMBARI-9852: > {quote}The service check must perform the following steps: > 1. Create a unique principal in the relevant KDC (server) > 2. Test that the principal can be used to authenticate via kinit (agent) > 3. Destroy the principal (server) > {quote} > Which is a very good check of services. > So what is happening... > In our environment we have multiple Kerberos Domain Controllers across > multiple data centers, all providing the same realm. > The creation of a unique principal occurs at a single KDC and is propagated > to the others. > The agents were testing the principal at a different KDC, i.e. before it had a > chance to propagate. This is why the second service check would succeed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479304#comment-16479304 ] David F. Quiroga commented on AMBARI-23866: --- [~rlevas] thanks for feedback. Invalid password should result in {{Preauthentication failed while getting initial credentials}}, in this case we are seeing {{Client not found in Kerberos database}} which would indicate the principal provided does not exist. So I am not sure if the failure would trigger the use of the {{master-kdc}}. Over the last year here at work they deployed new Active Directory Domain Controllers and retired the old ones. With that we learned that {{kerberos-env\ldap_url}} had been to a single AD server rather than the DNS name. From that point on we really try to avoid referencing a single AD server. RE: latency of the replication process. I like the retry because if the latency is small the service check will not have to wait a maximum time i.e. most users are not affected by the addition of the retry. And true, we can't guarantee that we are waiting long enough for every environment but if it is taking more than 2+ minutes it should be fair to alert on that. Another thing we noticed is that if the test via kinit fails, the clean-up (Destroy the principal) does not happen. Meaning the principals are still out in AD and the keytabs are on the clients. Re-running the service check on the same day will succeed and clean those up, but that is not an ideal process. > Kerberos Service Check failure due to kinit failure on random node > -- > > Key: AMBARI-23866 > URL: https://issues.apache.org/jira/browse/AMBARI-23866 > Project: Ambari > Issue Type: Improvement >Affects Versions: 2.5.2 > Environment: Multiple Kerberos Domain Controllers across multiple > data centers for single realm. >Reporter: David F. Quiroga >Assignee: David F. 
Quiroga >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We were seeing Kerberos Service Check failures in Ambari. Specifically it > would fail during the first run of the day, succeed on the second, then fail > on the next but succeed if run again, and so forth. > Reviewing the operation log, it showed a kinit failure from random node(s) > {{kinit: Client not found in Kerberos database while getting initial > credentials}} > Since AMBARI-9852 > {quote}The service check must perform the following steps: > 1.Create a unique principal in the relevant KDC (server) > 2.Test that the principal can be used to authenticate via kinit (agent) > 3.Destroy the principal (server) > {quote} > Which is a very good check of services. > So what is happening... > In our environment we have multiple Kerberos Domain Controllers across > multiple data centers all providing the same realm. > The creation of a unique principal occurs at a single KDC and is propagated > to the others. > The agents were testing the principal at a different KDC, i.e. before it had a > chance to propagate. This is why the second service check would succeed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478180#comment-16478180 ] David F. Quiroga commented on AMBARI-23866: --- My first "fix" was to add a sleep before testing the principal; this worked, but I believe the better way is to add a retry to the Execute. Pull Request on the way. > Kerberos Service Check failure due to kinit failure on random node > -- > > Key: AMBARI-23866 > URL: https://issues.apache.org/jira/browse/AMBARI-23866 > Project: Ambari > Issue Type: Improvement >Affects Versions: 2.5.2 > Environment: Multiple Kerberos Domain Controllers across multiple > data centers for single realm. >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > > We were seeing Kerberos Service Check failures in Ambari. Specifically it > would fail during the first run of the day, succeed on the second, then fail > on the next but succeed if run again, and so forth. > Reviewing the operation log, it showed a kinit failure from random node(s) > {{kinit: Client not found in Kerberos database while getting initial > credentials}} > Since AMBARI-9852 > {quote}The service check must perform the following steps: > 1.Create a unique principal in the relevant KDC (server) > 2.Test that the principal can be used to authenticate via kinit (agent) > 3.Destroy the principal (server) > {quote} > Which is a very good check of services. > So what is happening... > In our environment we have multiple Kerberos Domain Controllers across > multiple data centers all providing the same realm. > The creation of a unique principal occurs at a single KDC and is propagated > to the others. > The agents were testing the principal at a different KDC, i.e. before it had a > chance to propagate. This is why the second service check would succeed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node
David F. Quiroga created AMBARI-23866: - Summary: Kerberos Service Check failure due to kinit failure on random node Key: AMBARI-23866 URL: https://issues.apache.org/jira/browse/AMBARI-23866 Project: Ambari Issue Type: Improvement Affects Versions: 2.5.2 Environment: Multiple Kerberos Domain Controllers across multiple data centers for single realm. Reporter: David F. Quiroga Assignee: David F. Quiroga We were seeing Kerberos Service Check failures in Ambari. Specifically it would fail during the first run of the day, succeed on the second, then fail on the next but succeed if run again, and so forth. Reviewing the operation log, it showed a kinit failure from random node(s) {{kinit: Client not found in Kerberos database while getting initial credentials}} Since AMBARI-9852 {quote}The service check must perform the following steps: 1.Create a unique principal in the relevant KDC (server) 2.Test that the principal can be used to authenticate via kinit (agent) 3.Destroy the principal (server) {quote} Which is a very good check of services. So what is happening... In our environment we have multiple Kerberos Domain Controllers across multiple data centers all providing the same realm. The creation of a unique principal occurs at a single KDC and is propagated to the others. The agents were testing the principal at a different KDC, i.e. before it had a chance to propagate. This is why the second service check would succeed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
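The retry approach that eventually fixed this can be sketched in plain Python. This is an illustrative stand-in, not the actual patch: in the real service check the retry is expressed through the Execute call that runs kinit, while `kinit_with_retry` and its `run` hook here are hypothetical names for demonstration.

```python
import subprocess
import time

def kinit_with_retry(principal, keytab, tries=5, try_sleep=30, run=None):
    """Retry kinit so a freshly created principal has time to replicate
    to whichever KDC this agent happens to reach.

    `run` is injectable for testing; by default it shells out to kinit.
    Returns the attempt number on which authentication succeeded.
    """
    if run is None:
        run = lambda: subprocess.call(["kinit", "-kt", keytab, principal]) == 0
    for attempt in range(1, tries + 1):
        if run():
            return attempt
        if attempt < tries:
            time.sleep(try_sleep)  # give KDC replication time to catch up
    raise RuntimeError("kinit failed after %d tries" % tries)
```

With `tries=5` and `try_sleep=30` the check waits at most about two minutes before failing, which lines up with the "fair to alert after 2 minutes" threshold discussed in the comments; when replication is fast, the first attempt succeeds and nobody waits at all.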
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642: -- Fix Version/s: 2.7.0 Status: In Progress (was: Patch Available) Starting work on a pull request into trunk. > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Fix For: 2.7.0 > > Attachments: ambari-22642.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" LDAP authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, LDAP connections are only pooled when connecting to > a directory server over plain LDAP. Enabling SSL disables the pooling, > resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system > (50+); this pushed us outside the "safe zone". > Fix: > Enable pooling of SSL connections by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
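Concretely, the fix amounts to one line in the Ambari server's environment file. The file name and variable below (ambari-env.sh, AMBARI_JVM_ARGS) reflect a typical install and should be verified against your version; the flag itself is exactly the one from this issue, and the single quotes matter because the property value contains a space.

```shell
# Sketch: append the JNDI pool protocol flag to the Ambari server JVM options
# (file location and variable name vary by Ambari version; commonly
# /var/lib/ambari-server/ambari-env.sh).
export AMBARI_JVM_ARGS="$AMBARI_JVM_ARGS -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'"
```

An `ambari-server restart` is needed afterward for the new JVM argument to take effect.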
[jira] [Assigned] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga reassigned AMBARI-22642: - Assignee: David F. Quiroga > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Assignee: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Attachments: ambari-22642.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" LDAP authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, LDAP connections are only pooled when connecting to > a directory server over plain LDAP. Enabling SSL disables the pooling, > resulting in poorer performance and failures due to connection resets. > * Around mid-October we increased the number of groups defined on the system > (50+); this pushed us outside the "safe zone". 
> Fix: > Enable pooling of SSL connections by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459800#comment-16459800 ] David F. Quiroga commented on AMBARI-22642: --- Can I get access to assign this to myself? I plan to create a Pull Request. > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Attachments: ambari-22642.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" LDAP authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list of groups. We had no issues with this until > mid-October, at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port, ldap sync was consistently successful. > Cause: > * By default, LDAP connections are only pooled when connecting to > a directory server over plain LDAP. Enabling SSL disables the pooling, > resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system > (50+); this pushed us outside the "safe zone". > Fix: > Enable pooling of SSL connections by adding the argument below to the startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed
[ https://issues.apache.org/jira/browse/AMBARI-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-23382: -- Affects Version/s: (was: 2.5.3) 2.6.1 Environment: Python 2.7.5-58 or greater > Ambari-server sync-ldap: Sync event creation failed > --- > > Key: AMBARI-23382 > URL: https://issues.apache.org/jira/browse/AMBARI-23382 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.6.1, trunk, 2.6.2 > Environment: Python 2.7.5-58 or greater >Reporter: David F. Quiroga >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > As described here [ambari-server sync-ldap no longer > working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html] > sync-ldap fails with > {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' > doesn't match }} > As pointed out by [Berry Osterlund > |https://community.hortonworks.com/users/13196/berryosterlund.html] this is > because the default behavior for SSL cert verification changed in Python > 2.7.5-58. > > ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'" > Thinking we can provide the full hostname dynamically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed
[ https://issues.apache.org/jira/browse/AMBARI-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416264#comment-16416264 ] David F. Quiroga commented on AMBARI-23382: --- Started a pull request [https://github.com/apache/ambari/pull/806] > Ambari-server sync-ldap: Sync event creation failed > --- > > Key: AMBARI-23382 > URL: https://issues.apache.org/jira/browse/AMBARI-23382 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.3, trunk, 2.6.2 >Reporter: David F. Quiroga >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > As described here [ambari-server sync-ldap no longer > working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html] > sync-ldap fails with > {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' > doesn't match }} > As pointed out by [Berry Osterlund > |https://community.hortonworks.com/users/13196/berryosterlund.html] this is > because the default behavior for SSL cert verification changed in Python > 2.7.5-58. > > ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'" > Thinking we can provide the full hostname dynamically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed
[ https://issues.apache.org/jira/browse/AMBARI-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416167#comment-16416167 ] David F. Quiroga commented on AMBARI-23382: --- Starting work on this. But I don't think I have access to assign it to myself or update the status. > Ambari-server sync-ldap: Sync event creation failed > --- > > Key: AMBARI-23382 > URL: https://issues.apache.org/jira/browse/AMBARI-23382 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.3, trunk, 2.6.2 >Reporter: David F. Quiroga >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > As described here [ambari-server sync-ldap no longer > working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html] > sync-ldap fails with > {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' > doesn't match }} > As pointed out by [Berry Osterlund > |https://community.hortonworks.com/users/13196/berryosterlund.html] this is > because the default behavior for SSL cert verification changed in Python > 2.7.5-58. > > ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'" > Thinking we can provide the full hostname dynamically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed
David F. Quiroga created AMBARI-23382: - Summary: Ambari-server sync-ldap: Sync event creation failed Key: AMBARI-23382 URL: https://issues.apache.org/jira/browse/AMBARI-23382 Project: Ambari Issue Type: Bug Components: ambari-server Affects Versions: 2.5.3, trunk, 2.6.2 Reporter: David F. Quiroga As described here [ambari-server sync-ldap no longer working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html] sync-ldap fails with {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' doesn't match }} As pointed out by [Berry Osterlund |https://community.hortonworks.com/users/13196/berryosterlund.html] this is because the default behavior for SSL cert verification changed in Python 2.7.5-58. ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'" Thinking we can provide the full hostname dynamically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
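The dynamic-hostname idea can be sketched as follows. `get_server_api_host` is a hypothetical helper name for illustration; the real change would replace the hardcoded `SERVER_API_HOST` constant in ambari_server/serverUtils.py with a runtime lookup.

```python
import socket

def get_server_api_host():
    # Resolve this host's fully-qualified name at runtime so the hostname
    # the REST client presents matches the name on the server's SSL
    # certificate, instead of the hardcoded '127.0.0.1' that newer Python
    # builds reject during certificate verification.
    return socket.getfqdn()
```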
[jira] [Created] (AMBARI-23026) WEB type alerts authentication in Kerberos secured cluster
David F. Quiroga created AMBARI-23026: - Summary: WEB type alerts authentication in Kerberos secured cluster Key: AMBARI-23026 URL: https://issues.apache.org/jira/browse/AMBARI-23026 Project: Ambari Issue Type: Bug Components: alerts Affects Versions: 2.5.2, trunk, 2.6.2 Environment: Ambari 2.5.2 Hortonworks HDP-2.5.3.0-37 Reporter: David F. Quiroga In a Kerberized cluster some web endpoints (App Timeline Web UI, ResourceManager Web UI, etc.) require authentication. Any Ambari alerts checking those endpoints must then be able to authenticate. This was addressed in AMBARI-9586; however, the default principal and keytab used in the alerts.json is that of the "bare" SPNEGO principal HTTP/_HOST@REALM. My understanding is that the HTTP service principal is used to authenticate users to a service, not to authenticate to another service. 1. Since most endpoints involved are Web UIs, would it be more appropriate to use the smokeuser in the alerts? 2. This was first observed in Ranger Audit: the YARN Ranger Plug-in showed many access-denied events from the HTTP user. [This post|https://community.hortonworks.com/content/supportkb/150206/ranger-audit-logs-refers-to-access-denied-for-http.html] provided some direction as to where those requests were coming from. We have updated the ResourceManager Web UI alert definition to use cluster-env/smokeuser_keytab and cluster-env/smokeuser_principal_name and this has resolved the initial HTTP access denied. Would it also be advisable to make the change in the other secure Web UI alert definitions? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
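As an illustration, the change amounts to swapping the keytab/principal references inside the alert definition's uri block. The fragment below is a sketch in the style of Ambari's alerts.json, not a verbatim excerpt: the http/https property names are examples, and the two kerberos_* values are the cluster-env references mentioned above.

```json
{
  "uri": {
    "http": "{{yarn-site/yarn.resourcemanager.webapp.address}}",
    "https": "{{yarn-site/yarn.resourcemanager.webapp.https.address}}",
    "kerberos_keytab": "{{cluster-env/smokeuser_keytab}}",
    "kerberos_principal": "{{cluster-env/smokeuser_principal_name}}"
  }
}
```

The same two-line substitution would apply to any other WEB type alert that probes a SPNEGO-protected UI.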
[jira] [Created] (AMBARI-22708) Ranger HDFS logging health Ambari Alert
David F. Quiroga created AMBARI-22708: - Summary: Ranger HDFS logging health Ambari Alert Key: AMBARI-22708 URL: https://issues.apache.org/jira/browse/AMBARI-22708 Project: Ambari Issue Type: New Feature Components: alerts Environment: HDP 2.5.3.0 Reporter: David F. Quiroga Priority: Trivial Attachments: alert_ranger_hdfs_logging.json, alert_ranger_knox_logging.json, alert_ranger_logging.py First some background: We were directed to retain audit/access records "forever" (technically 7 years, but that is basically forever in electronic log time). Each Hadoop component generates local audit logs as per its log4j settings. In our production system these logs would frequently fill up the disk. At first we would just compress them in place, but that only works for so long, and there was no redundancy with local disk storage. In other words, no long-term plan. We started to discuss moving them to HDFS or a different storage solution. One of our team members pointed out that the Ranger plugins are already logging the "same data" into HDFS. Probably after several meetings with the higher-ups, using Ranger logs as the record of truth was approved. Components' log4j settings were updated to purge data automatically. Purging local logs felt like operating without a safety net. Thought it would be good to check that Ranger was successfully logging to HDFS each day. Should mention this is a Kerberized cluster, not that anything ever goes wrong with Kerberos. Checking this would certainly have been possible with a shell script, but we have been pushing to centralize warnings/alerts in Ambari. And so an Ambari alert Python script to check on Ranger Logging Health was crafted. For the most part the alert was modeled after some of the Hive alerts. At the moment it just checks that the daily /ranger/audit/ HDFS directory has been created. I am attaching the host script and the alert.json for HDFS and Knox components. 
In the alert.json, service_name and component_name should be set to local values. Everything else should "work out of the box". -- This message was sent by Atlassian JIRA (v6.4.14#64029)
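The core of the daily check can be sketched like this. The helper name, the `runner` hook, and the /ranger/audit/<component>/<YYYYMMDD> layout are assumptions based on Ranger's default HDFS audit destination, not an excerpt from the attached alert_ranger_logging.py; adjust the path for your cluster.

```python
import subprocess
from datetime import date

def ranger_audit_dir_exists(component, day=None, runner=subprocess.call):
    """Return True if the given day's Ranger audit directory exists in HDFS.

    `runner` is injectable for testing; by default it shells out to hdfs.
    """
    day = day or date.today().strftime("%Y%m%d")
    path = "/ranger/audit/%s/%s" % (component, day)
    # `hdfs dfs -test -d` exits 0 when the directory exists
    return runner(["hdfs", "dfs", "-test", "-d", path]) == 0
```

In a Kerberized cluster the alert would also need to kinit as a suitable user before running the check, which is part of what the attached script handles.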
[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290996#comment-16290996 ] David F. Quiroga commented on AMBARI-22642: --- New tests shouldn't be needed. Only updating JVM opts here. > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Attachments: ambari-22642.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list groups. We had no issues with this until > mid-October at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port ldap sync was consistently successful. > Cause: > * By default, ldap connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system > (50+), this pushed us outside the "safe zone". > Fix: > Enable the SSL connections pooling by adding the below argument to startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642: -- Attachment: ambari-22642.patch Regenerated the patch using git diff > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Attachments: ambari-22642.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list groups. We had no issues with this until > mid-October at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port ldap sync was consistently successful. > Cause: > * By default, ldap connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. > * Around mid-October we increased the number of groups defined on the system > (50+), this pushed us outside the "safe zone". 
> Fix: > Enable the SSL connections pooling by adding the below argument to startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642: -- Attachment: (was: ambari-22642.patch) > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list groups. We had no issues with this until > mid-October at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port ldap sync was consistently successful. > Cause: > * By default, ldap connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. > * Around mid-October we increased the number of groups defined on the system > (50+), this pushed us outside the "safe zone". > Fix: > Enable the SSL connections pooling by adding the below argument to startup > options. 
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642: -- Attachment: (was: ambari-env.patch) > LDAPS sync Connection Refused > -- > > Key: AMBARI-22642 > URL: https://issues.apache.org/jira/browse/AMBARI-22642 > Project: Ambari > Issue Type: Bug > Components: ambari-server >Affects Versions: 2.5.0 > Environment: java version "1.8.0_121" > Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) > AD Domain Controllers > LDAP v.3 > 2012 R2 OS >Reporter: David F. Quiroga >Priority: Minor > Labels: easyfix, patch > Attachments: ambari-22642.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Ambari server configured to use "secure" ldap authentication. > authentication.ldap.primaryUrl=:636 > authentication.ldap.useSSL=true > We call the ldap_sync_events REST endpoint frequently to synchronize > existing groups and a specific list groups. We had no issues with this until > mid-October at which point we began to see: > {code} > "status" : "ERROR", > "status_detail" : "Caught exception running LDAP sync. simple bind > failed: **:636; nested exception is > javax.naming.CommunicationException: simple bind failed: **:636 [Root > exception is java.net.SocketException: Connection reset]", > {code} > Troubleshooting: > * We saw random success and failure when attempting to sync a single group. > * With useSSL=false and an updated port ldap sync was consistently successful. > Cause: > * By default, ldap connection only uses pooled connections when connecting to > a directory server over LDAP. Enabling SSL causes it to disable the pooling, > resulting in poorer performance and failures due to connection resets. > * Around mid-October we increased the number of groups defined on the system > (50+), this pushed us outside the "safe zone". 
> Fix: > Enable the SSL connections pooling by adding the below argument to startup > options. > -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl' > Reference: > [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm] > [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html] > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642:
--
Attachment: ambari-22642.patch

Change filename

> LDAPS sync Connection Refused
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers
> LDAP v.3
> 2012 R2 OS
> Reporter: David F. Quiroga
> Priority: Minor
> Labels: easyfix, patch
> Attachments: ambari-22642.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> The Ambari server is configured to use "secure" LDAP authentication:
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
> We call the ldap_sync_events REST endpoint frequently to synchronize existing groups and a specific list of groups. We had no issues with this until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind failed: **:636; nested exception is javax.naming.CommunicationException: simple bind failed: **:636 [Root exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting:
> * We saw random success and failure when attempting to sync a single group.
> * With useSSL=false and an updated port, LDAP sync was consistently successful.
> Cause:
> * By default, JNDI pools connections only when talking to a directory server over plain LDAP. Enabling SSL disables pooling, resulting in poorer performance and failures due to connection resets.
> * Around mid-October we increased the number of groups defined on the system (50+), which pushed us outside the "safe zone".
> Fix:
> Enable pooling of SSL connections by adding the argument below to the server startup options:
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference:
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
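For reference, the periodic sync call described in the report can be sketched as below. The host and credentials are placeholders, and the request body for syncing "existing" users and groups follows the shape Ambari's ldap_sync_events endpoint accepts; the actual curl invocation is left commented out since it needs a live Ambari server.

```shell
# Request body asking Ambari to re-sync all existing groups and users
# (the call pattern the report describes).
BODY='[{"Event":{"specs":[{"principal_type":"groups","sync_type":"existing"},{"principal_type":"users","sync_type":"existing"}]}}]'

# Hypothetical host/credentials; a real run needs a live Ambari server:
#   curl -u admin -H "X-Requested-By: ambari" -X POST \
#        "https://ambari.example.com:8443/api/v1/ldap_sync_events" -d "$BODY"
printf '%s\n' "$BODY"
```

With many groups defined, each such call opens multiple LDAP connections, which is why the connection resets only appeared once the group count grew past 50.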
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642:
--
Attachment: ambari-env.patch
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AMBARI-22642) LDAPS sync Connection Refused
David F. Quiroga created AMBARI-22642:
--
Summary: LDAPS sync Connection Refused
Key: AMBARI-22642
URL: https://issues.apache.org/jira/browse/AMBARI-22642
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: 2.5.0
Environment: java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
AD Domain Controllers
LDAP v.3
2012 R2 OS
Reporter: David F. Quiroga
Priority: Minor
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused
[ https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David F. Quiroga updated AMBARI-22642:
--
Status: Patch Available (was: Open)

Add javaopts to /var/lib/ambari-server/ambari-env.sh
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
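The env change this patch makes can be sketched roughly as follows. The exact shape of the real patch is a guess; this writes to a temp file rather than the actual /var/lib/ambari-server/ambari-env.sh so the example is safe to run.

```shell
# Append the SSL connection-pooling flag to AMBARI_JVM_ARGS.
# Writing to a temp copy here instead of the real
# /var/lib/ambari-server/ambari-env.sh named in the status update.
ENV_FILE="$(mktemp)"
echo "export AMBARI_JVM_ARGS=\"\$AMBARI_JVM_ARGS -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'\"" >> "$ENV_FILE"
cat "$ENV_FILE"
```

Because ambari-env.sh is sourced at server startup, the flag takes effect on the next `ambari-server restart`, enabling JNDI to pool LDAPS connections alongside plain ones.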