[jira] [Commented] (AMBARI-24123) Grafana System-Servers Dashboard Disk IO/IOPS shows negative values

2018-06-18 Thread David F. Quiroga (JIRA)


[ 
https://issues.apache.org/jira/browse/AMBARI-24123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515718#comment-16515718
 ] 

David F. Quiroga commented on AMBARI-24123:
---

Possibly related to AMBARI-23008, completed in AMBARI-23932 

> Grafana System-Servers Dashboard Disk IO/IOPS shows negative values
> ---
>
> Key: AMBARI-24123
> URL: https://issues.apache.org/jira/browse/AMBARI-24123
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-metrics
>Affects Versions: 2.6.2
> Environment: HDP 2.6.4.0-91
> Ambari 2.6.2.0
> SLES 12
>  
> Side note: we did not observe this in HDP 2.5.3 / Ambari 2.5.2
>Reporter: David F. Quiroga
>Priority: Trivial
> Attachments: disk_stat_2hr.jpg, disk_stat_6hr.jpg
>
>
> Grafana > System-Servers Dashboard 
> Charts > Disk IO - Read Bytes, Write Bytes, Disk IOPS - Read Count, Write 
> Count
> All display negative values for one or more hosts when viewed over a large 
> time period (6+ hrs).
> Attached screenshots are for a single host. The 6 hr view shows negative 
> values, but "zooming in" on the same time period (2 hr) shows no negative 
> values. 
> Therefore I suspect it relates to the aggregation of metrics rather than the 
> collection itself. 
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AMBARI-24123) Grafana System-Servers Dashboard Disk IO/IOPS shows negative values

2018-06-15 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-24123:
-

 Summary: Grafana System-Servers Dashboard Disk IO/IOPS shows 
negative values
 Key: AMBARI-24123
 URL: https://issues.apache.org/jira/browse/AMBARI-24123
 Project: Ambari
  Issue Type: Bug
  Components: ambari-metrics
Affects Versions: 2.6.2
 Environment: HDP 2.6.4.0-91

Ambari 2.6.2.0

SLES 12

 

Side note: we did not observe this in HDP 2.5.3 / Ambari 2.5.2
Reporter: David F. Quiroga
 Attachments: disk_stat_2hr.jpg, disk_stat_6hr.jpg

Grafana > System-Servers Dashboard 

Charts > Disk IO - Read Bytes, Write Bytes, Disk IOPS - Read Count, Write Count

All display negative values for one or more hosts when viewed over a large time 
period (6+ hrs).

Attached screenshots are for a single host. The 6 hr view shows negative values, 
but "zooming in" on the same time period (2 hr) shows no negative values. 

Therefore I suspect it relates to the aggregation of metrics rather than the 
collection itself. 





[jira] [Updated] (AMBARI-24079) ​Configuring Storm for Supervision

2018-06-11 Thread David F. Quiroga (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMBARI-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-24079:
--
Description: 
[Configuring Storm for 
Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html]

Currently under 
 
[{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts]
 there are {{supervisor.py}} and {{supervisor_prod.py}} (similarly for Nimbus). 
When configuring Storm for supervision, you update the {{metainfo.xml}} to 
reference the {{_prod.py}} files.

During a recent cluster upgrade (the {{metainfo.xml}} changes were lost), we 
looked at combining the two files so that the scripts check for supervision 
support and use it when available.

The "decision" to be supervised then occurs at the node level, and therefore 
can be managed at the node-level rather than at the service/whole-cluster 
level. 

 

Currently we perform a basic check (shown below) for support before each action 
(start, stop, status). A better way might be to do a conditional import. 

 
{code}
def component_supported(component_name):
    return_code, output = shell.call(("supervisorctl", "status",
                                      format("storm-{component_name}")))
    if return_code == 0 and 'ERROR' not in output:
        # return code of 0 if the program is installed and the component is configured
        return True
    else:
        # non-zero return code if the program is not installed or the component is not configured
        return False
{code}
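The conditional-import idea could be sketched roughly as below. This is illustrative only, not the actual Ambari scripts: {{subprocess}} stands in for Ambari's {{shell.call}}, and the module names in the trailing comment are hypothetical.

```python
import subprocess

def component_supported(component_name, runner=subprocess.run):
    """True when supervisord reports a storm-<component_name> program."""
    try:
        result = runner(("supervisorctl", "status", "storm-%s" % component_name),
                        capture_output=True, text=True)
    except OSError:
        return False  # supervisorctl is not installed on this node
    return result.returncode == 0 and "ERROR" not in result.stdout

# Decide once, at import time, which implementation this node should use,
# instead of shelling out before every start/stop/status action:
# if component_supported("supervisor"):
#     from supervisor_prod import Supervisor   # supervised lifecycle
# else:
#     from supervisor import Supervisor        # plain lifecycle
```

The injectable {{runner}} keeps the check testable without supervisord present.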
 

  was:
[Configuring Storm for 
Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html]

Currently under 
 
[{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts]
 there are {{supervisor.py}} and {{supervisor_prod.py}} (similarly for Nimbus). 
When configuring Storm for supervision, you update the {{metainfo.xml}} to 
reference the {{_prod.py}} files.

During a recent cluster upgrade (the {{metainfo.xml}} changes were lost), we 
looked at combining the two files so that the scripts check for supervision 
support and use it when available.

The "decision" to be supervised then occurs at the node level, and therefore 
can be managed at the node-level rather than at the service/whole-cluster 
level. 

 

Currently we perform a basic check (shown below) for support before each action 
(start, stop, status). A better way might be to do a conditional import. 

 
{code}
def component_supported(component_name):
    return_code, output = shell.call(("supervisorctl", "status",
                                      format("storm-{component_name}")))
    if return_code == 0 and 'ERROR' not in output:
        # return code of 0 if the program is installed and the component is configured
        return True
    else:
        # non-zero return code if the program is not installed or the component is not configured
        return False
{code}

  


> ​Configuring Storm for Supervision
> --
>
> Key: AMBARI-24079
> URL: https://issues.apache.org/jira/browse/AMBARI-24079
> Project: Ambari
>  Issue Type: Improvement
>  Components: ambari-server
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Trivial
>
> [Configuring Storm for 
> Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html]
> Currently under 
>  
> [{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts]
>  there are {{supervisor.py}} and {{supervisor_prod.py}} (similarly for Nimbus). 
> When configuring Storm for supervision, you update the {{metainfo.xml}} to 
> reference the {{_prod.py}} files.
> During a recent cluster upgrade (the {{metainfo.xml}} changes were lost), we 
> looked at combining the two files so that the scripts check for supervision 
> support and use it when available.
> The "decision" to be supervised then occurs at the node level, and therefore 
> can be managed at the node-level rather than at the service/whole-cluster 
> level. 
>  
> Currently we perform a basic check (shown below) for support before each 
> action (start, stop, status). A better way might be to do a conditional 
> import. 
>  
> {code}
> def component_supported(component_name):
>     return_code, output = shell.call(("supervisorctl", "status",
>                                       format("storm-{component_name}")))
>     if return_code == 0 and 'ERROR' not in output:
>         # return code of 0 if the program is installed and the component is configured
>         return True
>     else:
>         # non-zero return code if the program is not installed or the component is not configured
>         return False
> {code}

[jira] [Created] (AMBARI-24079) ​Configuring Storm for Supervision

2018-06-11 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-24079:
-

 Summary: ​Configuring Storm for Supervision
 Key: AMBARI-24079
 URL: https://issues.apache.org/jira/browse/AMBARI-24079
 Project: Ambari
  Issue Type: Improvement
  Components: ambari-server
Reporter: David F. Quiroga
Assignee: David F. Quiroga


[Configuring Storm for 
Supervision|https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_storm-component-guide/content/config-storm-supv.html]

Currently under 
 
[{{STORM/0.9.1/package/scripts}}|https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/common-services/STORM/0.9.1/package/scripts]
 there are {{supervisor.py}} and {{supervisor_prod.py}} (similarly for Nimbus). 
When configuring Storm for supervision, you update the {{metainfo.xml}} to 
reference the {{_prod.py}} files.

During a recent cluster upgrade (the {{metainfo.xml}} changes were lost), we 
looked at combining the two files so that the scripts check for supervision 
support and use it when available.

The "decision" to be supervised then occurs at the node level, and therefore 
can be managed at the node-level rather than at the service/whole-cluster 
level. 

 

Currently we perform a basic check (shown below) for support before each action 
(start, stop, status). A better way might be to do a conditional import. 

 
{code}
def component_supported(component_name):
    return_code, output = shell.call(("supervisorctl", "status",
                                      format("storm-{component_name}")))
    if return_code == 0 and 'ERROR' not in output:
        # return code of 0 if the program is installed and the component is configured
        return True
    else:
        # non-zero return code if the program is not installed or the component is not configured
        return False
{code}

  





[jira] [Resolved] (AMBARI-22642) LDAPS sync Connection Refused

2018-06-07 Thread David F. Quiroga (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga resolved AMBARI-22642.
---
Resolution: Fixed

PR #1398 merged
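For anyone scripting the sync described in this issue, the body POSTed to the {{ldap_sync_events}} endpoint can be built roughly as below. This is a sketch: the spec field names should be checked against the Ambari API documentation for your version, and the group names in the usage are placeholders.

```python
import json

def ldap_sync_body(specific_groups):
    """Build the event body that asks Ambari to sync existing users/groups
    plus a specific list of groups (POSTed to /api/v1/ldap_sync_events)."""
    return json.dumps([{
        "Event": {
            "specs": [
                {"principal_type": "users", "sync_type": "existing"},
                {"principal_type": "groups", "sync_type": "specific",
                 "names": ",".join(specific_groups)},  # comma-separated list
            ]
        }
    }])
```

Calling this frequently (as we do) is exactly the workload that exercised the connection pool.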

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch, pull-request-available
> Fix For: 2.7.0
>
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Time Spent: 3h 10m
>  Remaining Estimate: 20h 50m
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, the LDAP client only uses pooled connections when connecting to 
> a directory server over plain LDAP. Enabling SSL disables the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+); this pushed us outside the "safe zone".
> Fix:
> Enable pooling for SSL connections by adding the argument below to the 
> startup options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused

2018-05-29 Thread David F. Quiroga (JIRA)


[ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493645#comment-16493645
 ] 

David F. Quiroga commented on AMBARI-22642:
---

New [PR #1398|https://github.com/apache/ambari/pull/1398]

 

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch, pull-request-available
> Fix For: 2.7.0
>
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Time Spent: 2h 40m
>  Remaining Estimate: 21h 20m
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, the LDAP client only uses pooled connections when connecting to 
> a directory server over plain LDAP. Enabling SSL disables the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+); this pushed us outside the "safe zone".
> Fix:
> Enable pooling for SSL connections by adding the argument below to the 
> startup options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused

2018-05-24 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489314#comment-16489314
 ] 

David F. Quiroga commented on AMBARI-22642:
---

Fiddlesticks... Space delimited options... 

[~rlevas] 

Next steps here: convert to escape \", re-test and new PR? 
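The space-delimited value is easy to trip over: without quoting, the shell splits the single -D option into two arguments. A small stdlib illustration of the problem:

```python
import shlex

# Unquoted, the space breaks the option in half; quoted, it survives intact.
unquoted = "-Dcom.sun.jndi.ldap.connect.pool.protocol=plain ssl"
quoted = "-Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'"

print(shlex.split(unquoted))  # two argv entries: option is broken
print(shlex.split(quoted))    # one argv entry with the space preserved
```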

 

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch, pull-request-available
> Fix For: 2.7.0
>
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Time Spent: 2.5h
>  Remaining Estimate: 21.5h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, the LDAP client only uses pooled connections when connecting to 
> a directory server over plain LDAP. Enabling SSL disables the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+); this pushed us outside the "safe zone".
> Fix:
> Enable pooling for SSL connections by adding the argument below to the 
> startup options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Resolved] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-21 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga resolved AMBARI-23866.
---
   Resolution: Implemented
Fix Version/s: 2.7.0

Pull request merged into trunk.

Would make note of Robert's comment
{quote}Maybe a future enhancement will be to add properties so a user can 
adjust the number of retries and the timeout value between retries.
{quote}
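Conceptually, the merged retry behaves like the sketch below. This is illustrative pseudo-implementation, not the merged code: the 150-second cap and 8-second interval are placeholder values, and the injectable clock/sleep are there only so the logic is testable.

```python
import time

def kinit_with_retry(try_kinit, timeout=150, interval=8,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry try_kinit() until it succeeds or `timeout` seconds elapse.

    try_kinit: zero-argument callable returning True when kinit succeeds.
    """
    deadline = clock() + timeout
    while True:
        if try_kinit():
            return True   # the principal has propagated to this node's KDC
        if clock() >= deadline:
            return False  # replication is unhealthily slow: surface a failure
        sleep(interval)   # give KDC replication a chance to catch up
```

Making {{timeout}} and {{interval}} configuration properties would cover Robert's suggested enhancement.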

> Kerberos Service Check failure due to kinit failure on random node
> --
>
> Key: AMBARI-23866
> URL: https://issues.apache.org/jira/browse/AMBARI-23866
> Project: Ambari
>  Issue Type: Improvement
>Affects Versions: 2.5.2
> Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.7.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We were seeing Kerberos service check failures in Ambari. Specifically, the 
> check would fail during the first run of the day, succeed on the second, then 
> fail on the next but succeed if run again, and so forth.
> Reviewing the operation log showed kinit failures from random node(s):
>  {{kinit: Client  not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852:
> {quote}The service check must perform the following steps:
>    1. Create a unique principal in the relevant KDC (server)
>    2. Test that the principal can be used to authenticate via kinit (agent)
>    3. Destroy the principal (server)
> {quote}
> Which is a very good check of the services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers, all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had 
> a chance to propagate. This is why the second service check would succeed.
>  





[jira] [Resolved] (AMBARI-22642) LDAPS sync Connection Refused

2018-05-21 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga resolved AMBARI-22642.
---
Resolution: Fixed

Pull request merged into trunk.

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch, pull-request-available
> Fix For: 2.7.0
>
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Time Spent: 2h 10m
>  Remaining Estimate: 21h 50m
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, the LDAP client only uses pooled connections when connecting to 
> a directory server over plain LDAP. Enabling SSL disables the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+); this pushed us outside the "safe zone".
> Fix:
> Enable pooling for SSL connections by adding the argument below to the 
> startup options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-18 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481110#comment-16481110
 ] 

David F. Quiroga commented on AMBARI-23866:
---

We have about 10-20 KDC servers at 3-4 locations across the US. 

Analysis determined that it took about 1-2 minutes for a new principal to reach 
all KDCs in our environment. Basically, we started the service check and then 
LDAP-searched each host (in a loop) for the new principal. 

I selected values based on that but would be open to changing them, in 
either direction. 

If replication is taking more than 150 seconds, I think feedback to the 
user (i.e. a failure) is fair, as that seems like an unhealthy system. 
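The measurement loop described here might look roughly like the sketch below. The probe that checks one KDC for the principal is abstracted away (in practice it was an ldapsearch against that host), and the host names and principal in any real run would be site-specific.

```python
import time

def replication_lag(principal, kdc_hosts, has_principal,
                    poll=2.0, clock=time.monotonic, sleep=time.sleep):
    """Poll every KDC until all of them can see `principal`; return the
    elapsed seconds. has_principal(host, principal) -> bool does one probe
    (e.g. an ldapsearch against that host)."""
    start = clock()
    pending = set(kdc_hosts)
    while pending:
        # Drop hosts that can now see the principal; keep polling the rest.
        pending = {h for h in pending if not has_principal(h, principal)}
        if pending:
            sleep(poll)
    return clock() - start
```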

 

> Kerberos Service Check failure due to kinit failure on random node
> --
>
> Key: AMBARI-23866
> URL: https://issues.apache.org/jira/browse/AMBARI-23866
> Project: Ambari
>  Issue Type: Improvement
>Affects Versions: 2.5.2
> Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos service check failures in Ambari. Specifically, the 
> check would fail during the first run of the day, succeed on the second, then 
> fail on the next but succeed if run again, and so forth.
> Reviewing the operation log showed kinit failures from random node(s):
>  {{kinit: Client  not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852:
> {quote}The service check must perform the following steps:
>    1. Create a unique principal in the relevant KDC (server)
>    2. Test that the principal can be used to authenticate via kinit (agent)
>    3. Destroy the principal (server)
> {quote}
> Which is a very good check of the services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers, all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had 
> a chance to propagate. This is why the second service check would succeed.
>  





[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-18 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481055#comment-16481055
 ] 

David F. Quiroga commented on AMBARI-23866:
---

[~rlevas], that's a good point on the process design. 

 

What concerns remain with including a retry? 

 

> Kerberos Service Check failure due to kinit failure on random node
> --
>
> Key: AMBARI-23866
> URL: https://issues.apache.org/jira/browse/AMBARI-23866
> Project: Ambari
>  Issue Type: Improvement
>Affects Versions: 2.5.2
> Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos service check failures in Ambari. Specifically, the 
> check would fail during the first run of the day, succeed on the second, then 
> fail on the next but succeed if run again, and so forth.
> Reviewing the operation log showed kinit failures from random node(s):
>  {{kinit: Client  not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852:
> {quote}The service check must perform the following steps:
>    1. Create a unique principal in the relevant KDC (server)
>    2. Test that the principal can be used to authenticate via kinit (agent)
>    3. Destroy the principal (server)
> {quote}
> Which is a very good check of the services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers, all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had 
> a chance to propagate. This is why the second service check would succeed.
>  





[jira] [Comment Edited] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-18 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479304#comment-16479304
 ] 

David F. Quiroga edited comment on AMBARI-23866 at 5/18/18 6:51 PM:


[~rlevas] thanks for feedback.

Invalid password should result in {{Preauthentication failed while getting 
initial credentials}}, in this case we are seeing {{Client not found in 
Kerberos database}} which would indicate the principal provided does not exist. 
So I am not sure that the failure would trigger the use of the {{master-kdc}}.

Over the last year here at work they deployed new Active Directory Domain 
Controllers and retired the old ones. With that we learned that 
{{kerberos-env\ldap_url}} had been set to a single AD server rather than the DNS 
name. From that point on we really try to avoid referencing a single AD server. 

RE: latency of the replication process. I like the retry because, if the 
latency is small, the service check will not have to wait the maximum time, 
i.e. most users are not affected by the addition of the retry. And true, we 
can't guarantee that we are waiting long enough for every environment, but if 
it is taking 2+ minutes it should be fair to alert on that. 

 

Another thing we noticed is that if the test via kinit fails, the clean-up 
(Destroy the principal) does not happen. Meaning the principals are still out 
in AD and the keytabs are on the clients. Re-running the service check on the 
same day will succeed and clean those up. -but that is not an ideal process.- 


was (Author: quirogadf):
[~rlevas] thanks for feedback.

Invalid password should result in {{Preauthentication failed while getting 
initial credentials}}, in this case we are seeing {{Client not found in 
Kerberos database}} which would indicate the principal provided does not exist. 
So I am not sure that the failure would trigger the use of the {{master-kdc}}.

Over the last year here at work they deployed new Active Directory Domain 
Controllers and retired the old ones. With that we learned that 
{{kerberos-env\ldap_url}} had been set to a single AD server rather than the DNS 
name. From that point on we really try to avoid referencing a single AD server. 

RE: latency of the replication process. I like the retry because, if the 
latency is small, the service check will not have to wait the maximum time, 
i.e. most users are not affected by the addition of the retry. And true, we 
can't guarantee that we are waiting long enough for every environment, but if 
it is taking 2+ minutes it should be fair to alert on that. 

 

Another thing we noticed is that if the test via kinit fails, the clean-up 
(Destroy the principal) does not happen. Meaning the principals are still out 
in AD and the keytabs are on the clients. Re-running the service check on the 
same day will succeed and clean those up, but that is not an ideal process. 

> Kerberos Service Check failure due to kinit failure on random node
> --
>
> Key: AMBARI-23866
> URL: https://issues.apache.org/jira/browse/AMBARI-23866
> Project: Ambari
>  Issue Type: Improvement
>Affects Versions: 2.5.2
> Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos service check failures in Ambari. Specifically, the 
> check would fail during the first run of the day, succeed on the second, then 
> fail on the next but succeed if run again, and so forth.
> Reviewing the operation log showed kinit failures from random node(s):
>  {{kinit: Client  not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852:
> {quote}The service check must perform the following steps:
>    1. Create a unique principal in the relevant KDC (server)
>    2. Test that the principal can be used to authenticate via kinit (agent)
>    3. Destroy the principal (server)
> {quote}
> Which is a very good check of the services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers, all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had 
> a chance to propagate. This is why the second service check would succeed.
>  





[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-17 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479304#comment-16479304
 ] 

David F. Quiroga commented on AMBARI-23866:
---

[~rlevas] thanks for the feedback.

An invalid password should result in {{Preauthentication failed while getting 
initial credentials}}; in this case we are seeing {{Client not found in 
Kerberos database}}, which would indicate that the principal provided does not 
exist. So I am not sure the failure would trigger the use of the {{master-kdc}}.

Over the last year here at work they deployed new Active Directory Domain 
Controllers and retired the old ones. With that we learned that 
{{kerberos-env\ldap_url}} had been set to a single AD server rather than the 
DNS name. From that point on we really try to avoid referencing a single AD 
server. 

RE: latency of the replication process. I like the retry because, if the 
latency is small, the service check will not have to wait the maximum time, 
i.e. most users are not affected by the addition of the retry. And true, we 
can't guarantee that we are waiting long enough for every environment, but if 
it is taking more than 2 minutes it seems fair to alert on that. 

 

Another thing we noticed is that if the test via kinit fails, the clean-up 
(destroying the principal) does not happen, meaning the principals are still 
out in AD and the keytabs are on the clients. Re-running the service check on 
the same day will succeed and clean those up, but that is not an ideal process. 

> Kerberos Service Check failure due to kinit failure on random node
> --
>
> Key: AMBARI-23866
> URL: https://issues.apache.org/jira/browse/AMBARI-23866
> Project: Ambari
>  Issue Type: Improvement
>Affects Versions: 2.5.2
> Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos Service check failures in Ambari. Specifically, it 
> would fail during the first run of the day, succeed on the second, then fail 
> on the next, but succeed if run again, and so forth.
> Reviewing the operation log showed kinit failures from random node(s):
>  {{kinit: Client  not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852
> {quote}The service check must perform the following steps:
>    1.Create a unique principal in the relevant KDC (server)
>    2.Test that the principal can be used to authenticate via kinit (agent)
>    3.Destroy the principal (server)
> {quote}
> This is a very good check of the services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had 
> a chance to propagate. This is why the second service check would succeed.
>  





[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-16 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478180#comment-16478180
 ] 

David F. Quiroga commented on AMBARI-23866:
---

My first "fix" was to add a sleep before testing the principal. This worked, 
but I believe the better way is to add a retry to the Execute. 

Pull Request on the way.
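The retry-over-sleep idea can be sketched in plain Python. This is an illustration of the behavior only, not Ambari's actual resource_management code; the real change would use the retry support on the Execute resource (its tries/try_sleep parameters), and FakeKDC below is a hypothetical stand-in for a KDC that has not yet received the replicated principal:

```python
import time

def retry(action, tries=5, try_sleep=2, sleep=time.sleep):
    """Run action(), retrying on failure.

    Mirrors the tries/try_sleep idea: re-raise only after the
    final attempt fails, otherwise wait and try again.
    """
    for attempt in range(1, tries + 1):
        try:
            return action()
        except Exception:
            if attempt == tries:
                raise
            sleep(try_sleep)

# Hypothetical KDC whose principal "propagates" after a few attempts.
class FakeKDC:
    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def kinit(self):
        self.calls += 1
        if self.calls < self.ready_after:
            raise RuntimeError("kinit: Client not found in Kerberos database")
        return "OK"

kdc = FakeKDC(ready_after=3)
result = retry(kdc.kinit, tries=5, try_sleep=0)
```

With a fixed sleep, every environment waits the full delay; with the retry, fast-propagating environments succeed on an early attempt.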

> Kerberos Service Check failure due to kinit failure on random node
> --
>
> Key: AMBARI-23866
> URL: https://issues.apache.org/jira/browse/AMBARI-23866
> Project: Ambari
>  Issue Type: Improvement
>Affects Versions: 2.5.2
> Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>
> We were seeing Kerberos Service check failures in Ambari. Specifically, it 
> would fail during the first run of the day, succeed on the second, then fail 
> on the next, but succeed if run again, and so forth.
> Reviewing the operation log showed kinit failures from random node(s):
>  {{kinit: Client  not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852
> {quote}The service check must perform the following steps:
>    1.Create a unique principal in the relevant KDC (server)
>    2.Test that the principal can be used to authenticate via kinit (agent)
>    3.Destroy the principal (server)
> {quote}
> Which is a very good check of services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had 
> a chance to propagate. This is why the second service check would succeed.
>  





[jira] [Created] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

2018-05-16 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-23866:
-

 Summary: Kerberos Service Check failure due to kinit failure on 
random node
 Key: AMBARI-23866
 URL: https://issues.apache.org/jira/browse/AMBARI-23866
 Project: Ambari
  Issue Type: Improvement
Affects Versions: 2.5.2
 Environment: Multiple Kerberos Domain Controllers across multiple data 
centers for single realm.
Reporter: David F. Quiroga
Assignee: David F. Quiroga


We were seeing Kerberos Service check failures in Ambari. Specifically, it 
would fail during the first run of the day, succeed on the second, then fail on 
the next, but succeed if run again, and so forth.

Reviewing the operation log showed kinit failures from random node(s):
 {{kinit: Client  not found in Kerberos database while getting initial 
credentials}}

Since AMBARI-9852
{quote}The service check must perform the following steps:
   1.Create a unique principal in the relevant KDC (server)
   2.Test that the principal can be used to authenticate via kinit (agent)
   3.Destroy the principal (server)
{quote}
This is a very good check of the services.

So what is happening...

In our environment we have multiple Kerberos Domain Controllers across multiple 
data centers all providing the same realm.

The creation of a unique principal occurs at a single KDC and is propagated to 
the others.

The agents were testing the principal at a different KDC, i.e. before it had a 
chance to propagate. This is why the second service check would succeed.

 





[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2018-05-15 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Fix Version/s: 2.7.0
   Status: In Progress  (was: Patch Available)

Starting work on pull request into trunk.

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch
> Fix For: 2.7.0
>
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, ldap connection only uses pooled connections when connecting to 
> a directory server over LDAP. Enabling SSL causes it to disable the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+), this pushed us outside the "safe zone".
> Fix:
> Enable SSL connection pooling by adding the argument below to the startup 
> options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   
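The fix amounts to one extra line in the Ambari server's environment file. A sketch of the change (the ambari-env.sh path is typically /var/lib/ambari-server/ambari-env.sh, but verify it for your install; an ambari-server restart is needed afterwards):

```shell
# In ambari-env.sh, extend the server JVM arguments so the JNDI LDAP
# provider keeps pooling connections even when SSL is enabled.
export AMBARI_JVM_ARGS="$AMBARI_JVM_ARGS -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'"
```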





[jira] [Assigned] (AMBARI-22642) LDAPS sync Connection Refused

2018-05-15 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga reassigned AMBARI-22642:
-

Assignee: David F. Quiroga

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Assignee: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, ldap connection only uses pooled connections when connecting to 
> a directory server over LDAP. Enabling SSL causes it to disable the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+), this pushed us outside the "safe zone".
> Fix:
> Enable SSL connection pooling by adding the argument below to the startup 
> options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused

2018-05-01 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459800#comment-16459800
 ] 

David F. Quiroga commented on AMBARI-22642:
---

Can I get access to assign this to myself? I plan to create a pull request. 

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, ldap connection only uses pooled connections when connecting to 
> a directory server over LDAP. Enabling SSL causes it to disable the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+), this pushed us outside the "safe zone".
> Fix:
> Enable SSL connection pooling by adding the argument below to the startup 
> options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Updated] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed

2018-04-04 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-23382:
--
Affects Version/s: (was: 2.5.3)
   2.6.1
  Environment: Python 2.7.5-58 or greater

> Ambari-server sync-ldap: Sync event creation failed
> ---
>
> Key: AMBARI-23382
> URL: https://issues.apache.org/jira/browse/AMBARI-23382
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.6.1, trunk, 2.6.2
> Environment: Python 2.7.5-58 or greater
>Reporter: David F. Quiroga
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> As described here [ambari-server sync-ldap no longer 
> working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html]
>  sync-ldap fails with 
> {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' 
> doesn't match }}
> As pointed out by [Berry Osterlund 
> |https://community.hortonworks.com/users/13196/berryosterlund.html], this is 
> because the default behavior for SSL cert verification changed in Python 
> 2.7.5-58. 
>  
> ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'"
> Thinking we can provide the full hostname dynamically.





[jira] [Commented] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed

2018-03-27 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416264#comment-16416264
 ] 

David F. Quiroga commented on AMBARI-23382:
---

Started a pull request

[https://github.com/apache/ambari/pull/806]

> Ambari-server sync-ldap: Sync event creation failed
> ---
>
> Key: AMBARI-23382
> URL: https://issues.apache.org/jira/browse/AMBARI-23382
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.3, trunk, 2.6.2
>Reporter: David F. Quiroga
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> As described here [ambari-server sync-ldap no longer 
> working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html]
>  sync-ldap fails with 
> {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' 
> doesn't match }}
> As pointed out by [Berry Osterlund 
> |https://community.hortonworks.com/users/13196/berryosterlund.html], this is 
> because the default behavior for SSL cert verification changed in Python 
> 2.7.5-58. 
>  
> ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'"
> Thinking we can provide the full hostname dynamically.





[jira] [Commented] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed

2018-03-27 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416167#comment-16416167
 ] 

David F. Quiroga commented on AMBARI-23382:
---

Starting work on this, but I don't think I have access to assign it to myself 
or update the status. 

> Ambari-server sync-ldap: Sync event creation failed
> ---
>
> Key: AMBARI-23382
> URL: https://issues.apache.org/jira/browse/AMBARI-23382
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.3, trunk, 2.6.2
>Reporter: David F. Quiroga
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> As described here [ambari-server sync-ldap no longer 
> working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html]
>  sync-ldap fails with 
> {{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' 
> doesn't match }}
> As pointed out by [Berry Osterlund 
> |https://community.hortonworks.com/users/13196/berryosterlund.html], this is 
> because the default behavior for SSL cert verification changed in Python 
> 2.7.5-58. 
>  
> ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'"
> Thinking we can provide the full hostname dynamically.





[jira] [Created] (AMBARI-23382) Ambari-server sync-ldap: Sync event creation failed

2018-03-27 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-23382:
-

 Summary: Ambari-server sync-ldap: Sync event creation failed
 Key: AMBARI-23382
 URL: https://issues.apache.org/jira/browse/AMBARI-23382
 Project: Ambari
  Issue Type: Bug
  Components: ambari-server
Affects Versions: 2.5.3, trunk, 2.6.2
Reporter: David F. Quiroga


As described here [ambari-server sync-ldap no longer 
working|https://community.hortonworks.com/questions/119756/ambari-server-sync-ldap-no-longer-working.html]
 sync-ldap fails with 

{{REASON: Sync event creation failed. Error details: hostname '127.0.0.1' 
doesn't match }}

As pointed out by [Berry Osterlund 
|https://community.hortonworks.com/users/13196/berryosterlund.html], this is 
because the default behavior for SSL cert verification changed in Python 
2.7.5-58. 

ambari_server/serverUtils.py hardcodes "SERVER_API_HOST = '127.0.0.1'"

Thinking we can provide the full hostname dynamically.
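A minimal sketch of the idea: derive the API host from the machine's FQDN instead of the hardcoded loopback address, so the hostname the client presents matches the server certificate. The function name and port below are illustrative, not the actual serverUtils.py code:

```python
import socket

# serverUtils.py hardcodes SERVER_API_HOST = '127.0.0.1'; with strict
# certificate checking, the cert's CN/SAN will not match that address.
# Building the URL from the host's FQDN instead (hypothetical helper):
def server_api_url(port=8443, secure=True):
    host = socket.getfqdn()          # e.g. ambari.example.com
    scheme = "https" if secure else "http"
    return "%s://%s:%d/api/v1" % (scheme, host, port)

url = server_api_url()
```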





[jira] [Created] (AMBARI-23026) WEB type alerts authentication in Kerberos secured cluster

2018-02-19 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-23026:
-

 Summary: WEB type alerts authentication in Kerberos secured cluster
 Key: AMBARI-23026
 URL: https://issues.apache.org/jira/browse/AMBARI-23026
 Project: Ambari
  Issue Type: Bug
  Components: alerts
Affects Versions: 2.5.2, trunk, 2.6.2
 Environment: Ambari 2.5.2

Hortonworks HDP-2.5.3.0-37
Reporter: David F. Quiroga


In a Kerberized cluster some web endpoints (App Timeline Web UI, ResourceManager 
Web UI, etc.) require authentication. Any Ambari alerts checking those 
endpoints must then be able to authenticate.

This was addressed in AMBARI-9586; however, the default principal and keytab 
used in the alerts.json are those of the "bare" SPNEGO principal 
HTTP/_HOST@REALM. 
 My understanding is that the HTTP service principal is used to authenticate 
users to a service, not to authenticate to another service.

1. Since most endpoints involved are Web UI, would it be more appropriate to 
use the smokeuser in the alerts?

2. This was first observed in Ranger Audit: the YARN Ranger Plug-in showed many 
access-denied events from the HTTP user. [This 
post|https://community.hortonworks.com/content/supportkb/150206/ranger-audit-logs-refers-to-access-denied-for-http.html]
 provided some direction as to where those requests were coming from. We have 
updated the ResourceManager Web UI alert definition to use 
cluster-env/smokeuser_keytab and cluster-env/smokeuser_principal_name, and this 
has resolved the initial HTTP access denied. 
 Would it also be advisable to make the change in the other secure Web UI alert 
definitions?
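For illustration, the uri section of a WEB alert definition after swapping in the smokeuser credentials might look roughly like the fragment below. The property placeholders follow the pattern Ambari uses in alerts.json, but the exact key names and yarn-site properties should be verified against the alert definition shipped with your stack:

```json
{
  "uri": {
    "http": "{{yarn-site/yarn.resourcemanager.webapp.address}}",
    "https": "{{yarn-site/yarn.resourcemanager.webapp.https.address}}",
    "https_property": "{{yarn-site/yarn.http.policy}}",
    "https_property_value": "HTTPS_ONLY",
    "kerberos_keytab": "{{cluster-env/smokeuser_keytab}}",
    "kerberos_principal": "{{cluster-env/smokeuser_principal_name}}",
    "connection_timeout": 5.0
  }
}
```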





[jira] [Created] (AMBARI-22708) Ranger HDFS logging health Ambari Alert

2017-12-28 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-22708:
-

 Summary: Ranger HDFS logging health Ambari Alert
 Key: AMBARI-22708
 URL: https://issues.apache.org/jira/browse/AMBARI-22708
 Project: Ambari
  Issue Type: New Feature
  Components: alerts
 Environment: HDP 2.5.3.0
Reporter: David F. Quiroga
Priority: Trivial
 Attachments: alert_ranger_hdfs_logging.json, 
alert_ranger_knox_logging.json, alert_ranger_logging.py

First some background:

We were directed to retain audit/access records "forever" (technically 7 years 
but that is basically forever in electronic log time). 

Each Hadoop component generates local audit logs as per their log4j settings. 
In our production system these logs would frequently fill up the disk. At first 
we would just compress them in place, but that only works for so long, and 
there was no redundancy with local disk storage. In other words, no long-term 
plan. 

We started to discuss moving them to HDFS or a different storage solution. One 
of our team members pointed out the Ranger plugins are already logging the 
"same data" into HDFS. 
Probably after several meetings with the higher-ups, using the Ranger logs as 
the record of truth was approved. Components' log4j settings were updated to 
purge data automatically. 

Purging local logs felt like operating without a safety net. 
Thought it would be good to check that Ranger was successfully logging to HDFS 
each day. Should mention this is a kerberized cluster, not that anything ever 
goes wrong with kerberos.  
*Checking this would have certainly been possible with a shell script, but we 
have been pushing to centralize warnings/alerts in Ambari. And so an Ambari 
alert python script to check on Ranger Logging Health was crafted.*

For the most part the alert was modeled after some of the hive alerts. 
At the moment it just checks that the daily /ranger/audit/ HDFS 
directory has been created. 

I am attaching the host script and the alert.json for HDFS and Knox components. 
In the alert.json, service_name and component_name should be set to local 
values. 
Everything else should "work out of the box". 
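The core of such an alert can be sketched as below. This is a simplified illustration, not the attached script: the /ranger/audit/<component>/<yyyymmdd> layout is an assumption that depends on the Ranger plugin's audit configuration, and hdfs_dir_exists stands in for whatever HDFS client call the real alert uses:

```python
from datetime import date

# Sketch of the alert's core check: does today's Ranger audit directory
# exist in HDFS? Path layout and helper are hypothetical placeholders.
def check_ranger_logging(component, hdfs_dir_exists, today=None):
    today = today or date.today()
    audit_dir = "/ranger/audit/%s/%s" % (component, today.strftime("%Y%m%d"))
    if hdfs_dir_exists(audit_dir):
        return ("OK", ["Found audit directory %s" % audit_dir])
    return ("CRITICAL", ["Missing audit directory %s" % audit_dir])

# Example with a stubbed HDFS lookup instead of a real client:
existing = {"/ranger/audit/hdfs/20171228"}
status, text = check_ranger_logging(
    "hdfs", existing.__contains__, today=date(2017, 12, 28))
```

Returning an (alert state, message list) pair follows the shape Ambari alert scripts report, which is why the check is written as a function the dispatcher can call per component.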




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-14 Thread David F. Quiroga (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290996#comment-16290996
 ] 

David F. Quiroga commented on AMBARI-22642:
---

New tests shouldn't be needed. Only updating JVM opts here. 

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, ldap connection only uses pooled connections when connecting to 
> a directory server over LDAP. Enabling SSL causes it to disable the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+), this pushed us outside the "safe zone".
> Fix:
> Enable SSL connection pooling by adding the argument below to the startup 
> options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-14 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Attachment: ambari-22642.patch

Regenerated the patch using git diff

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch
> Attachments: ambari-22642.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, ldap connection only uses pooled connections when connecting to 
> a directory server over LDAP. Enabling SSL causes it to disable the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+), this pushed us outside the "safe zone".
> Fix:
> Enable SSL connection pooling by adding the argument below to the startup 
> options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-14 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Attachment: (was: ambari-22642.patch)

> LDAPS sync Connection Refused 
> --
>
> Key: AMBARI-22642
> URL: https://issues.apache.org/jira/browse/AMBARI-22642
> Project: Ambari
>  Issue Type: Bug
>  Components: ambari-server
>Affects Versions: 2.5.0
> Environment: java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> AD Domain Controllers 
> LDAP v.3
> 2012 R2 OS 
>Reporter: David F. Quiroga
>Priority: Minor
>  Labels: easyfix, patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Ambari server configured to use "secure" ldap authentication. 
> authentication.ldap.primaryUrl=:636
> authentication.ldap.useSSL=true
>  We call the ldap_sync_events REST endpoint frequently to synchronize 
> existing groups and a specific list of groups. We had no issues with this 
> until mid-October, at which point we began to see:
> {code}
> "status" : "ERROR",
> "status_detail" : "Caught exception running LDAP sync. simple bind 
> failed: **:636; nested exception is 
> javax.naming.CommunicationException: simple bind failed: **:636 [Root 
> exception is java.net.SocketException: Connection reset]",
> {code}
> Troubleshooting: 
> * We saw random success and failure when attempting to sync a single group. 
> * With useSSL=false and an updated port ldap sync was consistently successful.
> Cause:
> * By default, ldap connection only uses pooled connections when connecting to 
> a directory server over LDAP. Enabling SSL causes it to disable the pooling, 
> resulting in poorer performance and failures due to connection resets. 
> * Around mid-October we increased the number of groups defined on the system 
> (50+), this pushed us outside the "safe zone".
> Fix:
> Enable SSL connection pooling by adding the argument below to the startup 
> options.
> -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'
> Reference: 
> [https://confluence.atlassian.com/jirakb/connecting-jira-to-active-directory-over-ldaps-fails-with-connection-reset-763004137.htm]
> [https://docs.oracle.com/javase/jndi/tutorial/ldap/connect/config.html]
>   





[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-13 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Attachment: (was: ambari-env.patch)



[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-13 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Attachment: ambari-22642.patch

Change filename



[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-13 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Attachment: ambari-env.patch



[jira] [Created] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-13 Thread David F. Quiroga (JIRA)
David F. Quiroga created AMBARI-22642:
-

 Summary: LDAPS sync Connection Refused 
 Key: AMBARI-22642
 URL: https://issues.apache.org/jira/browse/AMBARI-22642
 Project: Ambari
  Issue Type: Bug
  Components: ambari-server
Affects Versions: 2.5.0
 Environment: java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-tdc1-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

AD Domain Controllers 
LDAP v.3
2012 R2 OS 
Reporter: David F. Quiroga
Priority: Minor




[jira] [Updated] (AMBARI-22642) LDAPS sync Connection Refused

2017-12-13 Thread David F. Quiroga (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David F. Quiroga updated AMBARI-22642:
--
Status: Patch Available  (was: Open)

Add javaopts to /var/lib/ambari-server/ambari-env.sh
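A minimal sketch of that change, assuming the ambari-env.sh path named in the comment above; the exact shape of the existing AMBARI_JVM_ARGS assignment in your installation may differ:

```shell
# /var/lib/ambari-server/ambari-env.sh
# Append the pooling protocol list to the Ambari server JVM options so that
# SSL (LDAPS) connections are pooled in addition to plain ones. The single
# quotes keep "plain ssl" as one property value when the command is built.
export AMBARI_JVM_ARGS="$AMBARI_JVM_ARGS -Dcom.sun.jndi.ldap.connect.pool.protocol='plain ssl'"
```

An `ambari-server restart` would be needed for the new JVM option to take effect.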
