[ 
https://issues.apache.org/jira/browse/USERGRID-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Johnson updated USERGRID-1283:
------------------------------------
    Description: 
Sometimes during Usergrid operation there is a temporary problem connecting to 
Cassandra. This can cause some requests to fail (with HTTP 500) and cause bad 
values (e.g. EntityManagers with application = null) to be cached. If 
connectivity problems happen during startup, Usergrid may start but be unable 
to respond to requests without errors. 

Usergrid should be more resilient to such temporary connectivity problems:

  - Change startup to retry Cassandra until it becomes available
  - Cache a copy of the ManagementApp because it's needed for almost every request
  - Change cache to prevent caching of bad EntityManagers
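
The first item could look roughly like the sketch below. This is only an 
illustration of a bounded retry loop, not the code in the PR: the class name 
StartupRetry, the method retry() and its parameters (maxRetries, 
intervalMillis) are all hypothetical, and the Callable stands in for whatever 
call first touches Cassandra during startup.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Usergrid. */
final class StartupRetry {

    private StartupRetry() {}

    /**
     * Calls the action until it succeeds or maxRetries attempts have failed.
     * Sleeps intervalMillis between attempts and wraps the last failure.
     */
    static <T> T retry(Callable<T> action, int maxRetries, long intervalMillis)
            throws Exception {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry if the thread is being shut down.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(intervalMillis);
                }
            }
        }
        // Fail loudly rather than continue with a half-initialized instance.
        throw new IllegalStateException(
                "gave up after " + maxRetries + " attempts", last);
    }
}
```

Throwing once the retries are exhausted is a deliberate choice in this sketch: 
it keeps the instance from coming up in a state where it answers requests only 
with errors.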

Test these scenarios:

  - case 1: startup with no Cassandra running
  - case 2: startup with Cassandra starting 30s after Usergrid starts
  - case 3: startup where Cassandra goes down after start of Lock Manager, but 
before EMF init
       - case 3.1: Cassandra comes back before max retries
       - case 3.2: Cassandra never comes back

PR is ready for review here: https://github.com/apache/usergrid/pull/528

  was:
Sometimes on Usergrid startup there is a failure contacting Cassandra, either 
an immediate communications failure or a time-out.

In some cases when this happens, the ServiceManager.init() method cannot 
retrieve the internal Management Application that holds information about 
Usergrid orgs, apps and admin users.  

We added some retry logic to the ServiceManager.init() method, which is not an 
ideal fix because that method is also invoked in processing of HTTP requests.  
Problem is, if the retries do not work we end up with an instance of 
Usergrid that is alive and able to respond to /status requests, but everything 
else fails.

We should fix this by:

1) Moving the Management App lookup (and retry logic) to a much earlier point 
in the startup process. 

2) Caching the Management App in some place where all threads can get it. This 
cache should never be allowed to be null.  We always need to be able to fall 
back to a recent version of the Management App.
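
A rough sketch of what such a never-null, fall-back cache might look like is 
shown here. It is an assumption-laden illustration, not Usergrid code: the 
class LastKnownGood, its refresh() method and the validity Predicate are all 
hypothetical names. The holder is seeded with a good value up front and only 
replaces it when a reload produces another good value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

/** Hypothetical holder for illustration; not part of Usergrid. */
final class LastKnownGood<T> {

    private final AtomicReference<T> current = new AtomicReference<>();
    private final Predicate<T> isValid;

    /** The initial value must already be valid, so get() never returns null. */
    LastKnownGood(T initial, Predicate<T> isValid) {
        if (initial == null || !isValid.test(initial)) {
            throw new IllegalArgumentException("initial value must be valid");
        }
        this.isValid = isValid;
        this.current.set(initial);
    }

    /**
     * Tries to reload the value. Keeps the previous value if the loader
     * throws or returns null or an invalid value.
     * Returns true only if the cached value was replaced.
     */
    boolean refresh(Callable<T> loader) {
        try {
            T fresh = loader.call();
            if (fresh != null && isValid.test(fresh)) {
                current.set(fresh);
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // Assumption: a failed reload is tolerable; keep the old value.
        }
        return false;
    }

    /** Safe to call from any thread; returns the most recent good value. */
    T get() {
        return current.get();
    }
}
```

With this shape, a temporary Cassandra failure during a reload leaves readers 
with the last good copy instead of a null or half-populated one.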




> Improve ServiceManager.init() start-up logic
> --------------------------------------------
>
>                 Key: USERGRID-1283
>                 URL: https://issues.apache.org/jira/browse/USERGRID-1283
>             Project: Usergrid
>          Issue Type: Improvement
>    Affects Versions: 2.1.0
>            Reporter: David Johnson
>             Fix For: 2.1.1
>
>
> Sometimes during Usergrid operation there is a temporary problem connecting 
> to Cassandra. This can cause some requests to fail (with HTTP 500) and cause 
> bad values (e.g. EntityManagers with application = null) to be cached. If 
> connectivity problems happen during startup, Usergrid may start but be unable 
> to respond to requests without errors. 
> Usergrid should be more resilient to such temporary connectivity problems:
>   - Change startup to retry Cassandra until it becomes available
>   - Cache a copy of the ManagementApp because it's needed for almost every request
>   - Change cache to prevent caching of bad EntityManagers
> Test these scenarios:
>   - case 1: startup with no Cassandra running
>   - case 2: startup with Cassandra starting 30s after Usergrid starts
>   - case 3: startup where Cassandra goes down after start of Lock Manager, 
> but before EMF init
>        - case 3.1: Cassandra comes back before max retries
>        - case 3.2: Cassandra never comes back
> PR is ready for review here: https://github.com/apache/usergrid/pull/528



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
