There's already a ticket: https://issues.apache.org/jira/browse/USERGRID-1051

Just as an FYI: when we load tested Usergrid to upwards of 10k TPS, we were
using a search queue of 5000. In fact, we made all of the queues have a size
of 5000, including the bulk queue.
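For reference, a minimal sketch of what those thread pool settings look like
in elasticsearch.yml for the ES 1.x line (the 5000 queue sizes match the note
above, and the search pool size of 320 matches the thread count Jaskaran
mentions below; treat both as starting points to tune for your hardware):

    # Fixed-size search pool: 'size' bounds concurrent searches,
    # 'queue_size' bounds how many requests may wait before being rejected.
    threadpool.search.type: fixed
    threadpool.search.size: 320
    threadpool.search.queue_size: 5000

    # Bulk and index pools sized the same way.
    threadpool.bulk.queue_size: 5000
    threadpool.index.queue_size: 5000

Per the 1.x docs linked later in this thread, these thread pool settings are
dynamic and can also be applied via the cluster settings update API instead
of restarting each node.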
On Thu, Dec 10, 2015 at 7:17 AM, Jaskaran Singh <
jaskaran.si...@comprotechnologies.com> wrote:

> Hi Michael,
>
> I am providing an update on my situation. We have changed our application
> logic to minimize the use of queries (i.e. calls with "ql=...") in
> Usergrid 2.x. This seems to have provided significant benefit, and all
> the problems reported below seem to have disappeared.
>
> To some extent this is good news. However, we were lucky that we were
> able to work around the logic, and we would like to understand any
> limitations or best practices around the use of queries (which are
> serviced by Elasticsearch in Usergrid 2.x) under high-load situations.
>
> Also, please let me know if there is an existing Jira issue for
> addressing the empty entity response when Elasticsearch is overloaded,
> or should I add one?
>
> Thanks in advance,
>
> Thanks
> Jaskaran
>
>
> On Tue, Dec 8, 2015 at 6:00 PM, Jaskaran Singh <
> jaskaran.si...@comprotechnologies.com> wrote:
>
>> Hi Michael,
>>
>> This makes sense. I can confirm that while we have been seeing missing
>> entity errors under high load, these automatically resolve themselves
>> as the load decreases.
>>
>> Another anomaly we have noticed is that Usergrid responds with a 401
>> code and the message "Unable to authenticate OAuth credentials" for
>> certain users' credentials under high load, and the same credentials
>> work fine after the load reduces. Can we assume that this issue
>> (intermittent invalid credentials) has the same underlying root cause
>> (i.e. Elasticsearch is not responding)? Below are a few examples of the
>> error_description for such 401 errors:
>> 1. 'invalid username or password'
>> 2. 'Unable to authenticate OAuth credentials'
>> 3. 'Unable to authenticate due to corrupt access token'
>>
>> Regarding your suggestion to increase the search thread pool queue
>> size: we were already using a setting of 1000 (with 320 threads).
>> Should we consider increasing this further, or simply provide
>> additional resources (CPU / RAM) to the ES process?
>>
>> Additionally, we are also seeing Cassandra connection timeouts,
>> specifically the exceptions below, under high-load conditions:
>>
>> ERROR stage.write.WriteCommit.call(132)<Usergrid-Collection-Pool-12>-
>> Failed to execute write asynchronously
>> com.netflix.astyanax.connectionpool.exceptions.TimeoutException:
>> TimeoutException: [host=10.0.0.237(10.0.0.237):9160, latency=2003(2003),
>> attempts=1]org.apache.thrift.transport.TTransportException:
>> java.net.SocketTimeoutException: Read timed out
>>
>> These exceptions occur even though OpsCenter was reporting medium load
>> on our cluster. Is there a way to tune the Astyanax library? Please let
>> us know if you have any recommendations in this area.
>>
>> Thanks a lot for the help.
>>
>> Thanks
>> Jaskaran
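On the Astyanax timeouts quoted above: Usergrid builds its Cassandra client
from its own properties files, so the exact property names exposed there
need checking, but at the Astyanax level the relevant knobs look roughly
like the sketch below. The seed host, cluster and keyspace names, and pool
sizes are illustrative, not Usergrid's actual defaults; the latency=2003 in
the stack trace suggests a ~2 second socket timeout is in effect at some
layer.

    import com.netflix.astyanax.AstyanaxContext;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
    import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
    import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
    import com.netflix.astyanax.thrift.ThriftFamilyFactory;

    public class PoolTuningSketch {
        public static void main(String[] args) {
            AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("Test Cluster")            // illustrative
                .forKeyspace("Usergrid_Applications")  // illustrative
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                    .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                .withConnectionPoolConfiguration(
                    new ConnectionPoolConfigurationImpl("UsergridPool")
                        .setPort(9160)
                        .setSeeds("10.0.0.237:9160")
                        // Per-request read timeout in ms; the trace above
                        // shows a ~2s limit being hit.
                        .setSocketTimeout(5000)
                        .setConnectTimeout(2000)
                        // More connections per node reduces client-side
                        // queueing under load.
                        .setMaxConnsPerHost(20))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
            context.start();
            Keyspace keyspace = context.getClient();
            System.out.println("Connected to " + keyspace.getKeyspaceName());
        }
    }

Note that raising the socket timeout only hides server-side latency, so it
is worth correlating these timeouts with GC pauses or compactions on the
Cassandra nodes as well.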
>> On Mon, Dec 7, 2015 at 2:29 AM, Michael Russo <michaelaru...@gmail.com>
>> wrote:
>>
>>> Here are a couple of things to check:
>>>
>>> 1) Can you query all of these entities out when the system is not
>>> under load?
>>> 2) Elasticsearch has a search queue for index query requests (
>>> https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-threadpool.html).
>>> When this is full, the searches are rejected. Currently Usergrid
>>> surfaces this as no results returned, rather than "unable to query"
>>> or some other identifying error message (we're aware and plan to fix
>>> this in the future). Try increasing the queue size to 1000. You might
>>> have delayed results, but you can prevent empty results for data
>>> that's known to be in the index.
>>>
>>> Thanks.
>>> -Michael R.
>>>
>>> On Dec 5, 2015, at 07:07, Jaskaran Singh <
>>> jaskaran.si...@comprotechnologies.com> wrote:
>>>
>>> Hello All,
>>>
>>> We are testing Usergrid 2.x (master branch) for our application,
>>> which was previously being prototyped on Usergrid 1.x. We are
>>> noticing some weird anomalies which cause errors in our application,
>>> which otherwise works fine against Usergrid 1.x. Specifically, we are
>>> seeing empty responses when querying custom collections for a
>>> particular entity record. The following is an example of one such
>>> query:
>>>
>>> http://server-name/b2perf1/default/userdata?client_id=
>>> <...>&client_secret=<....>&ql=userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'
>>>
>>> In the above scenario, we are querying a custom collection
>>> "userdata". Under high-load conditions (performance tests), this
>>> query starts returning an empty entities array (see below), even
>>> though this entity did exist at one time and we have no code / logic
>>> to delete entities.
>>>
>>> {
>>>   "action": "get",
>>>   "application": "0f7a2396-9826-11e5-ba08-0a75091e6d25",
>>>   "params": {
>>>     "ql": [
>>>       "userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'"
>>>     ]
>>>   },
>>>   "path": "/userdata",
>>>   "uri": "http://localhost:8080/b2perf1/default/userdata",
>>>   "entities": [],
>>>   "timestamp": 1449322746733,
>>>   "duration": 1053,
>>>   "organization": "b2perf1",
>>>   "applicationName": "default",
>>>   "count": 0
>>> }
>>>
>>> This has been happening quite randomly / intermittently, and we have
>>> not been able to isolate any reproduction steps beyond running load /
>>> performance tests until the problem eventually shows up. Note that
>>> the entities were created prior to the load test, and we can confirm
>>> that they existed before running it.
>>>
>>> We have never noticed this issue for non-query calls (i.e. calls that
>>> do not directly provide a field to query on).
>>>
>>> Our suspicion is that while these records do exist in Cassandra
>>> (because we have never deleted them), the Elasticsearch index is not
>>> in sync or is not functioning properly. How do we go about debugging
>>> this problem? Is there any particular logging or metric that we can
>>> check to confirm that the Elasticsearch index is up to date with the
>>> changes in Cassandra?
>>>
>>> Any other suggestions will be greatly appreciated.
>>>
>>> Thanks
>>> Jaskaran
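On the last question (how to confirm the index is keeping up): assuming the
stock ES REST port, the per-node thread pool counters are a quick signal. A
climbing search.rejected during a load test means queries are being dropped,
which is exactly what Usergrid currently surfaces as an empty entities
array:

    curl 'http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected,bulk.rejected'

As a rough cross-check, docs.count from
curl 'http://localhost:9200/_cat/indices?v' can be compared against the
expected entity count in Cassandra, though the mapping from entities to
index documents isn't necessarily one-to-one and this won't catch
individual stale documents.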