Matt, I think further investigation is warranted. I really think you need to find a way to trace through the code and find where the slowdown is occurring. That will help us narrow down what the problem is. I know it is production, but getting some code on there that starts timing method calls and such can be very useful.
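Ryan's suggestion of dropping lightweight timing code into production could be sketched roughly like this. This is purely illustrative; the class and method names are hypothetical, not from Shindig, and the 500 ms threshold is an assumption:

```java
import java.util.function.Supplier;

// Minimal sketch of in-production call timing: wrap a suspect method call,
// measure elapsed time with System.nanoTime, and log only the slow ones so
// the logging itself stays cheap. Names and the threshold are illustrative.
public class Timed {
    private static final long SLOW_MS = 500; // assumed "slow call" threshold

    public static <T> T time(String label, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > SLOW_MS) {
                System.err.println(label + " took " + elapsedMs + " ms");
            }
        }
    }

    public static void main(String[] args) {
        // Example: time a stand-in for a suspect call site.
        String result = time("fetchGadgetSpec", () -> "spec");
        System.out.println(result);
    }
}
```

Wrapping a handful of candidate call sites this way (the RPC dispatch, the fetcher, the cache lookups) would show which layer the 12-second delays accumulate in.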
On Tue, Jul 15, 2014 at 3:04 PM, Merrill, Matt <mmerr...@mitre.org> wrote:
> Hi Ryan,
>
> Thanks for responding!
>
> I've attached our ehcacheConfig; comparing it to the default
> configuration, the only differences are the overall number of elements
> (10000 in ours vs 1000 in the default) and the temp disk store location.
>
> I'm assuming you are asking whether each user in our system has the exact
> same set of gadgets to render. If so: different users have different sets
> of gadgets, but many of them have a default set we give them when they
> are initially set up in our system, so many people will hit the same
> gadgets over and over again. This default subset is about 10-12 different
> gadgets, and that is by and large what many users have.
>
> However, we have a total of 48 different gadgets that could be rendered
> by a user at any given time on this instance of shindig. We do run
> another instance of shindig which could render a different subset of
> gadgets, but that has much lower usage and only renders about 10
> different gadgets altogether.
>
> I am admittedly rusty with my ehCache configuration knowledge, but here
> are a couple of things I noticed:
> * The maxBytesLocalHeap in the ehCacheConfig is 50mb, which seems low;
>   however, this is the same setting we had in shindig 2.0, so I have to
>   wonder if that has anything to do with it.
> * Our old ehCache configuration for shindig 2.0 specified a defaultCache
>   maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
> * Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
>   maxDepth of 10000 but NO defaultCache maxElementsInMemory.
>
> Our heap sizes in Tomcat are 2048mb, which, given a 50mb max heap for a
> cache, seems adequate. This is the same heap size from when we were
> using shindig 2.0.
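For readers following along, the contrast Matt describes between the two configurations might look roughly like this in ehcache.xml. This is a sketch reconstructed from the details in this thread, not the actual attached file; all omitted attributes are assumptions:

```xml
<!-- Sketch of the old (shindig 2.0) style: count-based sizing only -->
<ehcache>
  <defaultCache maxElementsInMemory="1000" eternal="false"/>
</ehcache>
```

```xml
<!-- Sketch of the new (shindig 2.5) style: byte-based sizing plus a
     sizeOfPolicy. maxDepth controls how many object references the
     sizeOf engine will walk when computing an entry's size. -->
<ehcache maxBytesLocalHeap="50m">
  <sizeOfPolicy maxDepth="10000" maxDepthExceededBehavior="abort"/>
  <defaultCache eternal="false"/>
</ehcache>
```

One thing worth noting about this difference: with byte-based sizing, Ehcache walks each cached object's graph to compute its size on every put, and that walk itself costs CPU; a maxDepth of 10000 allows a very deep traversal, which could matter under load in a way that count-based sizing never did.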
> Unfortunately, we don't have profiling tools enabled on our Tomcat
> instances, so I can't see what the heap looked like when things crashed,
> and, like I said, we're unable to reproduce this in int.
>
> I think we might be on to something here… I will keep searching, but if
> any devs out there have any ideas, please let me know.
>
> Thanks shindig list!
> -Matt
>
> On 7/13/14, 10:12 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:
>
>> Matt, can you tell us more about how you have configured the caches in
>> shindig? When you are rendering these gadgets, are you rendering the
>> same gadget across all users?
>>
>> -Ryan
>>
>>> On Jul 9, 2014, at 3:31 PM, "Merrill, Matt" <mmerr...@mitre.org> wrote:
>>>
>>> Stanton,
>>>
>>> Thanks for responding!
>>>
>>> This is one instance of shindig.
>>>
>>> If you mean the configuration within the container and for the shindig
>>> java app, then yes, the locked domains are the same. In fact, the
>>> configuration, with the exception of shindig's host URLs, is exactly
>>> the same from what I can tell.
>>>
>>> Unfortunately, I don't have any way to trace that exact message, but I
>>> did do a traceroute from the server running shindig to the URL that is
>>> being called for rpc calls, to make sure there weren't any extra
>>> network hops. There weren't; it actually only had one, as expected for
>>> an app making an HTTP call to itself.
>>>
>>> Thanks again for responding.
>>>
>>> -Matt
>>>
>>>> On 7/9/14, 3:08 PM, "Stanton Sievers" <ssiev...@apache.org> wrote:
>>>>
>>>> Hi Matt,
>>>>
>>>> Is the configuration for locked domains and security tokens
>>>> consistent between your test and production environments?
>>>>
>>>> Do you have any way of tracing the request in the log entry you
>>>> provided through the network? Is this a single Shindig server, or is
>>>> there any load balancing occurring?
>>>>
>>>> Regards,
>>>> -Stanton
>>>>
>>>>
>>>>> On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt <mmerr...@mitre.org>
>>>>> wrote:
>>>>>
>>>>> Hi shindig devs,
>>>>>
>>>>> We are in the process of upgrading from shindig 2.0 to 2.5-update1,
>>>>> and everything has gone OK; however, once we got into our production
>>>>> environment, we are seeing significant slowdowns in the opensocial
>>>>> RPC calls that shindig makes to itself when rendering a gadget.
>>>>>
>>>>> This is obviously very dependent on how we've implemented the
>>>>> shindig interfaces in our own code, and also on our infrastructure,
>>>>> so we're hoping someone on the list can give us more ideas for areas
>>>>> to investigate, inside shindig itself or in general.
>>>>>
>>>>> Here's what's happening:
>>>>> * Gadgets load fine when the app is not experiencing much load
>>>>>   (< 10 users rendering 10-12 gadgets on a page)
>>>>> * Once a reasonable number of users begins rendering gadgets, gadget
>>>>>   render calls through the "ifr" endpoint start taking a very long
>>>>>   time to respond
>>>>> * The problem gets worse from there
>>>>> * Even with extensive load testing, we can't recreate this problem
>>>>>   in our testing environments
>>>>> * Our system administrators have assured us that the configurations
>>>>>   of our servers are the same between int and prod
>>>>>
>>>>> This is an example of what we're seeing in the logs from
>>>>> BasicHttpFetcher:
>>>>>
>>>>> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
>>>>> is responding slowly. 12,449 ms elapsed.
>>>>>
>>>>> We'll continue to get these warnings for rpc calls for many
>>>>> different gadgets, the amount of time elapsed will grow, and
>>>>> ultimately every gadget render slows to a crawl.
>>>>>
>>>>> Some other relevant information:
>>>>> * We have implemented "throttling" logic in our own custom
>>>>>   HttpFetcher, which extends BasicHttpFetcher. Basically, it keeps
>>>>>   track of how many outgoing requests are in flight for a given URL,
>>>>>   and if there are too many concurrent ones at once, it starts
>>>>>   rejecting outgoing requests. This was done to avoid a situation
>>>>>   where a slowly responding external service ties up all of
>>>>>   shindig's external HTTP connections. In our case, I believe that
>>>>>   because our rpc endpoint is taking so long to respond, we start
>>>>>   rejecting these requests with our throttling logic.
>>>>>
>>>>> I have tried to trace through the rpc calls inside the shindig code,
>>>>> starting in the RpcServlet, and as best I can tell these rpc calls
>>>>> are used for:
>>>>> * getting viewer data
>>>>> * getting application data
>>>>> * anything else?
>>>>>
>>>>> I've also looked at BasicHttpFetcher, but nothing stands out at
>>>>> first glance that would cause such a difference in performance
>>>>> between environments if, as our sys admins say, they are the same.
>>>>>
>>>>> Additionally, I've ensured that the database table which contains
>>>>> our Application Data is indexed properly (by person ID and gadget
>>>>> URL) and that person data is cached.
>>>>>
>>>>> Any other ideas, or areas in the codebase to explore, are very much
>>>>> appreciated.
>>>>>
>>>>> Thanks!
>>>>> -Matt
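The per-URL throttling Matt describes could be sketched with a semaphore per target URL, for example. This is an illustrative reconstruction from his description, not his actual code; the class name and the concurrency limit are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of per-URL request throttling as described in the thread: track
// how many requests are in flight per target, and reject new ones when a
// limit is exceeded. The limit and names are assumptions, not Shindig code.
public class ThrottlingFetcher {
    private static final int MAX_CONCURRENT_PER_URL = 20; // assumed limit
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    /** Returns the fetched body, or null if the URL is over its limit. */
    public String fetch(String url) {
        Semaphore s = permits.computeIfAbsent(
                url, k -> new Semaphore(MAX_CONCURRENT_PER_URL));
        if (!s.tryAcquire()) {
            return null; // too many concurrent requests to this URL: reject
        }
        try {
            return doFetch(url); // delegate to the real HTTP fetch
        } finally {
            s.release(); // permit is held for the full duration of the call
        }
    }

    // Stand-in for the underlying BasicHttpFetcher call.
    protected String doFetch(String url) {
        return "response from " + url;
    }
}
```

Note the dynamic this creates, which matches the symptoms in the thread: permits are held for the full duration of each request, so when the rpc endpoint slows down, in-flight requests hold their permits longer, the limit is reached sooner, and the throttle starts rejecting, which can make a slowdown look like an outage.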