Re: Performance problems with opensocial rpc calls

2014-07-18 Thread Ryan Baxter
Matt, I think further investigation is warranted.  I really think you
need to find a way to trace through the code and find where the
slowdown is occurring.  That will help us narrow down what the problem
is.  I know it is production, but getting some code on there that
starts timing method calls and such can be very useful.
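
For example, even something as simple as a servlet filter that logs any
request over a threshold would narrow it down to an endpoint.  Just as a
sketch of what I mean (nothing like this ships with shindig; the class
name, threshold, and URL patterns are made up, and you would register it
in shindig's web.xml ahead of the existing filters):

  import java.io.IOException;
  import java.util.logging.Logger;
  import javax.servlet.Filter;
  import javax.servlet.FilterChain;
  import javax.servlet.FilterConfig;
  import javax.servlet.ServletException;
  import javax.servlet.ServletRequest;
  import javax.servlet.ServletResponse;
  import javax.servlet.http.HttpServletRequest;

  // Hypothetical filter: map it to /gadgets/* and /rpc/* (or /gmodules/*)
  // in web.xml and it logs any request that takes longer than the threshold.
  public class SlowRequestLoggingFilter implements Filter {
    private static final Logger LOG =
        Logger.getLogger(SlowRequestLoggingFilter.class.getName());
    private static final long THRESHOLD_MS = 500; // arbitrary cut-off

    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
        throws IOException, ServletException {
      long start = System.nanoTime();
      try {
        chain.doFilter(req, resp);
      } finally {
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        if (elapsedMs > THRESHOLD_MS) {
          String uri = (req instanceof HttpServletRequest)
              ? ((HttpServletRequest) req).getRequestURI() : "?";
          LOG.warning("Slow request: " + uri + " took " + elapsedMs + " ms");
        }
      }
    }
  }

From there you can push the same kind of timing down into the suspect code
paths (the http fetcher, the cache lookups) until the slow spot shows up.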

On Tue, Jul 15, 2014 at 3:04 PM, Merrill, Matt mmerr...@mitre.org wrote:
 Hi Ryan,

 Thanks for responding!

 I’ve attached our ehcacheConfig; however, comparing it to the default
 configuration, the only differences are the overall number of elements (1
 in ours vs 1000 in default) and also the temp disk store location.

 I’m assuming you are asking if each user in our system has the exact same
 set of gadgets to render, correct?  If that’s the case: different users
 have different sets of gadgets, however, many of them have a default set
 we give them when they are initially set up in our system.  So, many people
 will hit the same gadgets over and over again.  This default subset of
 gadgets is about 10-12 different gadgets and that is by and large what
 many users have.

 However, we have a total of 48 different gadgets that could be rendered by
 a user at any given time on this instance of shindig.  We do run another
 instance of shindig which could render a different subset of gadgets, but
 that has a much lower usage and only renders about 10 different gadgets
 altogether.


 I am admittedly rusty with my ehCache configuration knowledge, but here
 are a couple of things I noticed:
 * I notice that the maxBytesLocalHeap in the ehCacheConfig is 50mb, which
 seems low, however, this is the same setting we had in shindig 2.0, so I
 have to wonder if that has anything to do with it.
 * Our old ehCache configuration for shindig 2.0 specified a defaultCache
 maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
 * Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
 maxDepth of 1 but NO defaultCache maxElementsInMemory.

 Our heap sizes in Tomcat are 2048 MB, which seems adequate given a 50 MB
 max heap for the cache. This is the same heap size from when we were using
 shindig 2.0.  Unfortunately, we don’t have profiling tools enabled on our
 Tomcat instances, so I can’t see what the heap looked like when things
 crashed, and like I said, we’re unable to reproduce this in int.

 I think we might be on to something here… I will keep searching but if any
 devs out there have any ideas, please let me know.

 Thanks shindig list!
 -Matt

 On 7/13/14, 10:12 AM, Ryan Baxter rbaxte...@gmail.com wrote:

Matt, can you tell us more about how you have configured the caches in
shindig?  When you are rendering these gadgets, are you rendering the same
gadget across all users?

-Ryan

 On Jul 9, 2014, at 3:31 PM, Merrill, Matt mmerr...@mitre.org wrote:

 Stanton,

 Thanks for responding!

 This is one instance of shindig.

 If you mean the configuration within the container and for the shindig
 java app, then yes, the locked domains are the same.  In fact, the
 configuration, with the exception of shindig’s host URLs, is exactly the
 same from what I can tell.

 Unfortunately, I don’t have any way to trace that exact message, but I did
 do a traceroute from the server running shindig to the URL that is being
 called for rpc calls to make sure there weren’t any extra network hops,
 and there weren’t; it actually only had one, as expected for an app making
 an HTTP call to itself.

 Thanks again for responding.

 -Matt

 On 7/9/14, 3:08 PM, Stanton Sievers ssiev...@apache.org wrote:

 Hi Matt,

 Is the configuration for locked domains and security tokens consistent
 between your test and production environments?

 Do you have any way of tracing the request in the log entry you
provided
 through the network?  Is this a single Shindig server or is there any
load
 balancing occurring?

 Regards,
 -Stanton


 On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt mmerr...@mitre.org
wrote:

 Hi shindig devs,

 We are in the process of upgrading from shindig 2.0 to 2.5-update1 and
 everything has gone ok, however, once we got into our production
 environment, we are seeing significant slowdowns for the opensocial
RPC
 calls that shindig makes to itself when rendering a gadget.

 This is obviously very dependent on how we’ve implemented the shindig
 interfaces in our own code, and also on our infrastructure; however, we’re
 hoping someone on the list can help give us some more ideas for areas to
 investigate inside shindig itself or in general.

 Here’s what’s happening:
 * Gadgets load fine when the app is not experiencing much load (< 10 users
 rendering 10-12 gadgets on a page)
 * Once a reasonable subset of users begins rendering gadgets, gadget
 render calls through the “ifr” endpoint start taking a very long time to
 respond
 * The problem gets worse from there
 * Even with extensive load testing we can’t recreate this problem in our
 testing environments
 * Our system administrators have assured us that the configurations of our
 servers are the same between int and prod

Re: Performance problems with opensocial rpc calls

2014-07-18 Thread Merrill, Matt
Yep, that’s where I’m headed next.  Obviously there’s some hesitation to
do that on the part of our product owners, so it takes a while to get to
that point.

Will let you know what I find.

Thanks!
-Matt

On 7/18/14, 8:44 AM, Ryan Baxter rbaxte...@gmail.com wrote:

Matt, I think further investigation is warranted.  I really think you
need to find a way to trace through the code and find where the
slowdown is occurring.  That will help us narrow down what the problem
is.  I know it is production, but getting some code on there that
starts timing method calls and such can be very useful.

On Tue, Jul 15, 2014 at 3:04 PM, Merrill, Matt mmerr...@mitre.org wrote:
 Hi Ryan,

 Thanks for responding!

 I’ve attached our ehcacheConfig; however, comparing it to the default
 configuration, the only differences are the overall number of elements (1
 in ours vs 1000 in default) and also the temp disk store location.

 I’m assuming you are asking if each user in our system has the exact
same
 set of gadgets to render, correct?  If that’s the case: different users
 have different sets of gadgets, however, many of them have a default set
 we give them when they are initially set up in our system.  So, many
people
 will hit the same gadgets over and over again.  This default subset of
 gadgets is about 10-12 different gadgets and that is by and large what
 many users have.

 However, we have a total of 48 different gadgets that could be rendered
by
 a user at any given time on this instance of shindig.  We do run another
 instance of shindig which could render a different subset of gadgets,
but
 that has a much lower usage and only renders about 10 different gadgets
 altogether.


 I am admittedly rusty with my ehCache configuration knowledge, but here
 are a couple of things I noticed:
 * I notice that the maxBytesLocalHeap in the ehCacheConfig is 50mb,
which
 seems low, however, this is the same setting we had in shindig 2.0, so I
 have to wonder if that has anything to do with it.
 * Our old ehCache configuration for shindig 2.0 specified a defaultCache
 maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
 * Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
 maxDepth of 1 but NO defaultCache maxElementsInMemory.

 Our heap sizes in Tomcat are 2048 MB, which seems adequate given a 50 MB
 max heap for the cache. This is the same heap size from when we were using
 shindig 2.0.  Unfortunately, we don’t have profiling tools enabled on our
 Tomcat instances, so I can’t see what the heap looked like when things
 crashed, and like I said, we’re unable to reproduce this in int.

 I think we might be on to something here… I will keep searching but if
any
 devs out there have any ideas, please let me know.

 Thanks shindig list!
 -Matt

 On 7/13/14, 10:12 AM, Ryan Baxter rbaxte...@gmail.com wrote:

Matt, can you tell us more about how you have configured the caches in
shindig?  When you are rendering these gadgets, are you rendering the same
gadget across all users?

-Ryan

 On Jul 9, 2014, at 3:31 PM, Merrill, Matt mmerr...@mitre.org
wrote:

 Stanton,

 Thanks for responding!

 This is one instance of shindig.

 If you mean the configuration within the container and for the shindig
 java app, then yes, the locked domains are the same.  In fact, the
 configuration, with the exception of shindig’s host URLs, is exactly the
 same from what I can tell.

 Unfortunately, I don’t have any way to trace that exact message, but I did
 do a traceroute from the server running shindig to the URL that is being
 called for rpc calls to make sure there weren’t any extra network hops,
 and there weren’t; it actually only had one, as expected for an app making
 an HTTP call to itself.

 Thanks again for responding.

 -Matt

 On 7/9/14, 3:08 PM, Stanton Sievers ssiev...@apache.org wrote:

 Hi Matt,

 Is the configuration for locked domains and security tokens
consistent
 between your test and production environments?

 Do you have any way of tracing the request in the log entry you
provided
 through the network?  Is this a single Shindig server or is there any
load
 balancing occurring?

 Regards,
 -Stanton


 On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt mmerr...@mitre.org
wrote:

 Hi shindig devs,

 We are in the process of upgrading from shindig 2.0 to 2.5-update1
and
 everything has gone ok, however, once we got into our production
 environment, we are seeing significant slowdowns for the opensocial
RPC
 calls that shindig makes to itself when rendering a gadget.

 This is obviously very dependent on how we’ve implemented the shindig
 interfaces in our own code, and also on our infrastructure; however, we’re
 hoping someone on the list can help give us some more ideas for areas to
 investigate inside shindig itself or in general.

 Here’s what’s happening:
 * Gadgets load fine when the app is not experiencing much load (< 10 users
 rendering 10-12 gadgets on a page)
 * Once a reasonable subset of users begins rendering gadgets, gadget
 render calls through the “ifr” endpoint start taking a very long time to
 respond

Re: Performance problems with opensocial rpc calls

2014-07-15 Thread Merrill, Matt
Hi Ryan,

Thanks for responding!

I’ve attached our ehcacheConfig; however, comparing it to the default
configuration, the only differences are the overall number of elements (1
in ours vs 1000 in default) and also the temp disk store location.

I’m assuming you are asking if each user in our system has the exact same
set of gadgets to render, correct?  If that’s the case: different users
have different sets of gadgets, however, many of them have a default set
we give them when they are initially set up in our system.  So, many people
will hit the same gadgets over and over again.  This default subset of
gadgets is about 10-12 different gadgets and that is by and large what
many users have. 

However, we have a total of 48 different gadgets that could be rendered by
a user at any given time on this instance of shindig.  We do run another
instance of shindig which could render a different subset of gadgets, but
that has a much lower usage and only renders about 10 different gadgets
altogether.


I am admittedly rusty with my ehCache configuration knowledge, but here
are a couple of things I noticed:
* I notice that the maxBytesLocalHeap in the ehCacheConfig is 50mb, which
seems low, however, this is the same setting we had in shindig 2.0, so I
have to wonder if that has anything to do with it.
* Our old ehCache configuration for shindig 2.0 specified a defaultCache
maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
* Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
maxDepth of 1 but NO defaultCache maxElementsInMemory.
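
To make that last point concrete, here is roughly the shape of the two
configurations in ehcache.xml.  This is an illustrative sketch only (any
values other than the ones called out above are placeholders, not our
actual files), but the element and attribute names are standard Ehcache 2.x:

  <!-- shindig 2.0-era style: element-count bound, no sizeOf policy -->
  <ehcache>
    <defaultCache maxElementsInMemory="1000" eternal="false"
                  overflowToDisk="false"/>
  </ehcache>

  <!-- shindig 2.5-era style: byte bound on the heap, very shallow sizeOf walk -->
  <ehcache maxBytesLocalHeap="50M">
    <sizeOfPolicy maxDepth="1" maxDepthExceededBehavior="continue"/>
    <defaultCache eternal="false"/>
  </ehcache>

My understanding (worth double-checking against the Ehcache docs) is that
maxDepth limits how many referenced objects the SizeOf engine walks when
sizing each element, so a depth of 1 on a byte-bounded cache could mean
either very inaccurate sizing or a steady stream of sizeOf warnings, both
of which might behave very differently under production load.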

Our heap sizes in Tomcat are 2048 MB, which seems adequate given a 50 MB
max heap for the cache. This is the same heap size from when we were using
shindig 2.0.  Unfortunately, we don’t have profiling tools enabled on our
Tomcat instances, so I can’t see what the heap looked like when things
crashed, and like I said, we’re unable to reproduce this in int.
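
Even without a profiler attached, it might be worth turning on GC logging
and a heap dump on OutOfMemoryError so there is something to look at the
next time this happens.  A sketch of the kind of flags I mean, using
standard HotSpot options in Tomcat's setenv.sh (the paths are placeholders):

  CATALINA_OPTS="$CATALINA_OPTS -Xmx2048m \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/shindig-heap.hprof \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/tmp/shindig-gc.log"

That would at least tell us whether the slowdown correlates with the heap
filling up or with long GC pauses.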

I think we might be on to something here… I will keep searching but if any
devs out there have any ideas, please let me know.

Thanks shindig list!
-Matt

On 7/13/14, 10:12 AM, Ryan Baxter rbaxte...@gmail.com wrote:

Matt, can you tell us more about how you have configured the caches in
shindig?  When you are rendering these gadgets, are you rendering the same
gadget across all users?

-Ryan

 On Jul 9, 2014, at 3:31 PM, Merrill, Matt mmerr...@mitre.org wrote:
 
 Stanton, 
 
 Thanks for responding!
 
 This is one instance of shindig.
 
 If you mean the configuration within the container and for the shindig
 java app, then yes, the locked domains are the same.  In fact, the
 configuration, with the exception of shindig’s host URLs, is exactly the
 same from what I can tell.
 
 Unfortunately, I don’t have any way to trace that exact message, but I did
 do a traceroute from the server running shindig to the URL that is being
 called for rpc calls to make sure there weren’t any extra network hops,
 and there weren’t; it actually only had one, as expected for an app making
 an HTTP call to itself.
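
 Since traceroute only shows the path and not where the time goes, one
 cheap follow-up might be to time the rpc URL directly from the shindig
 host and compare connect time to total turnaround.  A rough sketch of
 what I mean (the class is hypothetical, the URL and token are
 placeholders, and even if the call fails token validation the turnaround
 time is still informative):

   import java.io.InputStream;
   import java.net.HttpURLConnection;
   import java.net.URL;

   public class RpcTimingCheck {
     public static void main(String[] args) throws Exception {
       // Placeholder; substitute the real /gmodules/rpc URL and security token.
       URL url = new URL(args.length > 0 ? args[0]
           : "http://localhost:7001/gmodules/rpc?st=TOKEN");
       long start = System.nanoTime();
       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
       conn.connect();                      // TCP connect only
       long connected = System.nanoTime();
       int status = conn.getResponseCode(); // waits for the server to respond
       InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
       if (in != null) {
         byte[] buf = new byte[8192];
         while (in.read(buf) != -1) { /* drain the body */ }
         in.close();
       }
       long done = System.nanoTime();
       System.out.printf("status=%d connect=%dms total=%dms%n", status,
           (connected - start) / 1000000L, (done - start) / 1000000L);
     }
   }

 If connect time stays small while total time balloons under load, the
 time is being spent inside shindig rather than on the network.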
 
 Thanks again for responding.
 
 -Matt
 
 On 7/9/14, 3:08 PM, Stanton Sievers ssiev...@apache.org wrote:
 
 Hi Matt,
 
 Is the configuration for locked domains and security tokens consistent
 between your test and production environments?
 
 Do you have any way of tracing the request in the log entry you
provided
 through the network?  Is this a single Shindig server or is there any
load
 balancing occurring?
 
 Regards,
 -Stanton
 
 
 On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt mmerr...@mitre.org
wrote:
 
 Hi shindig devs,
 
 We are in the process of upgrading from shindig 2.0 to 2.5-update1 and
 everything has gone ok, however, once we got into our production
 environment, we are seeing significant slowdowns for the opensocial
RPC
 calls that shindig makes to itself when rendering a gadget.
 
 This is obviously very dependent on how we’ve implemented the shindig
 interfaces in our own code, and also on our infrastructure; however, we’re
 hoping someone on the list can help give us some more ideas for areas to
 investigate inside shindig itself or in general.
 
 Here’s what’s happening:
 * Gadgets load fine when the app is not experiencing much load (< 10 users
 rendering 10-12 gadgets on a page)
 * Once a reasonable subset of users begins rendering gadgets, gadget
 render calls through the “ifr” endpoint start taking a very long time to
 respond
 * The problem gets worse from there
 * Even with extensive load testing we can’t recreate this problem in our
 testing environments
 * Our system administrators have assured us that the configurations of our
 servers are the same between int and prod
 
 This is an example of what we’re seeing from the logs inside
 BasicHttpFetcher:
 
 
 
http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j