Okay, so it sounds like either threads are being exhausted or the sheer cost of messaging a lot of people is eating up CPU. It would be really nice to know which, via some external monitoring package. You might also want to try bumping up MONO_THREADS_PER_CPU just to see what happens, if you haven't already.

If it's threads, then perhaps one could spawn only N message threads to work until done. But naturally this would slow down message sending, and it isn't a solution anyway if CPU load is the problem.

So perhaps the best thing would be to experiment with fanning out messages via an external service [1]. GroupsMessagingModule.SendMessageToGroup() could be patched with an experimental setting to send the message request through a new method on IGroupsServicesConnector (e.g. IGroupsServicesConnector.SendMessageToGroup()) rather than purely via the simulator.
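
A minimal sketch of how that patch might look follows. The m_messageViaGroupsService flag, the connector method and the SendMessageToGroupMember() helper are all hypothetical here - they are the proposed additions, not existing code:

    // In GroupsMessagingModule (sketch only, under the assumptions above).
    public void SendMessageToGroup(GridInstantMessage im, UUID groupID)
    {
        if (m_messageViaGroupsService)
        {
            // Hand the whole fan-out to the external groups service.
            m_groupData.SendMessageToGroup(new UUID(im.fromAgentID), groupID, im);
            return;
        }

        // Existing behaviour: the simulator sends an IM to each member itself.
        foreach (GroupMembersData member in m_groupData.GetGroupMembers(UUID.Zero, groupID))
            SendMessageToGroupMember(im, member.AgentID);  // hypothetical helper
    }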

XmlRpcGroupsServiceConnectorModule could then implement this method as another XmlRpc call (with just a stub in SimianGroupsServiceConnectorModule for now).
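
Roughly like this, assuming the module's existing private XmlRpcCall helper; the XmlRpc method name on the PHP side is made up for illustration:

    // In XmlRpcGroupsServiceConnectorModule (sketch).
    public void SendMessageToGroup(UUID requestingAgentID, UUID groupID, GridInstantMessage im)
    {
        Hashtable param = new Hashtable();
        param["GroupID"] = groupID.ToString();
        param["FromAgentID"] = im.fromAgentID.ToString();
        param["Message"] = im.message;

        // "groups.sendMessageToGroup" is an illustrative method name only.
        XmlRpcCall(requestingAgentID, "groups.sendMessageToGroup", param);
    }

    // In SimianGroupsServiceConnectorModule, just a stub for now.
    public void SendMessageToGroup(UUID requestingAgentID, UUID groupID, GridInstantMessage im)
    {
        throw new NotImplementedException();
    }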

An extension to the existing PHP groups code could then distribute the messages. The service should remain stateless so that it's easy to spin up additional instances if load balancing is needed (though one would really need some way to monitor load).

I, for one, would welcome any patches to allow this experimentation. Unfortunately, this isn't an issue in my current work so I can't spend much time on it myself presently.

[1] 
http://opensimulator.org/wiki/Feature_Proposals/Improve_Groups_Service#D.29_Groups_service_distributes_IM

On 18/11/12 18:50, Michelle Argus wrote:
Darn, why did your response land in my junk mail...^^

The exact same behaviour can be seen in the original and new solutions; in the original solution the time period during which the lag is noticeable is just much longer. The viewer lags the moment the IM is sent by the viewer and stops once all IMs have been sent out by the simulator; the issue already starts prior to the simulator actually sending the IMs.

During this time the sim console response is also slowed down when typing a new command.

I could imagine that it is thread related. In that case, tests within the same datacenter or nearby would most probably not cause much noticeable trouble. In my case, the EU/USA distance could increase the lag I experience, in a similar way to all the other requests where we have increasing issues with slow responses.

A short side note on the slow requests: the sims are hosted in one of the top 5 datacenters in Germany, which we had tested with very good ping times to the OSGrid servers, so slow requests are unlikely to be caused by a slow network on the simulator side.

On 10.11.2012 04:44, Justin Clark-Casey wrote:
What is the nature of the lag in this case?  Is it some freezing of the scene (if inbound packets stop being processed) or something else?  On a naive reading of the code, scene freezing should not happen, since the IM packet is handled on its own thread and not the inbound packet loop thread, though I guess it's possible that for big groups, sending a lot of messages may tie up all the available threads (not something I have experienced or tested).

Do you see the same thing with the original solution of caching presence 
information directly in the groups service?
I would have thought so since the sending step in that case is still performed 
by the simulator.

On 08/11/12 22:28, Michelle Argus wrote:
NP Justin.

I tested the new MessageOnlineUsersOnly and it is a big improvement, even though some lag was still noticeable. The lag is noticeable from the moment the IM is sent from the viewer until all IMs have been sent to the online group members. With the new option we have 3 lag-creating steps:
- Query the group members from the group server (time is not logged)
- Query the presence server for online members (time is not logged)
- Sending the IMs (ranged between 59 - 3200 ms for a constant 13 online members of 560 total)
The same test with MessageOnlineUsersOnly = false took about 20 seconds to send all IMs.

As my server is in Europe, the 2 queries to the OSGrid group server AND the presence server can both be slow at times. This could be improved if the group service queried the presence server itself (as in most cases both will be within the same network) instead of the simulator sending both queries. So instead of the currently implemented proposal B, proposal C would be better for simulators further away or those with slow internet connections.

In addition, proposal C does not require each group member to be checked for online status if the group server uses a presence cache. For example, if 2 people are sending IMs to different groups with common members, then 1 presence check would be enough for the common members; the lists of agents sent to the presence service would thus also be reduced. As in bigger grids many IMs are being sent gridwide, the presence queries should be even more efficient in C compared to B.
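
To illustrate the idea (a sketch only, not existing groups service code), an expiring presence cache means each agent is looked up at most once per cache period, no matter how many groups they appear in:

    // Illustrative expiring presence cache for a groups service.
    // Assumes System, System.Collections.Generic, System.Linq and
    // OpenMetaverse (for UUID).
    class PresenceCache
    {
        private struct Entry { public bool Online; public DateTime Cached; }

        private readonly Dictionary<UUID, Entry> m_entries = new Dictionary<UUID, Entry>();
        private static readonly TimeSpan Ttl = TimeSpan.FromSeconds(30);

        // Members whose status is missing or stale and so still needs a
        // presence service lookup.
        public List<UUID> Uncached(IEnumerable<UUID> members)
        {
            return members.Where(id => !m_entries.TryGetValue(id, out Entry e)
                                       || DateTime.UtcNow - e.Cached > Ttl).ToList();
        }

        public void Store(UUID member, bool online)
        {
            m_entries[member] = new Entry { Online = online, Cached = DateTime.UtcNow };
        }

        public bool IsOnline(UUID member)
        {
            return m_entries.TryGetValue(member, out Entry e)
                   && DateTime.UtcNow - e.Cached <= Ttl && e.Online;
        }
    }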



Yes, E is a hybrid of B/C and D. It could use the exact same online presence method as in B/C to get the list of online members. The group service would thus not really need any changes. This also means that a 3rd party dev could develop the IM relay service, which would then not need to be part of the core itself. The only change required would be simulator side, and even this would be a minor change. One would need to implement a url/enable ini setting and, depending on the setting, the IM is either sent simulator side or passed on to the IM relay service. It's a small change simulator side which should not be a problem for implementation in core.
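
The simulator-side switch could be as small as a couple of [Groups] entries in OpenSim.ini, along these purely hypothetical lines (the setting names are illustrative only):

    [Groups]
        ; When false, the simulator sends group IMs itself, as it does today.
        MessagingRelayEnabled = true
        MessagingRelayURI = http://relay.example.org/groups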

As the IM relay server is an option, everyone can decide for themselves whether they trust the host of the IM relay service or whether they want their own simulator to send all IMs. Speaking for the German community in OSGrid, I know that we will definitely use an IM relay to improve the IM situation on the simulator side no matter whether A, B or C gets implemented, especially on our event regions.


My personal favourite would be E, where C is the core group service and the optional IM + group notice service can be run by closed grid admins, or by 3rd parties in open grids, to improve worldwide group chat and notice sending.





On 08.11.2012 04:39, Justin Clark-Casey wrote:
Hi Michelle.  Sorry that it's taken quite a long time for me to reply to this - 
unfortunately been hit by other work
and divers alarums.

I somewhat edited [1] again, mainly for sense changes, though I also left a few comments in italics.  To try and make things clearer, I gave the alternatives names (e.g. alternative E I've called "Separate group IM relay service"). Please change them if they are not accurate.

It would seem that E is mostly a hybrid solution of B (simulator queries 
presence service on IM) and D (Groups service
distributes IM)?  I would regard this as over-complicated for a core 
OpenSimulator solution, especially when one
starts talking about trusted relay services.

I'm curious if you have tried the experimental MessageOnlineUsersOnly = true 
setting I added a couple of weeks ago.

[1] http://opensimulator.org/wiki/Feature_Proposals/Improve_Groups_Service

On 29/10/12 00:13, Michelle Argus wrote:
I have updated the proposal page and listed the different alternatives from A onwards so that it's easier for us.

My current favourite is alternative E: the group server requests the online status from the grid server itself and caches this data, instead of the grid server keeping the group server updated.

- Simulators request their data directly from the group server and send IMs themselves, OR
- Optionally, the simulator communicates via a relay service with its own cache. The relay service requests its data from the same central group server. The relay service can additionally send IMs if wanted, to reduce resource usage on the simulator side. The relay service can be hosted by anyone for a worldwide network.

The same concept could be used for other services such as assets, presence, inventory, friends list etc., which are meanwhile causing many issues due to slow requests in bigger grids such as OSGrid.


On 23.10.2012 04:40, Justin Clark-Casey wrote:
Apologies, it's [1].   Please feel free to edit it as you see fit - I've put 
you as one of the proposers.  This page
is to keep track of the issue rather than a formal proposal mechanism.

No rush on this - please feel free to take your time in responding. In truth, I only have a certain amount of time for these issues currently myself.

Having messages route through a service rather than be largely handled by 
simulators themselves is an interesting
approach.  It's the argument of a distributed versus a more centralized 
architecture. Although I can't see
OpenSimulator going down this route in the near future, if anybody wants to 
experiment and needs additional config
settings then patches are very welcome.

[1] http://opensimulator.org/wiki/Feature_Proposals/Improve_Groups_Service

On 20/10/12 11:06, Michelle Argus wrote:
Justin, could you post the url to the suggestion page, I think you forgot to 
add it ;)

One issue with having the sim update online status is that if someone has the group module disabled or uses a different setting, then the status is not updated. As other modules hosted by the grids might also need this information, one should consider adding something to the grid server for this.

I also like the idea from Akira of having the group server receive the full IM and then send it to everyone, instead of having the sim send the message. One could then have a specialized server installed for the group module which cannot create any lag issues sim-side. This could then also be used for a gridwide spam filter or for filtering illegal activities within the grid.

Haven't had much time though, as I have a longer event running which ends on Sunday...


On 20.10.2012 04:32, Justin Clark-Casey wrote:
Regarding the groups work, I have now implemented an OpenSimulator experimental 
option, MessageOnlineUsersOnly in
[Groups] as of git master 1937e5f.  When set to true this will only send group 
IMs to online users.  This does not
require a groups service update.  I believe OSGrid is going to test this more 
extensively soon though it appears to
work fine on Wright Plaza.
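
For reference, the setting lives in the [Groups] section of OpenSim.ini:

    [Groups]
        Enabled = true
        ; Experimental option as of git master 1937e5f: only send group IMs
        ; to members who are actually online.
        MessageOnlineUsersOnly = true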

It's temporarily a little spammy on the console right now (what isn't!) with a debug message that says how many online users it is sending to and how long a send takes.

Unlike Michelle's solution, this works by querying the Presence service for online users, though it also caches this data to avoid hitting the presence service too hard.

Even though I implemented this, I'm not convinced that it's the best way to go 
- I think Michelle's approach of
sending login/logoff status directly from simulator to groups service could 
still be better.  My chief concern with
the groups approach is the potential inconsistency between online status stored 
there and in the Presence service.
However, this could be a non-issue. Need to give it more thought.

On 14/10/12 22:53, Akira Sonoda wrote:
IMHO finding out which group members are online and sending group IM/notices etc. to them actually should not be done by the region server from which the group IM/notice etc. is sent. This is a task which should be done centrally - in the case of OSgrid, in Dallas TX (http://wiki.osgrid.org/index.php/Infrastructure). The region server should only collect the group IM/notice etc. and send it to the central group server, or, in the other direction, receive IM/notices etc. from the central group server and distribute them to the agents active on the region(s).

That concentrates all distribution on a central point rather than spreading it 
amongst simulators.  Then OSGrid has
the problem of scaling this up.

Having said that, there are advantages to funnelling things through a reliable central point.  Which is better is a complicated engineering issue - the kind of which there are many in the MMO/VW space.


But there are other places too which can and should be improved. I did some tests with some viewers, counting the web requests to the central infrastructure:

Test 1: Teleport from a Plaza to one of my regions located on a server in 
Europe and afterwards logging out:

Cool VL Viewer: 912 requests, mostly SynchronousRestForms POST http://presence.osgrid.org/presence (I guess to inform all my 809 friends [mostly only 5% online] that I am going offline, because the calls to the presence service were done after I closed the viewer)
Singularity Viewer: 921 requests, mostly calls to presence after logoff
Teapot Viewer: 910 requests, mostly calls to presence after logoff
Astra Viewer: 917 requests, mostly calls to presence after logoff
Firestorm: 1005 requests, mostly calls to presence after logoff
Imprudence: 918, mostly calls to presence after logoff

So far so good. I have no idea why my 760 offline friends have to be informed 
that I went offline ...
(Details can be found here: 
https://docs.google.com/open?id=0B301xueh1kxdNG1wLWo2YVVfYjA )

Test 2: Direct Login onto my Region and then Logoff (with FetchInventory2 disabled)

Cool VL Viewer: 2232 requests, mostly calls to presence (~800 during login and ~800 during logout) and xinventory
Singularity Viewer: 2340 requests, mostly calls to presence and xinventory
Teapot Viewer: produced 500+ threads in a very short time and then OpenSim.exe crashed
Astra Viewer: 2831 requests, mostly calls to presence and xinventory
Firestorm Viewer: ACK timeout for me. OpenSim.exe survived on 500 threads for 30+ minutes, producing 4996 requests, mostly xinventory
Imprudence: 1745 requests, mostly presence

Again, why do all my 809 friends have to be verified with single requests? Then why this difference in xinventory requests? And why are both Teapot and Firestorm producing so many threads in such a short time, bringing OpenSim.exe to a crash or close to it?
(Details can be found here: https://docs.google.com/open?id=0B301xueh1kxdMDJxWm5UR2QtU2c )

The presence information is useful data and it was possible in git master 
commit da2b23f to change the Friends
module
to fetch all presence data in one call for status notification when a user goes 
on/offline, rather than make a
separate call for each friend.
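
A sketch of the batched pattern (the field and helper names here are illustrative, though GetAgents(string[]) is the batched call on the core presence service interface):

    // One batched presence lookup instead of one call per friend.
    string[] friendIDs = friends.Select(f => f.Friend).ToArray();
    PresenceInfo[] online = m_PresenceService.GetAgents(friendIDs);

    // Each online friend still has to be messaged individually, which is
    // why the saving is mostly in lookup latency rather than message count.
    foreach (PresenceInfo pi in online)
        StatusNotify(pi, userID, true);  // illustrative helper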

This should be more efficient since only the latency and resources of one call 
is required.  However, since each
friend still has to be messaged separately to tell them of the status change 
I'm not sure how much practical effect
this will have.


Test 3: Direct Login to my Region with FetchInventory2 enabled.

Teapot Viewer: I closed the viewer after 30 minutes. The number of threads was still rising, up to 260. In the end I counted 30634 xinventory requests... My inventory has 14190 items!!!
Firestorm Viewer: Quite normal, approx 2020 requests... quite a few slow FetchInventoryDescendents2 caps, with 100 sec max

Regarding inventory service, unfortunately many viewers appear to behave very aggressively when fetching inventory information.  For instance, I'm told that if you have certain types of AO enabled - some viewers will fetch your entire inventory.  The LL infrastructure may be able to cope with this but the more modest machines running grids can have trouble, it seems.

I'm not sure what the long term solution is.  I suspect it's possible to greatly increase inventory fetch efficiency, possibly by some kind of call batching.  Or perhaps there's some viewer-side caching that OpenSimulator isn't working with properly.


( Details can be found here: 
https://docs.google.com/open?id=0B301xueh1kxdNEtEeUVFamU1QUE )

Just my observations this week end.
Akira



2012/10/13 Justin Clark-Casey <jjusti...@googlemail.com>

    Hi Michelle.  I've now had some more time to think about this. In fact, I established a proposal summary page at [1] which I'll change as we go along (or please feel free to change it yourself).  We do need to fix this problem of group IM taking massive amounts of time with groups that aren't that big.

    I do like the approach of caching online status (and login time) in the 
groups service.

    1.  It's reasonably simple.
    2.  One network call to fetch online group members per IM.
    3.  May allow messaging across multiple OpenSimulator installations.

    However, this approach does mean:

    1.  Independently updating the groups services on each login/logout.  I'm not saying this is a problem, particularly if it saves traffic later on.
    2.  The groups service has to deal with extra information. Again, this is fairly simple, so not necessarily a fatal issue, though it does mean every groups implementation needs to do this in some manner.
    3.  The online cache is not reusable by other services in the future.

    On a technical note, the XmlRpc groups module does in theory cache data for 30 seconds by default, so a change in online status may not be seen for up to 30 seconds. I personally think that this is a reasonable tradeoff.

    Of the above cons, 3 is the one I'm finding most serious.  If other services would also benefit from online status caching in the future, they would have to implement their own caches (and be updated from simulators).

    I do agree that making a GridUser.LoggedIn() call for every single group member on every single IM is unworkable.  Even if this is only done once and cached for a certain period of time, it could be a major issue for large groups.

    So an alternative approach could be to add a new call to the GridUser service (maybe LoggedIn(List<UUID>)) that will only return GridUserInfo for those that are logged in. This could then be cached simulator-side for a certain period of time (e.g. 30 seconds, like the groups information) and used for group IM.
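
    A sketch under those assumptions - LoggedIn(List<UUID>) is only the "maybe" proposal above, not an existing IGridUserService method, and the cache wiring is illustrative (ExpiringCache being the libomv utility class):

        // Proposed service call (not in core):
        //     GridUserInfo[] LoggedIn(List<UUID> agentIDs);

        // Simulator-side cache, refreshed at most every 30 seconds per group.
        private ExpiringCache<UUID, GridUserInfo[]> m_onlineCache =
            new ExpiringCache<UUID, GridUserInfo[]>();

        private GridUserInfo[] GetOnlineMembers(UUID groupID, List<UUID> memberIDs)
        {
            GridUserInfo[] online;

            if (!m_onlineCache.TryGetValue(groupID, out online))
            {
                online = m_GridUserService.LoggedIn(memberIDs);
                m_onlineCache.Add(groupID, online, 30);  // 30 second expiry
            }

            return online;
        }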

    This has the advantages that:

    1.  Groups and future services don't need to do their own login caching.
    2.  Future services can use the same information and code rather than have to cache login information themselves.

    However, it does:

    1.  Require GridUserInfo caching simulator-side; I would judge this to be a more complex approach.
    2.  Mean that during the cache period, newly online group members will not receive messages (this is going to happen with GetGroupMembers() caching anyway).
    3.  Still generate traffic to the GridUser service at the end of every simulator-side caching period. This is probably not a huge burden.

    So right now, I'm somewhat more in favour of a GridUserInfo simulator-side caching approach than caching login information within the groups service.  However, unlike you, I haven't actually tried to implement this approach, so there may well be issues that I haven't seen.

    What do you think, Michelle (or anybody else)?


    On 10/10/12 19:47, Michelle Argus wrote:

        http://code.google.com/p/flotsam/ is the current flotsam version and points to the github repo which I forked and then patched.

        None of the changes I proposed in my git fork have been implemented, 
neither in opensim nor in flotsam.

        Consider my proposal as a quick fix for the time being, which does not solve all the other issues mentioned in later mailings.

        On 09.10.2012 10:24, Ai Austin wrote:

            Michelle Argus on Wed Oct 3 18:00:23 CEST 2012:

                I have added some changes to the group module of OpenSim and the flotsam server.
                ...
                The changes can be found in the 2 gits here: https://github.com/MAReantals

                NB: Both changes to flotsam and opensim are backward compatible and do not require that both parts are updated. If some simulators are not updated, it can happen that some group members do not receive group messages, as their online status is not updated correctly. In a grid like OSgrid my recommendation would thus be to first update the simulators and at a later stage flotsam.


            Hi Michelle... I am looking at what is needed to update the Openvue grid, which is using the flotsam XmlRpcGroups module.  The GitHub repository has the changes from a few days ago... but I wonder if there has been an update/commit into the main Opensim GitHub area already. I cannot see a related commit looking back over the last week or so.  Is the core system updated so this module is up to date in that?  I also note that the Opensim.ini.example file contains a reference to http://code.google.com/p/flotsam/ for details of how to install the service... but that seems to be pointing at an out of date version?

            I think for the flotsam php end it is straightforward, and I obtained the changed groups.sql and xmlrpc.php files needed.  But note that people are still pointed via the opensim.ini.example comments at the old version on http://code.google.com/p/flotsam/ - so either that needs updating to the latest version, or the comment in opensim.ini.example needs to be changed.

            To avoid mistakes, I wonder if you can clarify where to go for the parts needed and at what revision/date of OpenSim 0.7.5 dev master this was introduced, and what to get and what to change for an existing service in terms of the database tables, the OpenSim.exe instance and the web support php code areas?

            Thanks Michelle, Ai




    --
    Justin Clark-Casey (justincc)
    OSVW Consulting
    http://justincc.org
    http://twitter.com/justincc

--
Justin Clark-Casey (justincc)
OSVW Consulting
http://justincc.org
http://twitter.com/justincc