Re: [Linux-ha-dev] cl_log dropping messages

2007-11-08 Thread Lars Marowsky-Bree
On 2007-11-06T09:53:13, Alan Robertson [EMAIL PROTECTED] wrote:

> Cutting out that debug should be OK - or raising it to happen if debug is > 1
> would probably also be OK.  If you're seeing this happen a lot, that's not a
> good thing.  Getting behind 200 messages seems like a lot to me - offhand.

It's not. It happens very quickly. Alan, when did you last run CTS on a
9 node cluster? ;-)

Network transmissions take time, and a longer time than the transmission
from one process to the next via IPC - when the TE initiates the full
load of actions nearly instantaneously, the network layer lags behind.

And keep in mind that this is all from the DC, so the DC's network
connectivity is a choke point.

Here, we have ~250 actions - the messages started being dumped at ~200
messages or so. Once past that threshold, the logging mania started,
_and_ the logging mania contributed to making the MCP even slower, so it
was more likely to stay behind.
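
A toy model of that feedback loop, as a rough sketch only. This is not heartbeat code: the function name, the per-tick drain rate and the logging penalty are made-up assumptions, and only the ~250 actions and the ~200 message threshold come from the numbers above. It shows how a burst that arrives all at once stays above the threshold longer when the logging itself slows down the drain.

```python
def simulate_burst(actions=250, drain_per_tick=10, backlog_limit=200, log_penalty=0):
    """Toy model of a DC sending a burst of actions over a slower network.

    All parameters are illustrative assumptions, not heartbeat settings.
    Returns (ticks_over_limit, total_ticks).
    """
    backlog = actions            # the TE hands over the whole burst at once via IPC
    ticks_over_limit = 0
    total_ticks = 0
    while backlog > 0:
        drain = drain_per_tick
        if backlog > backlog_limit:
            # Past the threshold: warnings get logged, and the logging
            # itself costs time, so fewer messages are sent this tick.
            ticks_over_limit += 1
            drain = max(1, drain_per_tick - log_penalty)
        backlog = max(0, backlog - drain)
        total_ticks += 1
    return ticks_over_limit, total_ticks


if __name__ == "__main__":
    print(simulate_burst())               # no logging overhead
    print(simulate_burst(log_penalty=5))  # logging halves the drain rate
```

With these made-up rates, the backlog stays above the threshold twice as long once the logging overhead is included, which is the "more likely to stay behind" effect.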

> Are you also seeing retransmissions?

A few (it also overflows the network buffers), but not very many.

> Just because you have a lot of processors doesn't mean that Xen is
> scheduling you properly.  You have two different schedulers going on here -
> so the opportunities for problems go up rather rapidly.

That is not quite true; the hypervisor has essentially scheduled the
guests to one CPU each, and doesn't need to interfere with the local
scheduling. They are - except for the networking, and other shared
resources, of course - running concurrently and independently, and have
a full CPU to themselves.


Regards,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] cl_log dropping messages

2007-11-06 Thread Alan Robertson

Lars Marowsky-Bree wrote:
> On 2007-10-25T16:25:30, Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
>
>> http://hg.linux-ha.org/dev/rev/69f0395c2ead seems to fix some of this
>> for me.
>
> BTW, I was able to conclude a 100 cycle run with that patch applied on 7
> nodes, and absolutely not a single BadNews, which is a first.

Cutting out that debug should be OK - or raising it to happen if debug
is > 1 would probably also be OK.  If you're seeing this happen a lot,
that's not a good thing.  Getting behind 200 messages seems like a lot
to me - offhand.


Are you also seeing retransmissions?

Just because you have a lot of processors doesn't mean that Xen is 
scheduling you properly.  You have two different schedulers going on 
here - so the opportunities for problems go up rather rapidly.


--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions. - William 
Wilberforce



[Linux-ha-dev] cl_log dropping messages

2007-10-25 Thread Lars Marowsky-Bree
Hi all,

on my 7 node cluster, I see the occasional - every 5-10 tests - bunch of
messages dropped during a burst; usually on the DC (what a surprise), on
the order of ~200 messages dropped per incident.

This occurs only with debug 1, and only above 5 nodes or so.

So yes, my cluster is fully virtualized. However, the physical host has
8 x 2.66 GHz cores; the guests don't write the messages to their own
image, but relay them via syslog-ng to the host, where they get written
to a RAM disk, so no IO bottleneck. Each guest essentially has 1 core to
itself + 512MB RAM.

The network is fully virtual, so I can't be hitting that limit.

syslog-ng is running with a fifosize of 4 lines, and I upped logd to
2048 sendqlen/recvqlen.

As a data point: I was seeing the very same message drop rate earlier,
and doubled the buffers on syslog-ng and logd at that point; no change.

Any suggestions?


Regards,
Lars



Re: [Linux-ha-dev] cl_log dropping messages

2007-10-25 Thread Dejan Muhamedagic
Hi,

On Thu, Oct 25, 2007 at 12:59:14PM +0200, Lars Marowsky-Bree wrote:
> Hi all,
>
> on my 7 node cluster, I see the occasional - every 5-10 tests - bunch of
> messages dropped during a burst; usually on the DC (what a surprise), on
> the order of ~200 messages dropped per incident.
>
> This occurs only with debug 1, and only above 5 nodes or so.
>
> So yes, my cluster is fully virtualized. However, the physical host has
> 8 x 2.66 Ghz cores; the guests don't write the messages to their own
> image, but relay it via syslog-ng to the host, where it gets written
> to a RAM disk, so no IO bottleneck. Each guest essentially has 1 core to
> itself + 512MB RAM.
>
> The network is fully virtual, so I can't be hitting that limit.

Your Xen is probably better than mine. Here the transfer rate
(guest to host) is at times only around 10 Mbit/s.

> syslog-ng is running with a fifosize of 4 lines, and I upped logd to
> 2048 sendqlen/recvqlen.
>
> As a data point: I was experiencing the very same drop message rate and
> doubled the buffers on syslog-ng and logd then; no change.
>
> Any suggestions?

Remember this one:
http://lists.linux-ha.org/pipermail/linux-ha-dev/2007-April/014378.html

Cheers,

Dejan

 


Re: [Linux-ha-dev] cl_log dropping messages

2007-10-25 Thread Lars Marowsky-Bree
On 2007-10-25T13:11:56, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

> > The network is fully virtual, so I can't be hitting that limit.
> Probably your xen is better than mine. Here I have a transfer
> rate (guest to host) at times around 10mbit.

Paravirtualized is quite fast. I also don't connect my Xen network to
the external NIC (which is the default), but use a decoupled internal
bridged network, which is quite a bit faster, and also doesn't flood my
LAN ;-)

I couldn't figure it out myself, but it is described clearly here:
http://en.opensuse.org/Xen3_and_a_Virtual_Network

With scp (and that includes encryption) I get ~35 MB/s between the
DomU and Dom0. Certainly plenty for logging.

> > As a data point: I was experiencing the very same drop message rate and
> > doubled the buffers on syslog-ng and logd then; no change.
> >
> > Any suggestions?
>
> Remember this one:
> http://lists.linux-ha.org/pipermail/linux-ha-dev/2007-April/014378.html

I remember this, but I'm not quite sure I see how this relates to a
performance issue.

I eventually noticed that all those lost messages were not harmful (i.e.,
CTS wasn't missing any of the patterns it was looking for), but that I
was getting a _lot_ of noise from heartbeat's messaging core /
flow-control during the message spikes.
http://hg.linux-ha.org/dev/rev/69f0395c2ead seems to fix some of this
for me.

Now I do get lost packets on the wire, but heartbeat retransmits nicely,
so nothing actually seems to go wrong ...

Probably this means that MAXMISSING and FLOWCONTROL_LIMIT might require
tuning. It'd be so much nicer if the messaging core was slightly more
adaptive, but as we're moving towards openAIS, I'm not sure whether
that's worth investigating in too much detail.


Regards,
Lars



Re: [Linux-ha-dev] cl_log dropping messages

2007-10-25 Thread Lars Marowsky-Bree
On 2007-10-25T16:25:30, Lars Marowsky-Bree [EMAIL PROTECTED] wrote:

> http://hg.linux-ha.org/dev/rev/69f0395c2ead seems to fix some of this
> for me.

BTW, I was able to conclude a 100 cycle run with that patch applied on 7
nodes, and absolutely not a single BadNews, which is a first.


Regards,
Lars



Re: [Linux-ha-dev] cl_log dropping messages

2007-10-25 Thread Lars Marowsky-Bree
On 2007-10-25T19:39:23, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

> > Probably this means that MAXMISSING and FLOWCONTROL_LIMIT might require
> > tuning.
> Since both directly depend on MAXMSGHIST, I guess that it should
> be OK as it is.

That is exactly why. Those thresholds are fixed by a compile-time choice,
but I guess in practice, when to engage flow control and when MAXMISSING
is too high ought to depend on the actual network characteristics.
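
To make the contrast concrete, here is a rough sketch of the two approaches. Everything in it is an assumption for illustration: the value 500, the ratios used to derive the limits, and the function names are hypothetical and not taken from the heartbeat source. The only point is that a limit derived from a constant cannot react to a slow link, while one scaled by a measured drain rate can.

```python
# Hypothetical history size; the real value in heartbeat may differ.
MAXMSGHIST = 500


def static_thresholds(history=MAXMSGHIST):
    """Limits derived purely from the compile-time history size.

    The ratios here are assumptions for illustration only.
    """
    return {
        "flowcontrol_limit": history // 2,
        "max_missing": history - 1,
    }


def adaptive_flowcontrol_limit(history, drain_rate, send_rate):
    """Engage flow control earlier when the network drains more slowly
    than the sender produces messages (both rates in messages/second).

    Hypothetical helper; never exceeds the static limit of history // 2.
    """
    if drain_rate <= 0 or send_rate <= 0:
        raise ValueError("rates must be positive")
    ratio = min(1.0, drain_rate / send_rate)
    limit = int(history * ratio / 2)
    return max(1, min(limit, history // 2))


if __name__ == "__main__":
    print(static_thresholds())
    # Network keeps up with the sender: same limit as the static case.
    print(adaptive_flowcontrol_limit(MAXMSGHIST, drain_rate=10, send_rate=10))
    # Network drains at a quarter of the send rate: throttle much earlier.
    print(adaptive_flowcontrol_limit(MAXMSGHIST, drain_rate=10, send_rate=40))
```

How the drain rate would actually be measured (for example from acknowledgements or retransmit counts) is left open here.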


Regards,
Lars
