Re: [Linux-ha-dev] cl_log dropping messages
On 2007-11-06T09:53:13, Alan Robertson [EMAIL PROTECTED] wrote:

> Cutting out that debug should be OK - or raising it to happen if debug
> is 1 would probably also be OK. If you're seeing this happen a lot,
> that's not a good thing. Getting behind 200 messages seems like a lot
> to me - off hand.

It's not. It happens very quickly. Alan, when have you last run CTS on a
9 node cluster? ;-)

Network transmissions take time, and a longer time than the transmission
from one process to the next via IPC - when the TE initiates the full
load of actions nearly instantaneously, the network layer lags behind.
And keep in mind that this is all from the DC, so the DC's network
connectivity is a choke point.

Here, we have ~250 actions - the messages started being dumped at ~200
messages or so. Once past that threshold, the logging mania started,
_and_ the logging mania contributed to making the MCP even slower, so it
was more likely to stay behind.

> Are you also seeing retransmissions?

A few (it also overflows the network buffers), but not very many.

> Just because you have a lot of processors doesn't mean that Xen is
> scheduling you properly. You have two different schedulers going on
> here - so the opportunities for problems go up rather rapidly.

That is not quite true; the hypervisor has essentially scheduled the
guests to one CPU each, and doesn't need to interfere with the local
scheduling. They are - except for the networking, and other shared
resources, of course - running concurrently and independently, and have
a full CPU to themselves.

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
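
The feedback loop described in this message (a burst outruns the network layer, the backlog crosses a threshold, and the resulting log output slows draining further) can be illustrated with a toy model. This is a minimal sketch and not heartbeat code: the function name, the tick-based timing, and every number other than the ~250 message burst and the ~200 message threshold are assumptions chosen purely for illustration.

```python
def simulate_backlog(burst, drain_per_tick, warn_threshold, log_cost):
    """Toy model of a send queue draining after one burst.

    burst:          messages queued at once (e.g. the TE firing all actions)
    drain_per_tick: messages the network layer can send per tick
    warn_threshold: backlog size above which a warning is logged each tick
    log_cost:       sends lost per tick while warnings are being logged
                    (hypothetical stand-in for the cost of the extra logging)

    Returns (ticks_to_drain, warnings_logged).
    """
    backlog = burst
    ticks = 0
    warnings = 0
    while backlog > 0:
        rate = drain_per_tick
        if backlog > warn_threshold:
            # Over the threshold: log a warning, which eats into send capacity.
            warnings += 1
            rate = max(1, drain_per_tick - log_cost)
        backlog -= min(rate, backlog)
        ticks += 1
    return ticks, warnings


if __name__ == "__main__":
    # Same burst and threshold, without and with a logging penalty.
    print(simulate_backlog(250, 10, 200, 0))
    print(simulate_backlog(250, 10, 200, 5))
```

With these made-up rates, the logging penalty stretches the drain from 25 to 30 ticks and doubles the number of warnings from 5 to 10, which is the shape of the effect described above: the noisier the over-threshold path, the longer the sender stays behind.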
Re: [Linux-ha-dev] cl_log dropping messages
Lars Marowsky-Bree wrote:

> On 2007-10-25T16:25:30, Lars Marowsky-Bree [EMAIL PROTECTED] wrote:
>
>> http://hg.linux-ha.org/dev/rev/69f0395c2ead seems to fix some of this
>> for me.
>
> BTW, I was able to conclude a 100 cycle run with that patch applied on
> 7 nodes, and absolutely not a single BadNews, which is a first.

Cutting out that debug should be OK - or raising it to happen if debug
is 1 would probably also be OK.

If you're seeing this happen a lot, that's not a good thing. Getting
behind 200 messages seems like a lot to me - off hand.

Are you also seeing retransmissions?

Just because you have a lot of processors doesn't mean that Xen is
scheduling you properly. You have two different schedulers going on
here - so the opportunities for problems go up rather rapidly.

-- 
Alan Robertson [EMAIL PROTECTED]
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William Wilberforce
[Linux-ha-dev] cl_log dropping messages
Hi all,

on my 7 node cluster, I see the occasional - every 5-10 tests - bunch of
messages dropped during a burst; usually on the DC (what a surprise), on
the order of ~200 messages dropped per incident. This occurs only with
debug 1, and only above 5 nodes or so.

So yes, my cluster is fully virtualized. However, the physical host has
8 x 2.66 GHz cores; the guests don't write the messages to their own
image, but relay it via syslog-ng to the host, where it gets written to
a RAM disk, so no IO bottleneck. Each guest essentially has 1 core to
itself + 512MB RAM. The network is fully virtual, so I can't be hitting
that limit.

syslog-ng is running with a fifosize of 4 lines, and I upped logd to
2048 sendqlen/recvqlen.

As a data point: I was experiencing the very same drop message rate and
doubled the buffers on syslog-ng and logd then; no change.

Any suggestions?

Regards,
    Lars
Re: [Linux-ha-dev] cl_log dropping messages
Hi,

On Thu, Oct 25, 2007 at 12:59:14PM +0200, Lars Marowsky-Bree wrote:

> Hi all,
>
> on my 7 node cluster, I see the occasional - every 5-10 tests - bunch
> of messages dropped during a burst; usually on the DC (what a
> surprise), on the order of ~200 messages dropped per incident. This
> occurs only with debug 1, and only above 5 nodes or so.
>
> So yes, my cluster is fully virtualized. However, the physical host
> has 8 x 2.66 GHz cores; the guests don't write the messages to their
> own image, but relay it via syslog-ng to the host, where it gets
> written to a RAM disk, so no IO bottleneck. Each guest essentially has
> 1 core to itself + 512MB RAM. The network is fully virtual, so I can't
> be hitting that limit.

Probably your xen is better than mine. Here I have a transfer rate
(guest to host) at times around 10mbit.

> syslog-ng is running with a fifosize of 4 lines, and I upped logd to
> 2048 sendqlen/recvqlen.
>
> As a data point: I was experiencing the very same drop message rate
> and doubled the buffers on syslog-ng and logd then; no change.
>
> Any suggestions?

Remember this one:

http://lists.linux-ha.org/pipermail/linux-ha-dev/2007-April/014378.html

Cheers,

Dejan
Re: [Linux-ha-dev] cl_log dropping messages
On 2007-10-25T13:11:56, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

>> The network is fully virtual, so I can't be hitting that limit.
>
> Probably your xen is better than mine. Here I have a transfer rate
> (guest to host) at times around 10mbit.

Paravirtualized is quite fast. I also don't connect my Xen network to
the external NIC (which is the default), but use a decoupled internal
bridged network, which is quite a bit faster, and also doesn't flood my
LAN ;-)

I couldn't figure it out myself, but it is described easily here:
http://en.opensuse.org/Xen3_and_a_Virtual_Network

With scp (and that includes encryption) I get ~35 MB/s between the DomU
and Dom0. Certainly plenty for logging.

>> As a data point: I was experiencing the very same drop message rate
>> and doubled the buffers on syslog-ng and logd then; no change.
>>
>> Any suggestions?
>
> Remember this one:
>
> http://lists.linux-ha.org/pipermail/linux-ha-dev/2007-April/014378.html

I remember this, but I'm not quite sure I see how this relates to a
performance issue.

I eventually noticed that all those lost messages were not harmful (ie,
CTS wasn't missing any of the patterns it was looking for), but that I
was getting a _lot_ of noise from heartbeat's messaging core /
flow-control during the message spikes.

http://hg.linux-ha.org/dev/rev/69f0395c2ead seems to fix some of this
for me.

Now I do get lost packets on the wire, but heartbeat retransmits nicely,
so nothing actually seems to go wrong ...

Probably this means that MAXMISSING and FLOWCONTROL_LIMIT might require
tuning. It'd be so much nicer if the messaging core was slightly more
adaptive, but as we're moving towards openAIS, I'm not sure whether
that's worth investigating in too much detail.

Regards,
    Lars
Re: [Linux-ha-dev] cl_log dropping messages
On 2007-10-25T16:25:30, Lars Marowsky-Bree [EMAIL PROTECTED] wrote:

> http://hg.linux-ha.org/dev/rev/69f0395c2ead seems to fix some of this
> for me.

BTW, I was able to conclude a 100 cycle run with that patch applied on 7
nodes, and absolutely not a single BadNews, which is a first.

Regards,
    Lars
Re: [Linux-ha-dev] cl_log dropping messages
On 2007-10-25T19:39:23, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

>> Probably this means that MAXMISSING and FLOWCONTROL_LIMIT might
>> require tuning.
>
> Since both directly depend on MAXMSGHIST, I guess that it should be OK
> as it is.

Exactly why. Those thresholds depend on a compile-time choice, but I
guess in practice, when to engage flowcontrol and when MAXMISSING is too
high ought to depend on the actual network characteristics.

Regards,
    Lars
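
The contrast drawn in this message, between limits fixed at compile time and limits that follow the observed network, could look roughly like the sketch below. It is an illustration only and does not reproduce heartbeat's code: the value given to `MAXMSGHIST`, the half-of-history ratio, and the helper names `static_limits`, `ewma` and `adaptive_flowcontrol_limit` are all hypothetical. The adaptive variant assumes one possible policy, namely allowing as much backlog as the measured drain rate can clear within a target delay, capped by the history size.

```python
MAXMSGHIST = 500  # illustrative value only, not taken from the heartbeat source


def static_limits(history=MAXMSGHIST):
    """Compile-time style: both limits are fixed fractions of the history.

    The 1/2 ratio is a placeholder, not heartbeat's actual definition.
    Returns (flowcontrol_limit, max_missing).
    """
    return history // 2, history // 2


def ewma(previous, sample, alpha=0.2):
    """Smooth a noisy drain-rate measurement (messages per second)."""
    return (1 - alpha) * previous + alpha * sample


def adaptive_flowcontrol_limit(drain_rate, target_delay, history=MAXMSGHIST,
                               floor=10):
    """Permit only as much backlog as can be drained within target_delay.

    drain_rate:   smoothed messages per second the network actually sustains
    target_delay: seconds of backlog considered acceptable
    The result is clamped to [floor, history], so it never exceeds what the
    message history could cover for retransmission.
    """
    return max(floor, min(history, int(drain_rate * target_delay)))


if __name__ == "__main__":
    rate = 100.0
    for sample in (80.0, 120.0, 40.0):  # made-up measurements
        rate = ewma(rate, sample)
        print(round(rate, 1), adaptive_flowcontrol_limit(rate, 2.0))
```

On a fast link the adaptive limit saturates at the history size, while on a slow or congested one it drops, so flow control engages earlier. Whether that policy would suit heartbeat's messaging core is an open question the thread does not settle.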