from:"Steven Dake"

Re: [Openais] totem token timeout increase

2014-08-27 Thread Steven Dake


On 08/27/2014 02:20 PM, Vasil Valchev wrote:

Hello Steve,

If I just increase token_retransmits_before_loss, without increasing 
token, won't it just send more tokens during the same time? For 
example 8 in 30s instead of 4?
The last few times the network interruption wasn't longer than a 
minute, last time the cluster was even going to reform, but the 
fencing was already initiated by fenced.


I want to allow a bit more time in which the nodes can resume 
communication and thought increasing token timeout should do it. Do you 
mean to also increase the token_retransmits_before_loss proportionally?



Vasil,

I am not sure why you would have a network disruption lasting 4 lost tokens, 
but transmitting 8 gives a better chance that one will get through.  UDP (the 
transport used) can lose those retransmitted tokens. Increasing the 
token timer will allow more time for the network to recover from whatever 
action you're performing that takes it out of service.


Regards,
-steve
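
The arithmetic behind this exchange can be sketched in C. This is only an illustration: it assumes the relationship that corosync documents for its default retransmit interval, token / (token_retransmits_before_loss_const + 0.2), and the openais build shipped with RHEL 5 may compute it differently. The function name is hypothetical and is not part of any openais or corosync API.

```c
#include <assert.h>

/*
 * Hypothetical helper (not an openais/corosync function).
 * Returns the interval in milliseconds between token retransmits,
 * ASSUMING interval = token / (retransmits_before_loss + 0.2).
 */
int token_retransmit_interval_ms(int token_ms, int retransmits_before_loss)
{
	assert(token_ms > 0);
	assert(retransmits_before_loss > 0);

	/* Truncate toward zero, as an integer millisecond timer would. */
	return (int)(token_ms / (retransmits_before_loss + 0.2));
}
```

Under that assumption, a 30000 ms token with 4 retransmits gives an interval of roughly 7.1 s, and raising the count to 8 shortens it to roughly 3.7 s while the overall 30 s window stays the same. Raising the token to 90000 ms with 4 retransmits stretches both the window and the interval (roughly 21.4 s).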


BR,
Vasil


On Thu, Aug 28, 2014 at 12:03 AM, Steven Dake <sd...@redhat.com> wrote:


On 08/26/2014 07:17 AM, Vasil Valchev wrote:

Hello all,

I have a RHEL 5 (openais) cluster with intermittent issues on the
heartbeat network, and was thinking to increase the totem token
value to 90s (currently is 30s).

Are there any negative effects from this change, apart from the
cluster taking longer to detect a node is failed - can this cause
data corruption for example or something like that?


BR,
Vasil Valchev


Vasil,

I doubt going from 30s to 90s would make a difference with
healthchecking performed.  You may be better off increasing
token_retransmits_before_loss_const.  Also make sure you're running
the latest z stream.

Regards,
-steve



___
Openais mailing list
Openais@lists.linux-foundation.org  
<mailto:Openais@lists.linux-foundation.org>
https://lists.linuxfoundation.org/mailman/listinfo/openais






Re: [Openais] totem token timeout increase

2014-08-27 Thread Steven Dake


On 08/26/2014 07:17 AM, Vasil Valchev wrote:

Hello all,

I have a RHEL 5 (openais) cluster with intermittent issues on the 
heartbeat network, and was thinking to increase the totem token value 
to 90s (currently is 30s).


Are there any negative effects from this change, apart from the 
cluster taking longer to detect a node is failed - can this cause data 
corruption for example or something like that?



BR,
Vasil Valchev


Vasil,

I doubt going from 30s to 90s would make a difference with 
healthchecking performed.  You may be better off increasing 
token_retransmits_before_loss_const.  Also make sure you're running the 
latest z stream.


Regards,
-steve




Re: [Openais] Request of information about rrp mode passive versus rrp mode active

2013-11-27 Thread Steven Dake


On 11/27/2013 08:20 AM, Moullé Alain wrote:

Hi,

the man page of corosync.conf gives :

"Active replication offers slightly lower latency from transmit to 
delivery in faulty network environments but with less performance.
Passive replication may nearly double the speed of the totem protocol 
if the protocol doesn’t become cpu bound"


OK but knowing that, could someone give the pro & cons for passive 
mode, and the pro & cons for active mode,

and/or how must we choose the real better mode for a HA cluster ?



If you care about latency, use active; if you care about throughput, use 
passive. From a functional perspective they both deliver the same 
functionality. I have seen reports on this list that active mode may be 
broken at the moment, though, so it is best to check with Jan Friesse.


Regards
-steve


Thanks a lot
Alain




Re: [Openais] question stack Pacemaker/corosync on SLES11

2012-03-08 Thread Steven Dake

On 03/08/2012 03:10 AM, Tim Serong wrote:
> On 03/08/2012 07:23 PM, alain.mou...@bull.net wrote:
>> Hi Darren
>> And thanks. I did indeed find that the stack is started with the service 
>> "openais": there are no longer 'corosync' or 'pacemaker' LSB scripts.
>> But I'm surprised, because I seem to remember that one or two years ago it 
>> was said that corosync was a sort of 'extract' of openais, made just to 
>> isolate the code needed for the Pacemaker/corosync stack to work ... and 
>> now it seems that everything is managed by openais again ... so I don't 
>> completely understand the evolution of the 'architecture' ... but perhaps 
>> I am wrong? 
>> Could you clarify the "history" for me ?
> 
> openais 0.80.x (before the creation of corosync) shipped for SLES 11.
> You started it by running /etc/init.d/openais start.
> 
> corosync 1.2.x + openais 1.1.x shipped for SLES 11 SP1.  corosync is the
> core messaging layer, and openais just includes some extra magic for
> OCFS2 etc.  But (on SLES at least) the /etc/init.d/openais init script
> remained, even though that init script now starts corosync.
> 
> The same is true now for SLES 11 SP2 (albeit with corosync 1.4.1);
> corosync is what's running, but you use the openais init script to start it.
> 
> So your history of project splits and whatnot is correct, you're just
> being misled by the name of an init script :)
> 
> Regards,
> 
> Tim
> 
>>
>>
>> De :Darren Thompson 
>> A : alain.mou...@bull.net
>> Date :  07/03/2012 21:03
>> Objet : Re: [Openais] question stack Pacemaker/corosync on SLES11
>>
>>
>>
>> Alain
>> With SLES you also need to install the OpenAIS stack as that is where the 
>> init.d service comes from etc.
>> Darren
>> On Mar 8, 2012 2:14 AM,  wrote:
>> Hi, 
>>
>> In rpm corosync-1.4.1-4 on rhel are installed : 
>> /etc/rc.d/init.d/corosync 
>>
>> but in rpm corosync-1.4.1-0.11.29 on SLES 11, I don't have anything 
>> installed 
>> as init.d service, even in /etc/init.d, and I checked the rpm, there is no 
>> more /etc/rc.d/init.d/corosync 
>>
>> same thing for pacemaker , the rpm pacemaker-1.1.6-1.25.1 on SLES does not 
>> install the lsb script 
>> pacemaker as in the rpm pacemaker-1.1.6-3 on rhel 
>>
>> could someone tell me how to start the stack Pacemaker/corosync service 
>> with the pacemaker-1.1.6-1.25.1/corosync-1.4.1-0.11.29 on SLES 11 ? 
>>
Another thing to keep in mind is a particular company's product trails
upstream by some time interval...  (which is good, upstream has bugs,
product should not :)

Regards
-steve
>> Many thanks 
>> Alain
> 
> 


[Openais] change to commit policy

2011-09-09 Thread Steven Dake

Russell pointed out a problem with his recent patch for mutexes.  It is
only applicable to 1.4/1.3 branches.  It is not applicable to master.
Currently our policy is that all patches go into master, and 1 person is
responsible for backports to other branches.  This leaves out the
important case above that Russell ran into.

As a result, if the patch is not suitable for master because of our
de-threading of the software, please commit to flatiron-1.4 and then git
cherry-pick to the flatiron-1.3 branch.

Regards
-steve

Re: [Openais] Configuration Hash Table - API proposal

2011-09-07 Thread Steven Dake

On 09/01/2011 05:17 AM, Jan Friesse wrote:
> Included is API proposal for replacement of objdb/confdb API. It should
> keep all good things there (triggers, ...), remove hard to use bits
> (like whole object idea) and improve existing things (like typing)
> 
> As I wrote before, the configuration file will also need to change.
> 
> Proposed change is
> 
> ht_key value
> ht_key. {
>   ht_subkey value2
> }
> 

We absolutely can't change the config file - it will cause massive
confusion in the user base.  Although changing the internal
representation in whatever way is necessary seems fine.  If the parsing
code has to be suboptimal that is preferable to confusing the user base.

> which is (internally) converted to
> ht_key value
> ht_key.ht_subkey value2
> 
> Also value should become typed, so
> value ~= ^-?[0-9]+$ = integer 32 bits, with modifiers like l, ll, ...
> value ~= ^-?[0-9]*\.[0-9]*$ = float (or double) (should also handle all
> variants with E .. basically C format)
> value = "[:alpha:]*" = string
> value = bin:base64 encoded binary data
> 
> Regards,
>   Honza
> 
> 
The API looks really solid.  I don't totally like the error returns in
cht_get and set calls, but understand the need for the programmer to be
able to determine what went wrong with the API call.  If we didn't have
typing we wouldn't need error codes, but I am pretty certain we need
typing in corosync (but perhaps not the underlying libqb).  A typical
map doesn't need an error code because it doesn't care about errors
(worst case error, malloc = whole system going to blow up anyway).  The
only other option is asserting in libs, which is evil, so we should
count that out.

On the topic of prefix, this is a great feature, but doesn't fit in well
with a hash table.  Another option is to use direct integration in the
skiplist in libqb to implement this.

Since what you are delivering is not really a hash table, but more like
a map, you may want to consider a rename to

cs_map or similar

Really great work

Feed missing requirements for libqb into Angus's work on libqb when you
start progressing with implementation.

Regards
-steve
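
As an illustration of the value typing discussed in this thread, here is a minimal C sketch that classifies a configuration value string along the lines of the proposed patterns. The enum, the function name, and the "unknown" fallback are hypothetical and are not taken from the proposed API. The sketch deliberately ignores the integer size modifiers (l, ll) and the exponent forms of floats, and it is stricter than the proposed float pattern in that it requires at least one digit.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical type tags; the real proposal may name these differently. */
enum cht_value_type {
	CHT_VALUE_INT,
	CHT_VALUE_FLOAT,
	CHT_VALUE_STRING,
	CHT_VALUE_BINARY,
	CHT_VALUE_UNKNOWN
};

/* Classify a raw config value string into one of the tags above. */
enum cht_value_type cht_classify_value(const char *value)
{
	size_t len;
	size_t i = 0;
	int digits = 0;
	int dots = 0;

	assert(value != NULL);
	len = strlen(value);

	/* "bin:" prefix marks base64 encoded binary data. */
	if (strncmp(value, "bin:", 4) == 0) {
		return CHT_VALUE_BINARY;
	}
	/* A value enclosed in double quotes is a string. */
	if (len >= 2 && value[0] == '"' && value[len - 1] == '"') {
		return CHT_VALUE_STRING;
	}
	/* Optional leading minus sign, then only digits and dots. */
	if (value[i] == '-') {
		i++;
	}
	for (; i < len; i++) {
		if (isdigit((unsigned char)value[i])) {
			digits++;
		} else if (value[i] == '.') {
			dots++;
		} else {
			return CHT_VALUE_UNKNOWN;
		}
	}
	if (digits == 0 || dots > 1) {
		return CHT_VALUE_UNKNOWN;
	}
	return dots == 0 ? CHT_VALUE_INT : CHT_VALUE_FLOAT;
}
```

A typed getter could then compare the stored tag with the type the caller asked for and return an error code on mismatch, which is the case where the error returns in the get and set calls become necessary.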


> 

Re: [Openais] Installing corosync from source

2011-09-07 Thread Steven Dake

On 09/06/2011 06:05 PM, Nick Khamis wrote:
> Hello Everyone,
> 
> We are moving everything over from heartbeat, after the last update
> brought the cluster to its knees... What we are interested in is
> using corosync and pacemaker to LVS mysql and asterisk. We have not
> looked into asterisk yet, and we don't know if it's even possible
> (i.e. if there is an ocf already created).
> Regardless, our attempt to install corosync from source using the
> directions found at "http://www.clusterlabs.org/wiki/Install" seemed
> to go OK; however, nothing was created. We had to manually copy:
> cp /usr/etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
> cp /usr/etc/init.d/corosync /etc/init.d/corosync
> We have a long way to go; your help is greatly appreciated.
> 
> simple startup conf
> 
> totem {
> version: 2
> secauth: off
> threads: 0
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.1.1
> mcastaddr: 226.94.1.1
> mcastport: 5405
> ttl: 1
> }
> }
> 
> logging {
> fileline: off
> to_stderr: no
> to_logfile: yes
> to_syslog: yes
> logfile: /var/log/cluster/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> amf {
> mode: disabled
> }
> 
> Is there a script that I am supposed to run that will create
> everything? Do I need to install OpenAIS as well? We downloaded the
> latest version of resource agents, cluster glue, corosync, pacemaker.
> I know we can install everything from gentoo source tree, but we are
> trying to avoid that...
> 

There is no script to build everything.  Compiling all that code from
source is sure to be painful.  Corosync is pretty straightforward
(autogen - configure - make install) but getting everything else
operational may be challenging.

You shouldn't need the openais package.

Regards
-steve
> Your help is greatly appreciated,
> 
> Nick.

Re: [Openais] cpg behavior on transitional membership change

2011-09-02 Thread Steven Dake

On 09/02/2011 12:59 AM, Vladislav Bogdanov wrote:
> Hi all,
> 
> I'm trying to further investigate problem I described at
> https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
> 
> The main problem for me there is that pacemaker first sees transitional
> membership with left nodes, then it sees stable membership with that
> nodes returned back, and does nothing about that. On the other hand,
> dlm_controld sees CPG_REASON_NODEDOWN events on CPGs related to all its
> lockspaces (at the same time with transitional membership change) and
> stops kernel part of each lockspace until whole cluster is rebooted (or
> until some other recovery procedure which unfortunately does not happen

I believe fenced should reboot the node, but only if there is quorum.
It is possible your cluster has lost quorum during this series of
events.  I have copied Dave for his feedback on this point.

> :( ). It neither requests to fence left node nor recovers when node is
> returned on next stable membership.
> 
> Could anyone please help me to understand, what is a correct CPG
> behavior on membership change?
> From what I see, CPG emits CPG_REASON_NODEDOWN event on both
> transitional and stable membership if there is node which left the
> cluster. Am I correct here? And is that a right thing if I am?
> 

Line #'s where this happens?

> If yes, is there a way to detect membership change type (transitional or
> stable) through CPG API?
> 

A transitional membership will always contain a subset of the previous
regular membership.  This means it will always contain 0 or more left
members.  A transitional membership means "The membership of nodes
transitioning from previous regular membership to new regular membership".

A regular configuration is where members are added to the configuration
when detected.  A transitional membership never has nodes added to it.

> Hoping for answer,
> 

It would be nice if cpg and totem had a direct relationship in how their
transitional and regular configurations were generated, but this doesn't
happen currently.  I am not sure if there is a good reason for this.

Regards
-steve
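
The subset rule described above can be sketched in C. This is an illustration only, assuming the transitional membership is the set of nodes present in both the previous and the new regular membership; it is not the code totem or cpg actually uses, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if nodeid appears in list, 0 otherwise. */
int member_in_list(const unsigned int *list, size_t entries,
	unsigned int nodeid)
{
	size_t i;

	for (i = 0; i < entries; i++) {
		if (list[i] == nodeid) {
			return 1;
		}
	}
	return 0;
}

/*
 * Hypothetical helper: split the previous regular membership into the
 * transitional membership (nodes that also appear in the new regular
 * membership) and the left list (nodes that do not).  Nodes that exist
 * only in the new membership are never added to the transitional list.
 * trans_list and left_list must each have room for old_entries items.
 * Returns the number of transitional members.
 */
size_t transitional_membership(
	const unsigned int *old_list, size_t old_entries,
	const unsigned int *new_list, size_t new_entries,
	unsigned int *trans_list,
	unsigned int *left_list, size_t *left_entries)
{
	size_t trans_entries = 0;
	size_t i;

	assert(trans_list != NULL && left_list != NULL && left_entries != NULL);

	*left_entries = 0;
	for (i = 0; i < old_entries; i++) {
		if (member_in_list(new_list, new_entries, old_list[i])) {
			trans_list[trans_entries++] = old_list[i];
		} else {
			left_list[(*left_entries)++] = old_list[i];
		}
	}
	return trans_entries;
}
```

With a previous membership of nodes 1, 2, 3 and a new regular membership of nodes 1, 3, 4, the transitional membership is 1, 3 and the left list is 2. Node 4 shows up only as a joined member of the following regular configuration.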

> Best regards,
> Vladislav

[Openais] [PATCH] Allow nss building conditionally with rpmbuild operation

2011-09-02 Thread Steven Dake

Signed-off-by: Steven Dake 
---
 corosync.spec.in |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/corosync.spec.in b/corosync.spec.in
index 74ab851..5c651aa 100644
--- a/corosync.spec.in
+++ b/corosync.spec.in
@@ -11,6 +11,7 @@
 %bcond_with snmp
 %bcond_with dbus
 %bcond_with rdma
+%bcond_with nss
 
 Name: corosync
 Summary: The Corosync Cluster Engine and Application Programming Interfaces
@@ -36,7 +37,9 @@ Conflicts: openais <= 0.89, openais-devel <= 0.89
 %if %{buildtrunk}
 BuildRequires: autoconf automake
 %endif
+%if %{with nss}
 BuildRequires: nss-devel
+%endif
 %if %{with rdma}
 BuildRequires: libibverbs-devel librdmacm-devel
 %endif
@@ -83,6 +86,11 @@ export rdmacm_LIBS=-lrdmacm \
 %if %{with rdma}
--enable-rdma \
 %endif
+%if %{with nss}
+   --enable-nss \
+%else
+   --disable-nss \
+%endif
--with-initddir=%{_initrddir}
 
 make %{_smp_mflags}
-- 
1.7.6


[Openais] Sorry for noise on mailing list

2011-09-01 Thread Steven Dake

The mailing list server had a short outage.  Apologies for noise on the
mailing list.

Regards
-steve

Re: [Openais] Corosync 2.0 Feature Request: Notification of the status change for each rings via SNMP

2011-09-01 Thread Steven Dake

On 08/31/2011 12:37 AM, Keisuke MORI wrote:
> Hi,
> 
> We would like to be notified when a ring goes down or comes up, so that
> we know when we need to check and repair the network interfaces.
> The notification should be sent via SNMP traps to co-operate with
> various kinds of NMSs.
> 
> A proposed implementation was included in the patch posted at March
> 2010 by Sato Yuki-san:
> https://lists.linux-foundation.org/pipermail/openais/2010-March/014036.html
> 
> The following MIB item in the patch is related to this.
> 
> +corosyncNoticeIfaceEntry ::= SEQUENCE {
> +corosyncNoticeIfaceIndexINTEGER,
> +corosyncNoticeIface OCTET STRING,
> +corosyncNoticeIfaceStatus   OCTET STRING
> +}
> 
> The notification should be sent:
>  1) when one ring detects a failure (enters the FAULTY state)
>  2) when the failed ring recovers
>  3) when the second (last) ring also detects a failure and is no longer
> usable to communicate with others
>  4) when the second ring recovers
> 
> It is also preferable that the current status can be checked by some
> command line tools or an user-customized service plug-in.
> The proposed patch above tried to store the status into the objdb to
> achieve this but the implementation details does not matter.
> 
> We would be glad if you would consider it.
> 

Yup this seems reasonable.  The way that data comes out now is through
corosync-notifyd vs a direct snmp integration into the corosync process.
 As a result, the linked patch needs to be changed to match the new model.

Regards
-steve

> Regards,
> Keisuke MORI
> 
> 2011/7/22 Steven Dake :
>> The Corosync flatiron 1.y series had many more features added than I
>> would have liked, but the development team feels the 1.y series
>> addresses any major gaps users of the software have had.  As a result,
>> we are freezing any future feature development of the flatiron branch
>> permanently.  We will continue to maintain z streams (1.4.z) bug fixes
>> for many years to come in a robust and aggressive fashion.
>>
>> Now that the flatiron chapter of Corosync is finished, we can move on to
>> new r&d work around Corosync 2.0.  There are a few RFEs floating around
>> in bugzilla and the TODO list.  This is your chance to provide feedback
>> about feature development you would like to see in Corosync.
>>
>> The overall theme for Corosync 2.0 is focused around trimming the fat
>> and simplifying the implementation without major performance regressions.
>>
>> The developers will take feature submission suggestions until Aug 31, at
>> which point we will prioritize features for 2.0 and close feature
>> submission requests.
>>
>> Regards
>> -steve
>>
> 
> 
> 


[Openais] test 3

2011-09-01 Thread Steven Dake

test 3

[Openais] test

2011-09-01 Thread Steven Dake



[Openais] test two

2011-09-01 Thread Steven Dake



[Openais] [PATCH] Ignore memb_join messages during flush operations

2011-09-01 Thread Steven Dake

A memb_join operation that occurs during flushing can result in an
entry into the GATHER state from the RECOVERY state.  This results in the
regular sort queue being used instead of the recovery sort queue, resulting
in a segfault.

Signed-off-by: Steven Dake 
---
 exec/totemudp.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/exec/totemudp.c b/exec/totemudp.c
index 96849b7..0c12b56 100644
--- a/exec/totemudp.c
+++ b/exec/totemudp.c
@@ -90,6 +90,8 @@
 #define BIND_STATE_REGULAR 1
 #define BIND_STATE_LOOPBACK 2
 
+#define MESSAGE_TYPE_MCAST 1
+
 #define HMAC_HASH_SIZE 20
 struct security_header {
unsigned char hash_digest[HMAC_HASH_SIZE]; /* The hash *MUST* be first in the data structure */
@@ -1172,6 +1174,7 @@ static int net_deliver_fn (
int res = 0;
unsigned char *msg_offset;
unsigned int size_delv;
+   char *message_type;
 
if (instance->flushing == 1) {
iovec = &instance->totemudp_iov_recv_flush;
@@ -1234,6 +1237,16 @@ static int net_deliver_fn (
}
 
/*
+* Drop all non-mcast messages (more specifically join
+* messages should be dropped)
+*/
+   message_type = (char *)msg_offset;
+   if (instance->flushing == 1 && *message_type != MESSAGE_TYPE_MCAST) {
+   iovec->iov_len = FRAME_SIZE_MAX;
+   return (0);
+   }
+   
+   /*
 * Handle incoming message
 */
instance->totemudp_deliver_fn (
-- 
1.7.6
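
The filtering decision this patch adds can be isolated as a small stand-alone predicate. The sketch below is only an illustration: the function name is hypothetical, it assumes the first byte of the message holds the message type (as the cast of msg_offset in the patch implies), and dropping an empty message is an extra assumption that the patch itself does not address.

```c
#include <assert.h>
#include <stddef.h>

#define MESSAGE_TYPE_MCAST 1	/* same value the patch defines */

/*
 * Hypothetical predicate (not a function in totemudp.c).
 * Returns 1 if a received message should be dropped, 0 if it should
 * be handed to the deliver function.  While flushing, only mcast
 * messages are let through, so join messages are discarded.
 */
int should_drop_during_flush(int flushing, const unsigned char *msg,
	size_t msg_len)
{
	assert(msg != NULL);

	if (!flushing) {
		return 0;
	}
	if (msg_len == 0) {
		/* Assumption: nothing to classify, so drop it. */
		return 1;
	}
	return msg[0] != MESSAGE_TYPE_MCAST;
}
```

Keeping the check ahead of the deliver call means that a join message received while flushing never reaches the membership state machine, which is what prevents the RECOVERY to GATHER transition described in the commit message.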


Re: [Openais] aisexec crashes with SIGABRT

2011-08-30 Thread Steven Dake

On 08/30/2011 06:39 AM, Christopher A. Kirke wrote:
> Steve,
> 
> i sometimes need to be smacked by the obvious :^)
> 
> updated analysis attached ...
> 

The core happens because the event service from openais dereferences an
object with a reference count of 0.  This shouldn't happen, but I can't
explain why it does.

We stopped "supporting" the SA Forum services last year in this project,
so we may not be able to fix the problem.  Still, if you could provide
more details of the exact scenario that triggers it, that might be
helpful in reproducing the issue.

Regards
-steve

> Thanks,
> --
> Chris Kirke
> Director - Systems Architecture
> Multi Service Corporation
> www.multiservice.com <http://www.multiservice.com>
> +1.913.663.9483 (direct)
> +1.816.718.0468 (mobile)
> +1.913.217.9318 (fax)
> 
> 
> 
> 
> On Mon, Aug 29, 2011 at 22:43, Steven Dake <sd...@redhat.com> wrote:
> 
> On 08/29/2011 06:26 PM, Christopher A. Kirke wrote:
> > Steve,
> >
> 
> The core file was analyzed without debuginfo installed.
> The package you want is something like "openais-debuginfo".  You may
> have to enable the debuginfo yum repo if you have not already.
> 
> Regards
> -steve
> > our setup is a 2-node cluster:
> >
> >   * mstel21 (172.24.100.10) - attached gdb output from here
> >   * mstel22 (172.24.100.12)
> >
> > log file from 172.24.100.12 <http://172.24.100.12>:
> >
> > Aug 29 08:04:49 172.24.100.12 openais[4656]: [TOTEM] The token was
> lost
> > in the OPERATIONAL state.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Receive multicast
> > socket recv buffer size (262142 bytes).
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Transmit
> multicast
> > socket send buffer size (262142 bytes).
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering GATHER
> > state from 2.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering GATHER
> > state from 0.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Creating commit
> > token because I am the rep.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Storing new
> > sequence id for ring ff4
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering COMMIT
> > state.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering RECOVERY
> > state.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] position [0]
> member
> > 172.24.100.12 <http://172.24.100.12>:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] previous ring seq
> > 4080 rep 172.24.100.10
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] aru 425c1 high
> > delivered 425c1 received flag 1
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Did not need to
> > originate any messages in recovery.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Sending
> initial ORF
> > token
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] CLM CONFIGURATION
> > CHANGE
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] New
> Configuration:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] r(0)
> > ip(172.24.100.12)
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Left:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] r(0)
> > ip(172.24.100.10)
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Joined:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] CLM CONFIGURATION
> > CHANGE
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] New
> Configuration:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] r(0)
> > ip(172.24.100.12)
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Left:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Joined:
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [SYNC ] This node is
> within
> > the primary component and will provide service.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering
> > OPERATIONAL state.
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] got nodejoin
> > message 172.24.100.12
> > Aug 29 08:04:50 172.24.100.12 openais[4656]: [EVT  ] Channel
> > device_state, total 1, local 1
>

Re: [Openais] aisexec crashes with SIGABRT

2011-08-30 Thread Steven Dake

On 08/29/2011 06:26 PM, Christopher A. Kirke wrote:
> Steve,
> 

The core file was analyzed without debuginfo installed.
The package you want is something like "openais-debuginfo".  You may
have to enable the debuginfo yum repo if you have not already.

Regards
-steve
> our setup is a 2-node cluster:
> 
>   * mstel21 (172.24.100.10) - attached gdb output from here
>   * mstel22 (172.24.100.12)
> 
> log file from 172.24.100.12 <http://172.24.100.12>:
> 
> Aug 29 08:04:49 172.24.100.12 openais[4656]: [TOTEM] The token was lost
> in the OPERATIONAL state.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Receive multicast
> socket recv buffer size (262142 bytes).  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Transmit multicast
> socket send buffer size (262142 bytes).  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering GATHER
> state from 2.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering GATHER
> state from 0.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Creating commit
> token because I am the rep.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Storing new
> sequence id for ring ff4  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering COMMIT
> state.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering RECOVERY
> state.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] position [0] member
> 172.24.100.12 <http://172.24.100.12>:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] previous ring seq
> 4080 rep 172.24.100.10  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] aru 425c1 high
> delivered 425c1 received flag 1  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Did not need to
> originate any messages in recovery.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] Sending initial ORF
> token  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] CLM CONFIGURATION
> CHANGE  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] New Configuration:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] r(0)
> ip(172.24.100.12)   
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Left:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] r(0)
> ip(172.24.100.10)   
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Joined:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] CLM CONFIGURATION
> CHANGE  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] New Configuration:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] r(0)
> ip(172.24.100.12)   
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Left:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] Members Joined:  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [SYNC ] This node is within
> the primary component and will provide service.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [TOTEM] entering
> OPERATIONAL state.  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [CLM  ] got nodejoin
> message 172.24.100.12  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [EVT  ] Channel
> device_state, total 1, local 1  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [EVT  ] Node r(0)
> ip(172.24.100.12) , count 1  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [EVT  ] Channel mwi, total
> 1, local 1  
> Aug 29 08:04:50 172.24.100.12 openais[4656]: [EVT  ] Node r(0)
> ip(172.24.100.12) , count 1  
> 
> Thanks,
> --
> Chris Kirke
> Director - Systems Architecture
> Multi Service Corporation
> www.multiservice.com <http://www.multiservice.com>
> +1.913.663.9483 (direct)
> +1.816.718.0468 (mobile)
> +1.913.217.9318 (fax)
> 
> 
> 
> 
> On Tue, Aug 23, 2011 at 11:45, Christopher A. Kirke
> <caki...@multiservice.com> wrote:
> 
> Steve,
> 
> appreciate the quick response, only happened to be running
> strace during one of the crashes.
> i've updated /etc/init.d/openais to enable core dump and changed
> openais.conf to run
> as root:asterisk instead of asterisk:asterisk so /var/lib/openais is
> available for writing.
> 
> will post gdb output from the next aisexec crash.
> 
> Thanks,
> --
> Chris Kirke
> Director - Systems Architecture
> Multi Service Corporation
> www.multiservice.com <http://www.multiservice.com>
> +1.913.663.9483  (direct)
> +1.816.718.0468  (mobile)
> +1.913.217.9318  (fax)
> 
> 
> 
> 
> On Mon, Aug 22, 2011 at 12:13, Steven Dake <sd...@redhat.com> wrote:
> 
> On 08/22/2011 09:58 AM, Christopher A. Kirke wrote:
> > currently usin

[Openais] [PATCH] Get rid of hdb usage in totempg.h interface

2011-08-23 Thread Steven Dake

hdb has some expense and is not necessary in the totempg.so runtime.  This
patch removes the dependence on hdb and instead uses a direct pointer.

Signed-off-by: Steven Dake 
---
 exec/main.c  |2 +-
 exec/sync.c  |2 +-
 exec/syncv2.c|2 +-
 exec/totempg.c   |  213 +-
 include/corosync/totem/totempg.h |   20 ++--
 5 files changed, 83 insertions(+), 156 deletions(-)

diff --git a/exec/main.c b/exec/main.c
index fde77da..582f1e2 100644
--- a/exec/main.c
+++ b/exec/main.c
@@ -244,7 +244,7 @@ static void sigabrt_handler (int num)
 
 #define LOCALHOST_IP inet_addr("127.0.0.1")
 
-static hdb_handle_t corosync_group_handle;
+static void *corosync_group_handle;
 
 static struct totempg_group corosync_group = {
.group  = "a",
diff --git a/exec/sync.c b/exec/sync.c
index b9cc84a..ce99129 100644
--- a/exec/sync.c
+++ b/exec/sync.c
@@ -142,7 +142,7 @@ static struct totempg_group sync_group = {
 .group_len  = 4
 };
 
-static hdb_handle_t sync_group_handle;
+static void *sync_group_handle;
 
 struct req_exec_sync_barrier_start {
struct qb_ipc_request_header header;
diff --git a/exec/syncv2.c b/exec/syncv2.c
index f9eebac..8a96615 100644
--- a/exec/syncv2.c
+++ b/exec/syncv2.c
@@ -167,7 +167,7 @@ static struct totempg_group sync_group = {
 .group_len  = 6
 };
 
-static hdb_handle_t sync_group_handle;
+static void *sync_group_handle;
 
 int sync_v2_init (
 int (*sync_callbacks_retrieve) (
diff --git a/exec/totempg.c b/exec/totempg.c
index c5ba01c..a3eee15 100644
--- a/exec/totempg.c
+++ b/exec/totempg.c
@@ -98,7 +98,6 @@
 #include 
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -212,6 +211,8 @@ DECLARE_LIST_INIT(assembly_list_inuse);
 
 DECLARE_LIST_INIT(assembly_list_free);
 
+DECLARE_LIST_INIT(totempg_groups_list);
+
 /*
  * Staging buffer for packed messages.  Messages are staged in this buffer
  * before sending.  Multiple messages may fit which cuts down on the
@@ -230,8 +231,6 @@ static int fragment_continuation = 0;
 
 static struct iovec iov_delv;
 
-static unsigned int totempg_max_handle = 0;
-
 struct totempg_group_instance {
void (*deliver_fn) (
unsigned int nodeid,
@@ -250,6 +249,8 @@ struct totempg_group_instance {
 
int groups_cnt;
int32_t q_level;
+
+   struct list_head list;
 };
 
 DECLARE_HDB_DATABASE (totempg_groups_instance_database,NULL);
@@ -342,7 +343,7 @@ static inline void app_confchg_fn (
int i;
struct totempg_group_instance *instance;
struct assembly *assembly;
-   unsigned int res;
+   struct list_head *list;
 
/*
 * For every leaving processor, add to free list
@@ -354,25 +355,23 @@ static inline void app_confchg_fn (
list_del (&assembly->list);
list_add (&assembly->list, &assembly_list_free);
}
-   for (i = 0; i <= totempg_max_handle; i++) {
-   res = hdb_handle_get (&totempg_groups_instance_database,
-   hdb_nocheck_convert (i), (void *)&instance);
-
-   if (res == 0) {
-   if (instance->confchg_fn) {
-   instance->confchg_fn (
-   configuration_type,
-   member_list,
-   member_list_entries,
-   left_list,
-   left_list_entries,
-   joined_list,
-   joined_list_entries,
-   ring_id);
-   }
 
-   hdb_handle_put (&totempg_groups_instance_database,
-   hdb_nocheck_convert (i));
+   for (list = totempg_groups_list.next;
+   list != &totempg_groups_list;
+   list = list->next) {
+
+   instance = list_entry (list, struct totempg_group_instance, list);
+
+   if (instance->confchg_fn) {
+   instance->confchg_fn (
+   configuration_type,
+   member_list,
+   member_list_entries,
+   left_list,
+   left_list_entries,
+   joined_list,
+   joined_list_entries,
+   ring_id);
}
}
 }
@@ -474,12 +473,11 @@ static inline void app_deliver_fn (
unsigned int msg_len,
int endian_conversion_required)
 {
-   int i;
struct totempg_group_instance *instance;
struct iovec stripped_iovec;
unsigned int adjust_iovec;
-   unsig

[Openais] [PATCH] Remove hdb.h header includes from unnecessary files

2011-08-23 Thread Steven Dake

The files in this patch do not use the hdb.h header.

Signed-off-by: Steven Dake 
---
 exec/totemrrp.c  |1 -
 exec/totemsrp.c  |1 -
 exec/totemudp.c  |1 -
 exec/totemudp.h  |1 -
 exec/totemudpu.c |1 -
 exec/totemudpu.h |1 -
 6 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/exec/totemrrp.c b/exec/totemrrp.c
index 8fe3ef7..73cb996 100644
--- a/exec/totemrrp.c
+++ b/exec/totemrrp.c
@@ -60,7 +60,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 71ccd59..861c75b 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -81,7 +81,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define LOGSYS_UTILS_ONLY 1
 #include 
diff --git a/exec/totemudp.c b/exec/totemudp.c
index ed2f03c..740e246 100644
--- a/exec/totemudp.c
+++ b/exec/totemudp.c
@@ -61,7 +61,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #define LOGSYS_UTILS_ONLY 1
diff --git a/exec/totemudp.h b/exec/totemudp.h
index 6d509c1..de39c81 100644
--- a/exec/totemudp.h
+++ b/exec/totemudp.h
@@ -37,7 +37,6 @@
 
 #include 
 #include 
-#include 
 #include 
 
 #include 
diff --git a/exec/totemudpu.c b/exec/totemudpu.c
index 529c362..21e57c7 100644
--- a/exec/totemudpu.c
+++ b/exec/totemudpu.c
@@ -62,7 +62,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #define LOGSYS_UTILS_ONLY 1
 #include 
diff --git a/exec/totemudpu.h b/exec/totemudpu.h
index 977148f..93b31a0 100644
--- a/exec/totemudpu.h
+++ b/exec/totemudpu.h
@@ -37,7 +37,6 @@
 
 #include 
 #include 
-#include 
 #include 
 
 #include 
-- 
1.7.6

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

[Openais] [PATCH] Use qb_hdb instead of mutex based hdb code

2011-08-22 Thread Steven Dake

Rid ourselves of the mutex usage still in the code base

Signed-off-by: Steven Dake 
---
 include/corosync/hdb.h |  307 +++-
 lcr/Makefile.am|3 +-
 2 files changed, 21 insertions(+), 289 deletions(-)

diff --git a/include/corosync/hdb.h b/include/corosync/hdb.h
index 88742f9..00fa459 100644
--- a/include/corosync/hdb.h
+++ b/include/corosync/hdb.h
@@ -47,36 +47,17 @@
 #include 
 #include 
 #include 
+#include 
 
-typedef uint64_t hdb_handle_t;
+typedef qb_handle_t hdb_handle_t;
 
 /*
  * Formatting for string printing on 32/64 bit systems
  */
-#define HDB_D_FORMAT "%"PRIu64
-#define HDB_X_FORMAT "%"PRIx64
+#define HDB_D_FORMAT QB_HDB_D_FORMAT
+#define HDB_X_FORMAT QB_HDB_X_FORMAT
 
-enum HDB_HANDLE_STATE {
-   HDB_HANDLE_STATE_EMPTY,
-   HDB_HANDLE_STATE_PENDINGREMOVAL,
-   HDB_HANDLE_STATE_ACTIVE
-};
-
-struct hdb_handle {
-   int state;
-   void *instance;
-   int check;
-   int ref_count;
-};
-
-struct hdb_handle_database {
-   unsigned int handle_count;
-   struct hdb_handle *handles;
-   unsigned int iterator;
-void (*destructor) (void *);
-   pthread_mutex_t lock;
-   unsigned int first_run;
-};
+#define hdb_handle_database qb_hdb
 
 static inline void hdb_database_lock (pthread_mutex_t *mutex)
 {
@@ -97,28 +78,18 @@ static inline void hdb_database_lock_destroy (pthread_mutex_t *mutex)
pthread_mutex_destroy (mutex);
 }
 
-#define DECLARE_HDB_DATABASE(database_name,destructor_function) \
-static struct hdb_handle_database (database_name) = {  \
-   .handle_count   = 0,\
-   .handles= NULL, \
-   .iterator   = 0,\
-   .destructor = destructor_function,  \
-   .first_run  = 1 \
-}; \
+#define DECLARE_HDB_DATABASE QB_HDB_DECLARE
 
 static inline void hdb_create (
struct hdb_handle_database *handle_database)
 {
-   memset (handle_database, 0, sizeof (struct hdb_handle_database));
-   hdb_database_lock_init (&handle_database->lock);
+   qb_hdb_create (handle_database);
 }
 
 static inline void hdb_destroy (
struct hdb_handle_database *handle_database)
 {
-   free (handle_database->handles);
-   hdb_database_lock_destroy (&handle_database->lock);
-   memset (handle_database, 0, sizeof (struct hdb_handle_database));
+   qb_hdb_destroy (handle_database);
 }
 
 
@@ -127,72 +98,8 @@ static inline int hdb_handle_create (
int instance_size,
hdb_handle_t *handle_id_out)
 {
-   int handle;
-   unsigned int check;
-   void *new_handles;
-   int found = 0;
-   void *instance;
-   int i;
-
-   if (handle_database->first_run == 1) {
-   handle_database->first_run = 0;
-   hdb_database_lock_init (&handle_database->lock);
-   }
-   hdb_database_lock (&handle_database->lock);
-
-   for (handle = 0; handle < handle_database->handle_count; handle++) {
-   if (handle_database->handles[handle].state == HDB_HANDLE_STATE_EMPTY) {
-   found = 1;
-   break;
-   }
-   }
-
-   if (found == 0) {
-   handle_database->handle_count += 1;
-   new_handles = (struct hdb_handle *)realloc (handle_database->handles,
-   sizeof (struct hdb_handle) * handle_database->handle_count);
-   if (new_handles == NULL) {
-   hdb_database_unlock (&handle_database->lock);
-   errno = ENOMEM;
-   return (-1);
-   }
-   handle_database->handles = new_handles;
-   }
-
-   instance = (void *)malloc (instance_size);
-   if (instance == 0) {
-   errno = ENOMEM;
-   return (-1);
-   }
-
-   /*
-* This code makes sure the random number isn't zero
-* We use 0 to specify an invalid handle out of the 1^64 address space
-* If we get 0 200 times in a row, the RNG may be broken
-*/
-   for (i = 0; i < 200; i++) {
-   check = random();
-
-   if (check != 0 && check != 0x) {
-   break;
-   }
-   }
-
-   memset (instance, 0, instance_size);
-
-   handle_database->handles[handle].state = HDB_HANDLE_STATE_ACTIVE;
-
-   handle_database->handles[handle].instance = instance;
-
-   handle_database->handles[handle].ref_count = 1;
-
-   handle_database->handles[handle].check = check;
-
-   *handle_id_out = (((unsigne

[Openais] [PATCH] Add totempg_threaded_mode_enable() api

2011-08-22 Thread Steven Dake

This API allows totem to operate as a multithreaded library.  Performance is
better without threads but some library users may only have multithreaded
systems.  In the corosync case where we have removed threads, this reduces
cpu utilization by ~10% by removing about 50% of the mutex lock and unlock calls
that occur during typical operation.  Since the latest corosync is nearly
thread free, there is no need for mutex operations.
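
To make the description above concrete, here is a minimal sketch of the
"lock only when threaded mode is enabled" pattern.  It is not the patch's
cs_queue code: counter_queue and the cq_* functions are hypothetical names,
and the queue only counts items instead of storing them.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical queue that tracks only an item count. */
struct counter_queue {
	int used;
	int size;
	int threaded_mode_enabled;
	pthread_mutex_t mutex;
};

/* Take the mutex only if the owner asked for threaded mode. */
static void cq_lock(struct counter_queue *q)
{
	if (q->threaded_mode_enabled) {
		pthread_mutex_lock(&q->mutex);
	}
}

static void cq_unlock(struct counter_queue *q)
{
	if (q->threaded_mode_enabled) {
		pthread_mutex_unlock(&q->mutex);
	}
}

/* Returns 0 on success, or the pthread_mutex_init error code. */
static int cq_init(struct counter_queue *q, int size, int threaded_mode_enabled)
{
	assert(q != NULL && size > 0);
	q->used = 0;
	q->size = size;
	q->threaded_mode_enabled = threaded_mode_enabled;
	if (threaded_mode_enabled) {
		return pthread_mutex_init(&q->mutex, NULL);
	}
	return 0;
}

/* Returns 0 if an item was counted, -1 if the queue is already full. */
static int cq_add(struct counter_queue *q)
{
	int res = -1;

	cq_lock(q);
	if (q->used < q->size) {
		q->used++;
		res = 0;
	}
	cq_unlock(q);
	return res;
}

static int cq_is_full(struct counter_queue *q)
{
	int full;

	cq_lock(q);
	full = (q->used == q->size);
	cq_unlock(q);
	return full;
}

static void cq_free(struct counter_queue *q)
{
	if (q->threaded_mode_enabled) {
		pthread_mutex_destroy(&q->mutex);
	}
}
```

In this sketch the flag is fixed at init time; switching it while other
threads are using the queue would not be safe.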

Signed-off-by: Steven Dake 
---
 exec/cs_queue.h |  122 
 exec/totemmrp.c |9 
 exec/totemmrp.h |2 +
 exec/totempg.c  |  140 +-
 exec/totemsrp.c |   13 -
 exec/totemsrp.h |3 +
 6 files changed, 223 insertions(+), 66 deletions(-)

diff --git a/exec/cs_queue.h b/exec/cs_queue.h
index 1e8439f..2e31c0f 100644
--- a/exec/cs_queue.h
+++ b/exec/cs_queue.h
@@ -50,58 +50,76 @@ struct cs_queue {
int size_per_item;
int iterator;
pthread_mutex_t mutex;
+   int threaded_mode_enabled;
 };
 
-static inline int cs_queue_init (struct cs_queue *cs_queue, int cs_queue_items, int size_per_item) {
+static inline int cs_queue_init (struct cs_queue *cs_queue, int cs_queue_items, int size_per_item, int threaded_mode_enabled) {
cs_queue->head = 0;
cs_queue->tail = cs_queue_items - 1;
cs_queue->used = 0;
cs_queue->usedhw = 0;
cs_queue->size = cs_queue_items;
cs_queue->size_per_item = size_per_item;
+   cs_queue->threaded_mode_enabled = threaded_mode_enabled;
 
cs_queue->items = malloc (cs_queue_items * size_per_item);
if (cs_queue->items == 0) {
return (-ENOMEM);
}
memset (cs_queue->items, 0, cs_queue_items * size_per_item);
-   pthread_mutex_init (&cs_queue->mutex, NULL);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_init (&cs_queue->mutex, NULL);
+   }
return (0);
 }
 
 static inline int cs_queue_reinit (struct cs_queue *cs_queue)
 {
-   pthread_mutex_lock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_lock (&cs_queue->mutex);
+   }
cs_queue->head = 0;
cs_queue->tail = cs_queue->size - 1;
cs_queue->used = 0;
cs_queue->usedhw = 0;
 
memset (cs_queue->items, 0, cs_queue->size * cs_queue->size_per_item);
-   pthread_mutex_unlock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_unlock (&cs_queue->mutex);
+   }
return (0);
 }
 
 static inline void cs_queue_free (struct cs_queue *cs_queue) {
-   pthread_mutex_destroy (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_destroy (&cs_queue->mutex);
+   }
free (cs_queue->items);
 }
 
 static inline int cs_queue_is_full (struct cs_queue *cs_queue) {
int full;
 
-   pthread_mutex_lock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_lock (&cs_queue->mutex);
+   }
full = ((cs_queue->size - 1) == cs_queue->used);
-   pthread_mutex_unlock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_unlock (&cs_queue->mutex);
+   }
return (full);
 }
 
 static inline int cs_queue_is_empty (struct cs_queue *cs_queue) {
int empty;
 
-   pthread_mutex_lock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_lock (&cs_queue->mutex);
+   }
empty = (cs_queue->used == 0);
-   pthread_mutex_unlock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_unlock (&cs_queue->mutex);
+   }
return (empty);
 }
 
@@ -110,7 +128,9 @@ static inline void cs_queue_item_add (struct cs_queue *cs_queue, void *item)
char *cs_queue_item;
int cs_queue_position;
 
-   pthread_mutex_lock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_lock (&cs_queue->mutex);
+   }
cs_queue_position = cs_queue->head;
cs_queue_item = cs_queue->items;
cs_queue_item += cs_queue_position * cs_queue->size_per_item;
@@ -123,7 +143,9 @@ static inline void cs_queue_item_add (struct cs_queue *cs_queue, void *item)
if (cs_queue->used > cs_queue->usedhw) {
cs_queue->usedhw = cs_queue->used;
}
-   pthread_mutex_unlock (&cs_queue->mutex);
+   if (cs_queue->threaded_mode_enabled) {
+   pthread_mutex_unlock (&cs_queue->mutex);
+   }
 }
 
 static inline void *cs_q

[Openais] [PATCH] Move cs_queue.h from include directory to exec directory

2011-08-22 Thread Steven Dake

This file is only used by totemsrp.c.  Move out of general include
directory.

Signed-off-by: Steven Dake 
---
 exec/Makefile.am|2 +-
 exec/cs_queue.h |  229 +++
 exec/totemsrp.c |2 +-
 include/Makefile.am |2 +-
 include/corosync/cs_queue.h |  227 --
 5 files changed, 232 insertions(+), 230 deletions(-)
 create mode 100644 exec/cs_queue.h
 delete mode 100644 include/corosync/cs_queue.h

diff --git a/exec/Makefile.am b/exec/Makefile.am
index 49e9f5a..8514afa 100644
--- a/exec/Makefile.am
+++ b/exec/Makefile.am
@@ -37,7 +37,7 @@ INCLUDES  = -I$(top_builddir)/include -I$(top_srcdir)/include $(nss_CFLAGS) $(rd
 
 TOTEM_SRC  = totemip.c totemnet.c totemudp.c \
  totemudpu.c totemrrp.c totemsrp.c totemmrp.c \
- totempg.c crypto.c
+ totempg.c crypto.c cs_queue.h
 if BUILD_RDMA
 TOTEM_SRC  += totemiba.c
 endif
diff --git a/exec/cs_queue.h b/exec/cs_queue.h
new file mode 100644
index 000..1e8439f
--- /dev/null
+++ b/exec/cs_queue.h
@@ -0,0 +1,229 @@
+/*
+ * Copyright (c) 2002-2004 MontaVista Software, Inc.
+ *
+ * All rights reserved.
+ *
+ * Author: Steven Dake (sd...@redhat.com)
+ *
+ * This software licensed under BSD license, the text of which follows:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * - Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions and the following disclaimer.
+ * - Redistributions in binary form must reproduce the above copyright notice,
+ *   this list of conditions and the following disclaimer in the documentation
+ *   and/or other materials provided with the distribution.
+ * - Neither the name of the MontaVista Software, Inc. nor the names of its
+ *   contributors may be used to endorse or promote products derived from this
+ *   software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+ * THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#ifndef CS_QUEUE_H_DEFINED
+#define CS_QUEUE_H_DEFINED
+
+#include 
+#include 
+#include 
+#include 
+#include "assert.h"
+
+struct cs_queue {
+   int head;
+   int tail;
+   int used;
+   int usedhw;
+   int size;
+   void *items;
+   int size_per_item;
+   int iterator;
+   pthread_mutex_t mutex;
+};
+
+static inline int cs_queue_init (struct cs_queue *cs_queue, int cs_queue_items, int size_per_item) {
+   cs_queue->head = 0;
+   cs_queue->tail = cs_queue_items - 1;
+   cs_queue->used = 0;
+   cs_queue->usedhw = 0;
+   cs_queue->size = cs_queue_items;
+   cs_queue->size_per_item = size_per_item;
+
+   cs_queue->items = malloc (cs_queue_items * size_per_item);
+   if (cs_queue->items == 0) {
+   return (-ENOMEM);
+   }
+   memset (cs_queue->items, 0, cs_queue_items * size_per_item);
+   pthread_mutex_init (&cs_queue->mutex, NULL);
+   return (0);
+}
+
+static inline int cs_queue_reinit (struct cs_queue *cs_queue)
+{
+   pthread_mutex_lock (&cs_queue->mutex);
+   cs_queue->head = 0;
+   cs_queue->tail = cs_queue->size - 1;
+   cs_queue->used = 0;
+   cs_queue->usedhw = 0;
+
+   memset (cs_queue->items, 0, cs_queue->size * cs_queue->size_per_item);
+   pthread_mutex_unlock (&cs_queue->mutex);
+   return (0);
+}
+
+static inline void cs_queue_free (struct cs_queue *cs_queue) {
+   pthread_mutex_destroy (&cs_queue->mutex);
+   free (cs_queue->items);
+}
+
+static inline int cs_queue_is_full (struct cs_queue *cs_queue) {
+   int full;
+
+   pthread_mutex_lock (&cs_queue->mutex);
+   full = ((cs_queue->size - 1) == cs_queue->used);
+   pthread_mutex_unlock (&cs_queue->mutex);
+   return (full);
+}
+
+static inline int cs_queue_is_empty (struct cs_queue *cs_queue) {
+   int empty;
+
+   pthread_mutex_lock (&cs_queue->mut

[Openais] [PATCH] use va version of external log function

2011-08-22 Thread Steven Dake

This removes a sprintf operation in the totem and ipc logging operations
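
For readers unfamiliar with the pattern, the following is a small sketch of
a variadic front end that hands its va_list straight to a "_va" style back
end instead of formatting into a temporary buffer first.  The functions
sink_log_va and front_log are hypothetical; they stand in for the roles
played in the patch by the libqb call and the corosync wrapper, and are not
those functions.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Destination buffer standing in for a real log target. */
static char sink[256];

/* Hypothetical back end that accepts an already started va_list. */
static int sink_log_va(const char *format, va_list ap)
{
	return vsnprintf(sink, sizeof(sink), format, ap);
}

/*
 * Variadic front end: the argument list is forwarded as is, so the
 * message is formatted exactly once, by the back end.
 */
static int front_log(const char *format, ...)
{
	va_list ap;
	int len;

	assert(format != NULL);
	va_start(ap, format);
	len = sink_log_va(format, ap);
	va_end(ap);
	return len;
}
```

Note that after the back end has consumed the va_list, the front end may
only call va_end on it; reusing it would require va_copy beforehand.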

Signed-off-by: Steven Dake 
---
 exec/main.c |   13 +++--
 1 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/exec/main.c b/exec/main.c
index a9a3e3e..fde77da 100644
--- a/exec/main.c
+++ b/exec/main.c
@@ -1055,17 +1055,10 @@ _logsys_log_printf(int level, int subsys,
size_t len;
 
va_start(ap, format);
-   len = vsnprintf(buf, sizeof(buf), format, ap);
-   va_end(ap);
-
-   if (buf[len - 1] == '\n') {
-   buf[len - 1] = '\0';
-   len -= 1;
-   }
-
-   qb_log_from_external_source(function_name, file_name,
+   qb_log_from_external_source_va(function_name, file_name,
format, level, file_line,
-   subsys, buf);
+   subsys, ap);
+   va_end(ap);
 }
 
 static void fplay_key_change_notify_fn (
-- 
1.7.6


Re: [Openais] aisexec crashes with SIGABRT

2011-08-22 Thread Steven Dake

On 08/22/2011 09:58 AM, Christopher A. Kirke wrote:
> currently using the RHEL5-provided package on two nodes on local
> Cisco-switched LAN:
> 
> openais.x86_64   0.80.6-28.el5_6.1
> installed
> 
> with following configuration:
> 
> # Please read the openais.conf.5 manual page
> 
> aisexec {
> user: asterisk
> group: asterisk
> }
> 
> totem {
> version: 2
> secauth: off
> threads: 0
> interface {
> ringnumber: 0
> bindnetaddr: 172.24.100.0
> mcastaddr: 239.255.4.1
> mcastport: 5405
> }
> }
> 
> logging {
> debug: off
> syslog_facility: local1
> syslog_priority: info
> timestamp: off
> to_file: no
> to_syslog: yes
> }
> 
> amf {
> mode: disabled
> }
> 
> to enable Asterisk distributed device state.
> 
> we see cases where aisexec crashes, both with Asterisk running and
> stopped - strace output below:
> 
>  0.73 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}], 3, 10) = 0 (Timeout) <0.009994>
>  0.010031 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}], 3, 237) = 1 ([{fd=1, revents=POLLIN}]) <0.07>
>  0.37 recvmsg(1, {msg_name(16)={sa_family=AF_INET,
> sin_port=htons(5149), sin_addr=inet_addr("172.24.100.10")},
> msg_iov(1)=[{"\2\0\"\377\254\30d\n\254\30d\n\2\0\254\30d\n\10\0\2\0\254\30d\n\10\0\4\0\0\0"...,
> 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 82
> <0.10>
>  0.72 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}], 3, 237) = 1 ([{fd=3, revents=POLLIN}]) <0.180257>
>  0.180339 recvmsg(3, {msg_name(16)={sa_family=AF_INET,
> sin_port=htons(5149), sin_addr=inet_addr("172.24.100.10")},
> msg_iov(1)=[{"\0\0\"\377\254\30d\fN\1\0\0004/\0\0N\1\0\0\0\0\0\0\254\30d\n\2\0\254\30"...,
> 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 70
> <0.22>
>  0.000104 sendmsg(2, {msg_name(16)={sa_family=AF_INET,
> sin_port=htons(5405), sin_addr=inet_addr("172.24.100.10")},
> msg_iov(1)=[{"\0\0\"\377\254\30d\fN\1\0\0005/\0\0N\1\0\0\0\0\0\0\254\30d\n\2\0\254\30"...,
> 70}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 70 <0.14>
>  0.72 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}], 3, 209) = 1 ([{fd=4, revents=POLLIN}]) <0.037614>
>  0.037682 accept(4, {sa_family=AF_FILE, path=@""}, [4294967298]) = 5
> <0.23>
>  0.96 fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.12>
>  0.40 setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 <0.06>
>  0.70 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}, {fd=5, events=POLLIN|POLLNVAL}], 4, 172) = 1
> ([{fd=5, revents=POLLIN}]) <0.000158>
>  0.000207 setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 <0.06>
>  0.29 recvmsg(5, {msg_name(0)=NULL,
> msg_iov(1)=[{"\1\0\0\0\252*\0\0\224\343TE\0\0\0\0\270\355\362+\0\0\0\0",
> 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
> cmsg_type=SCM_CREDENTIALS{pid=819, uid=301, gid=301}}, msg_flags=0},
> MSG_NOSIGNAL) = 24 <0.14>
>  0.84 setsockopt(5, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0 <0.06>
>  0.28 sendto(5, "\1\0\0\0\0\0\0\0", 8, MSG_WAITALL, NULL, 0) = 8
> <0.07>
>  0.31 shmget(0x4554e394, 308, 0600) = 5144599 <0.000118>
>  0.000198 shmat(5144599, 0, 0)  = ? <0.002286>
>  0.002332 semget(0x2bf2edb8, 3, 0600) = 1081360 <0.000108>
>  0.000155 mmap(NULL, 20, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x41781000 <0.000212>
>  0.000262 mprotect(0x41781000, 4096, PROT_NONE) = 0 <0.20>
>  0.51 clone(child_stack=0x417b0f90,
> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
> parent_tidptr=0x417b1710, tls=0x417b1680, child_tidptr=0x417b1710) = 859
> <0.46>
>  0.000109 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}, {fd=5, events=POLLIN|POLLNVAL}], 4, 168) = 1
> ([{fd=4, revents=POLLIN}]) <0.000924>
>  0.000980 accept(4, {sa_family=AF_FILE, path=@""}, [4294967298]) = 6
> <0.11>
>  0.63 fcntl(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.06>
>  0.49 setsockopt(6, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 <0.06>
>  0.33 poll([{fd=1, events=POLLIN}, {fd=3, events=POLLIN}, {fd=4,
> events=POLLIN|POLLNVAL}, {fd=5, events=POLLIN|POLLNVAL}, {fd=6,
> events=POLLIN|POLLNVAL}], 5, 167) = 1 ([{fd=6, revents=POLLIN}]) <0.07>
>  0.44 setsockopt(6, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 <0.06>
>  0.42 recvmsg(6, {msg_name(0)=NULL,
> msg_iov(1)=[{"\4\0\0\0\252*\0\0Rv-b\0\0\0\0A\246\10B\0\0\0\0", 24}],
> msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
> cmsg_type=SCM_CREDENTIALS{pid=819, uid=301, gid=301}}, msg_flags=0},
> MSG_NOSIGNAL) = 24 <0.10>
>  0.72 setsockopt(6, SOL_SOCKET, SO_P

Re: [Openais] Problems forming cluster on corosync startup

2011-08-15 Thread Steven Dake

On 08/14/2011 01:34 PM, Tim Beale wrote:
> Hi Steve,
> 
> I repeated the test with fail_recv_const=5000. I can see the CPG
> client hung for ~4 minutes without dispatching any CPG events (i.e.
> node join). Unfortunately, one of our healthchecking mechanisms kicked
> in at this point, detected the CPG client as locked up and rebooted
> the units.
> 
> It definitely rules out #2. I can repeat the test with healthchecking
> disabled to narrow down if #1 or #3 will occur.
> 
> Regards,
> Tim
> 
> On Thu, Aug 11, 2011 at 4:21 AM, Steven Dake  wrote:
>> On 08/09/2011 09:56 PM, Tim Beale wrote:
>>> Hi Steve,
>>>
>>> Thanks for your patch.
>>>
>>> 1. I don't see the initial CLM leave events. But I still see the FAILED TO
>>> RECEIVE hit on node-3. A couple of nodes don't enter operational on ring 20,
>>> then after the ring next reforms (ring 24), the FAILED TO RECEIVE happens.
>>> Attached is the latest debug.
>>>
>>
>> Keep in mind there are two problems here - (1) clm membership is wrong
>> and (2) fail to recv problem.  They are independent issues.
>>
>> I definitely want to look into this failed to receive issue.  Can you
>> try changing "fail_recv_const" on all the nodes to some large value,
>> such as 5000?
>>
>> One of 3 things should happen:
>> 1. the protocol blocks forever
>> 2. the protocol enters operational after some short period
>> 3. fail to recv is printed after a long period of time (1-10 minutes).
>>
>> Please report back which one happens with this tuning.
>>

Given that #1 or #3 is basically what is occurring, I would love to have
a blackbox a few seconds after config time and then a couple of minutes in.
Apparently something is wrong with the recovery in this test case.

Regards
-steve
>>
>>> I think the problem is some nodes end up missing a message/sequence-number,
>>> although I'm not sure exactly why. E.g. the token sequence starts off at one
>>> when they enter operational, but not all nodes receive this.
>>> 2011 Aug  9 10:07:18 daemon.debug node-3 corosync[1575]:   [TOTEM ]
>>> totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
>>>
>>> The nodes that were still in recovery will be using different values for
>>> old_ring_state_high_seq_received and my_old_ring_id. It seems these nodes
>>> receive msg seq #1, but the others don't and hit the FAILED TO RECEIVE.
>>>
>>> The debug attached has your first memb-list patch popped off, but I've
>>> seen the same problem happen with it applied too.
>>>
>>> 2. Note that I don't see any CLM leave events at all now, even though
>>> after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its
>>> ring. I think this is due to the logic:
>>>  diff = my_new_memb_list - my_memb_list
>>
>> This isn't how the difference operation works.  It produces a list of
>> nodes that are not in both my_new_memb_list and my_memb_list; therefore,
>> the current "and" logic should be correct.  I wrote the patch at 2am and
>> was quite tired, so I'll double-check that it is correct.
>>
>> Regards
>> -steve
>>
>>> The diff doesn't include any nodes that are in my_memb_list but not in
>>> my_new_memb_list, i.e. left nodes. I guess you could get all the differences
>>> by doing the following:
>>>  memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>>>  memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>>>  memb_set_and( diff1, diff2, diff )
>>>
>>> Thanks,
>>> Tim
>>>
>>> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake  wrote:
>>>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>>>> Hi Steve,
>>>>>
>>>>> Thanks for your help. I tried out your patch but the problem still
>>>>> occurs. The problem looks to me due to the ring-IDs used when forming
>>>>> the transitional memb-list, rather than with the memb-list itself. The
>>>>> ring-ID of the nodes still in Recovery is older than the rest of the
>>>>> nodes who have already shifted to Operational.
>>>>>
>>>>> Attached is my attempt at fixing the problem. The idea is to delay the
>>>>> nodes processing a Memb-Join immediately after shifting to
>>>>> Operational, until the token has rotated the ring once.
>>>>>
>>>>> It doesn't quite work either though. The nodes are still r

Re: [Openais] Problems forming cluster on corosync startup

2011-08-10 Thread Steven Dake

On 08/09/2011 09:56 PM, Tim Beale wrote:
> Hi Steve,
> 
> Thanks for your patch.
> 
> 1. I don't see the initial CLM leave events. But I still see the FAILED TO
> RECEIVE hit on node-3. A couple of nodes don't enter operational on ring 20,
> then after the ring next reforms (ring 24), the FAILED TO RECEIVE happens.
> Attached is the latest debug.
> 

Keep in mind there are two problems here - (1) clm membership is wrong
and (2) fail to recv problem.  They are independent issues.

I definitely want to look into this failed to receive issue.  Can you
try changing "fail_recv_const" on all the nodes to some large value,
such as 5000?

One of 3 things should happen:
1. the protocol blocks forever
2. the protocol enters operational after some short period
3. fail to recv is printed after a long period of time (1-10 minutes).

Please report back which one happens with this tuning.
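
For reference, a sketch of where this tunable would go, assuming it is set
in the totem section of the configuration file as described in
corosync.conf(5).  Only the relevant line is shown; the rest of the
section is assumed to stay as already configured on each node.

```
totem {
	version: 2
	# Large value for this test only, as suggested above
	fail_recv_const: 5000
}
```

The same value would need to be set on every node, and the daemon
restarted, for the test to be meaningful.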


> I think the problem is some nodes end up missing a message/sequence-number,
> although I'm not sure exactly why. E.g. the token sequence starts off at one
> when they enter operational, but not all nodes receive this.
> 2011 Aug  9 10:07:18 daemon.debug node-3 corosync[1575]:   [TOTEM ]
> totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
> 
> The nodes that were still in recovery will be using different values for
> old_ring_state_high_seq_received and my_old_ring_id. It seems these nodes
> receive msg seq #1, but the others don't and hit the FAILED TO RECEIVE.
> 
> The debug attached has your first memb-list patch popped off, but I've
> seen the same problem happen with it applied too.
> 
> 2. Note that I don't see any CLM leave events at all now, even though
> after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its
> ring. I think this is due to the logic:
>  diff = my_new_memb_list - my_memb_list

This isn't how the difference operation works.  It produces a list of
nodes that are not in both my_new_memb_list and my_memb_list; therefore,
the current "and" logic should be correct.  I wrote the patch at 2am and
was quite tired, so I'll double-check that it is correct.

Regards
-steve
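
As a neutral illustration of the joined/left discussion in this thread,
here is a sketch of a plain one-directional set subtraction over node ids,
applied in both directions.  It is not the totemsrp.c memb_set_subtract
implementation and makes no claim about how that function behaves;
nodeid_set_subtract and the plain unsigned int ids are assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: writes to out every id that is in one but not in
 * two, preserving the order of one.  Returns the number of ids written.
 * out must have room for one_entries ids.
 */
static size_t nodeid_set_subtract(
	unsigned int *out,
	const unsigned int *one, size_t one_entries,
	const unsigned int *two, size_t two_entries)
{
	size_t i;
	size_t j;
	size_t out_entries = 0;

	assert(out != NULL && one != NULL && two != NULL);
	for (i = 0; i < one_entries; i++) {
		int found = 0;

		for (j = 0; j < two_entries; j++) {
			if (one[i] == two[j]) {
				found = 1;
				break;
			}
		}
		if (!found) {
			out[out_entries++] = one[i];
		}
	}
	return out_entries;
}
```

With an old membership of {1, 2, 3} and a new membership of {2, 3, 4, 5},
subtracting old from new yields the joined nodes {4, 5}, and subtracting
new from old yields the left node {1}.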

> The diff doesn't include any nodes that are in my_memb_list but not in
> my_new_memb_list, i.e. left nodes. I guess you could get all the differences
> by doing the following:
>  memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>  memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>  memb_set_and( diff1, diff2, diff )
> 
> Thanks,
> Tim
> 
> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake  wrote:
>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>> Hi Steve,
>>>
>>> Thanks for your help. I tried out your patch but the problem still
>>> occurs. The problem looks to me due to the ring-IDs used when forming
>>> the transitional memb-list, rather than with the memb-list itself. The
>>> ring-ID of the nodes still in Recovery is older than the rest of the
>>> nodes who have already shifted to Operational.
>>>
>>> Attached is my attempt at fixing the problem. The idea is to delay the
>>> nodes processing a Memb-Join immediately after shifting to
>>> Operational, until the token has rotated the ring once.
>>>
>>> It doesn't quite work either though. The nodes are still re-entering
>>> gather before all have left recovery. This time it's due to processing
>>> a Merge-Detect message. One node has just started up and set itself to
>>> the rep, and sends out a Merge-Detect which triggers the other nodes
>>> to enter gather and reform the ring.
>>>
>>> Let me know if you have any other advice.
>>>
>>
>> The problem is clear from the blackbox - 8 nodes enter operational while
>> 1 in recovery is interrupted by a join message.  This interrupted node
>> then proceeds with a transitional membership of 1 node (which is correct).
>>
>> The joined and left lists use the transitional list to determine their
>> contents, which is not correct.  This results in incorrect data
>> delivered to clm.  Try the follow-up patch which should correctly
>> calculate the joined and left lists.
>>
>>
>>> Thanks,
>>> Tim
>>>
>>> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake  wrote:
>>>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>>>> Hi,
>>>>>
>>>>> It looks to me that the way the transition from Recovery to Operational
>>>>> works, we can't guarantee that all nodes in the ring have entered
>>>>> Operational before a node processes another Memb-Joi

[Openais] [PATCH 3/4] properly define recv_token_cq_send_event_fn

2011-08-09 Thread Steven Dake

Signed-off-by: Steven Dake 
---
 exec/totemiba.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/exec/totemiba.c b/exec/totemiba.c
index 008018a..ffcfceb 100644
--- a/exec/totemiba.c
+++ b/exec/totemiba.c
@@ -562,7 +562,10 @@ static int mcast_rdma_event_fn (int events,  int suck,  void *context)
return (0);
 }
 
-static int recv_token_cq_send_event_fn (hdb_handle_t poll_handle,  int events,  int suck,  void *context)
+static int recv_token_cq_send_event_fn (
+   int fd,
+   int revents,
+   void *context)
 {
struct totemiba_instance *instance = (struct totemiba_instance *)context;
struct ibv_wc wc[32];
-- 
1.7.6

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

[Openais] [PATCH 4/4] Remove -lcoroipcc from tools/Makefile.am notifyd

2011-08-09 Thread Steven Dake

Signed-off-by: Steven Dake 
---
 tools/Makefile.am |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/Makefile.am b/tools/Makefile.am
index f88e741..2699519 100644
--- a/tools/Makefile.am
+++ b/tools/Makefile.am
@@ -55,7 +55,7 @@ corosync_quorumtool_LDADD = -lconfdb -lcfg -lquorum \
-lvotequorum ../lcr/liblcr.a $(LIBQB_LIBS)
 corosync_quorumtool_LDFLAGS = -L../lib
 
-corosync_notifyd_LDADD = -lcfg -lconfdb ../lcr/liblcr.a -lcoroipcc \
+corosync_notifyd_LDADD = -lcfg -lconfdb ../lcr/liblcr.a \
   $(LIBQB_LIBS) $(DBUS_LIBS) $(SNMPLIBS) \
   -lquorum
 corosync_notifyd_LDFLAGS = -L../lib
-- 
1.7.6

[Openais] [PATCH 2/4] Define totemiba_log_printf properly

2011-08-09 Thread Steven Dake

Signed-off-by: Steven Dake 
---
 exec/totemiba.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/exec/totemiba.c b/exec/totemiba.c
index a16f88a..008018a 100644
--- a/exec/totemiba.c
+++ b/exec/totemiba.c
@@ -187,13 +187,15 @@ struct totemiba_instance {
 
struct ibv_cq *send_token_recv_cq;
 
-   void (*totemiba_log_printf) (
-   unsigned int rec_ident,
+void (*totemiba_log_printf) (
+   int level,
+   int subsys,
const char *function,
const char *file,
int line,
const char *format,
-   ...)__attribute__((format(printf, 5, 6)));
+   ...)__attribute__((format(printf, 6, 7)));
+
 
int totemiba_subsys_id;
 
-- 
1.7.6
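
The index change in the format attribute follows from the argument count: with GCC-style attributes, the first number is the position of the format string and the second is the position of the first variadic argument. A minimal sketch of the same layout follows; demo_log_printf and last_message are hypothetical names used only for illustration and are not part of totemiba.c.

```c
#include <stdarg.h>
#include <stdio.h>

/* Holds the most recently formatted message so callers can inspect it. */
static char last_message[256];

/*
 * Hypothetical logger with the same parameter order as the patched
 * callback: five fixed arguments, then the format string (argument 6),
 * then the variadic arguments (starting at argument 7).
 */
__attribute__((format(printf, 6, 7)))
static void demo_log_printf(
	int level,
	int subsys,
	const char *function,
	const char *file,
	int line,
	const char *format,
	...)
{
	va_list ap;
	int used;

	/* Write a prefix built from the fixed arguments. */
	used = snprintf(last_message, sizeof(last_message),
		"[%d/%d] %s (%s:%d): ", level, subsys, function, file, line);
	if (used < 0 || (size_t)used >= sizeof(last_message)) {
		return;
	}

	/* Append the caller's formatted text after the prefix. */
	va_start(ap, format);
	vsnprintf(last_message + used, sizeof(last_message) - (size_t)used,
		format, ap);
	va_end(ap);
}
```

With the attribute in place, the compiler can check a call such as demo_log_printf(3, 1, "main", "demo.c", 42, "node %d left", 7) against its format string. With the old indices (5, 6) it would have treated the int line argument as the format string.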

[Openais] [PATCH 1/4] Fix problem in totemiba where incorrect define is used (and also not defined)

2011-08-09 Thread Steven Dake

Signed-off-by: Steven Dake 
---
 exec/totemiba.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/exec/totemiba.c b/exec/totemiba.c
index 2d8c690..a16f88a 100644
--- a/exec/totemiba.c
+++ b/exec/totemiba.c
@@ -70,6 +70,8 @@
 #include 
 #include 
 #include 
+
+#include 
 #include 
 #define LOGSYS_UTILS_ONLY 1
 #include 
@@ -1316,7 +1318,7 @@ int totemiba_initialize (
 
qb_loop_timer_add (instance->totemiba_poll_handle,
QB_LOOP_MED,
-   100*QB_TIME_NS_IN_NSEC,
+   100*QB_TIME_NS_IN_MSEC,
(void *)instance,
timer_function_netif_check_timeout,
&instance->timer_netif_check_timeout);
-- 
1.7.6
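
For reference, the second hunk is a unit fix: the timer duration is expressed in nanoseconds, so a 100 ms interval has to be scaled by the number of nanoseconds in a millisecond. A minimal sketch of that conversion follows; NS_PER_MSEC and msec_to_nsec are hypothetical stand-ins, and the value 1000000 is an assumption about what QB_TIME_NS_IN_MSEC expands to.

```c
#include <stdint.h>

/* Local stand-in for QB_TIME_NS_IN_MSEC; assumed to be 10^6. */
#define NS_PER_MSEC 1000000ULL

/* Convert a millisecond interval to the nanosecond count a timer expects. */
static uint64_t msec_to_nsec(uint64_t msec)
{
	return msec * NS_PER_MSEC;
}
```

Under that assumption, msec_to_nsec(100) yields 100000000, i.e. the 100 ms check interval the patch intends.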

Re: [Openais] [PATCH 2/2] cfg: Handle errors from totem_mcast

2011-08-09 Thread Steven Dake

On second consideration this patch is

Reviewed-by: Steven Dake 

On 08/08/2011 09:15 AM, Steven Dake wrote:
> Before accepting an IPC message, ipc checks that the totem queue has
> available room for new messages.  As a result this patch is either not
> necessary or fixes the wrong thing.
> 
> See coroipcs.c:697
> 
> send_ok = api->sending_allowed (conn_info->service,
> header->id,
> header,
> conn_info->sending_allowed_private_data);
> 
> 
> On 07/28/2011 07:20 AM, Jan Friesse wrote:
>> totem_mcast function can return -1 if corosync is overloaded. Sadly,
>> in many calls of this function the error code was either not handled
>> at all, or handled by assert.
>>
>> The commit changes the behaviour to either return CS_ERR_TRY_AGAIN or
>> pass the error code to later layers to handle it.
>>
>> Signed-off-by: Jan Friesse 
>> ---
>>  services/cfg.c |   77 
>> ++-
>>  1 files changed, 59 insertions(+), 18 deletions(-)
>>
>> diff --git a/services/cfg.c b/services/cfg.c
>> index b7aa63b..24f19f2 100644
>> --- a/services/cfg.c
>> +++ b/services/cfg.c
>> @@ -379,6 +379,7 @@ static int send_shutdown(void)
>>  {
>>  struct req_exec_cfg_shutdown req_exec_cfg_shutdown;
>>  struct iovec iovec;
>> +int result;
>>  
>>  ENTER();
>>  req_exec_cfg_shutdown.header.size =
>> @@ -389,10 +390,10 @@ static int send_shutdown(void)
>>  iovec.iov_base = (char *)&req_exec_cfg_shutdown;
>>  iovec.iov_len = sizeof (struct req_exec_cfg_shutdown);
>>  
>> -assert (api->totem_mcast (&iovec, 1, TOTEM_SAFE) == 0);
>> +result = api->totem_mcast (&iovec, 1, TOTEM_SAFE);
>>  
>>  LEAVE();
>> -return 0;
>> +return (result);
>>  }
>>  
>>  static void send_test_shutdown(void *only_conn, void *exclude_conn, int 
>> status)
>> @@ -426,6 +427,9 @@ static void send_test_shutdown(void *only_conn, void 
>> *exclude_conn, int status)
>>  
>>  static void check_shutdown_status(void)
>>  {
>> +int result;
>> +cs_error_t error = CS_OK;
>> +
>>  ENTER();
>>  
>>  /*
>> @@ -448,9 +452,17 @@ static void check_shutdown_status(void)
>>  shutdown_flags == CFG_SHUTDOWN_FLAG_REGARDLESS) {
>>  TRACE1("shutdown confirmed");
>>  
>> +/*
>> + * Tell other nodes we are going down
>> + */
>> +result = send_shutdown();
>> +if (result == -1) {
>> +error = CS_ERR_TRY_AGAIN;
>> +}
>> +
>>  res_lib_cfg_tryshutdown.header.size = sizeof(struct 
>> res_lib_cfg_tryshutdown);
>>  res_lib_cfg_tryshutdown.header.id = 
>> MESSAGE_RES_CFG_TRYSHUTDOWN;
>> -res_lib_cfg_tryshutdown.header.error = CS_OK;
>> +res_lib_cfg_tryshutdown.header.error = error;
>>  
>>  /*
>>   * Tell originator that shutdown was confirmed
>> @@ -459,10 +471,6 @@ static void check_shutdown_status(void)
>>  
>> sizeof(res_lib_cfg_tryshutdown));
>>  shutdown_con = NULL;
>>  
>> -/*
>> - * Tell other nodes we are going down
>> - */
>> -send_shutdown();
>>  
>>  }
>>  else {
>> @@ -698,7 +706,9 @@ static void message_handler_req_lib_cfg_ringreenable (
>>  const void *msg)
>>  {
>>  struct req_exec_cfg_ringreenable req_exec_cfg_ringreenable;
>> +struct res_lib_cfg_ringreenable res_lib_cfg_ringreenable;
>>  struct iovec iovec;
>> +int result;
>>  
>>  ENTER();
>>  req_exec_cfg_ringreenable.header.size =
>> @@ -711,7 +721,19 @@ static void message_handler_req_lib_cfg_ringreenable (
>>  iovec.iov_base = (char *)&req_exec_cfg_ringreenable;
>>  iovec.iov_len = sizeof (struct req_exec_cfg_ringreenable);
>>  
>> -assert (api->totem_mcast (&iovec, 1, TOTEM_SAFE) == 0);
>> +result = api->totem_mcast (&iovec, 1, TOTEM_SAFE);
>> +
>> +if (result == -1) {
>> +res_lib_cfg_ringreenable.header.id = 
>> MES

Re: [Openais] [PATCH 1/2] cpg: Handle errors from totem_mcast

2011-08-09 Thread Steven Dake

On second consideration this patch is

Reviewed-by: Steven Dake 

On 08/08/2011 09:11 AM, Steven Dake wrote:
> On 07/28/2011 07:20 AM, Jan Friesse wrote:
>> totem_mcast function can return -1 if corosync is overloaded. Sadly, in
>> many calls of this function the error code was either not handled at all,
>> or handled by assert.
>>
>> The commit changes the behaviour to either return CS_ERR_TRY_AGAIN or pass
>> the error code to later layers to handle it.
>>
>> Signed-off-by: Jan Friesse 
>> ---
>>  services/cpg.c |   31 ++-
>>  1 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/services/cpg.c b/services/cpg.c
>> index 6669fbd..18767bd 100644
>> --- a/services/cpg.c
>> +++ b/services/cpg.c
>> @@ -865,12 +865,19 @@ static void cpg_pd_finalize (struct cpg_pd *cpd)
>>  static int cpg_lib_exit_fn (void *conn)
>>  {
>>  struct cpg_pd *cpd = (struct cpg_pd *)api->ipc_private_data_get (conn);
>> +int result;
>>  
>>  log_printf(LOGSYS_LEVEL_DEBUG, "exit_fn for conn=%p\n", conn);
>>  
>>  if (cpd->group_name.length > 0) {
>> -cpg_node_joinleave_send (cpd->pid, &cpd->group_name,
>> +result = cpg_node_joinleave_send (cpd->pid, &cpd->group_name,
>>  MESSAGE_REQ_EXEC_CPG_PROCLEAVE, 
>> CONFCHG_CPG_REASON_PROCDOWN);
>> +if (result == -1) {
>> +/*
>> + * Call this function again later
>> + */
>> +return (result);
>> +}
>>  }
>>
> 
> this is correct
> 
>>  cpg_pd_finalize (cpd);
>> @@ -1289,6 +1296,7 @@ static void message_handler_req_lib_cpg_join (void 
>> *conn, const void *message)
>>  struct res_lib_cpg_join res_lib_cpg_join;
>>  cs_error_t error = CPG_OK;
>>  struct list_head *iter;
>> +int result;
>>  
>>  /* Test, if we don't have same pid and group name joined */
>>  for (iter = cpg_pd_list_head.next; iter != &cpg_pd_list_head; iter = 
>> iter->next) {
>> @@ -1327,9 +1335,15 @@ static void message_handler_req_lib_cpg_join (void 
>> *conn, const void *message)
>>  memcpy (&cpd->group_name, &req_lib_cpg_join->group_name,
>>  sizeof (cpd->group_name));
>>  
>> -cpg_node_joinleave_send (req_lib_cpg_join->pid,
>> +result = cpg_node_joinleave_send (req_lib_cpg_join->pid,
>>  &req_lib_cpg_join->group_name,
>>  MESSAGE_REQ_EXEC_CPG_PROCJOIN, CONFCHG_CPG_REASON_JOIN);
>> +
>> +if (result == -1) {
>> +error = CPG_ERR_TRY_AGAIN;
>> +cpd->cpd_state = CPD_STATE_UNJOINED;
>> +goto response_send;
>> +}
>>  break;
>>  case CPD_STATE_LEAVE_STARTED:
>>  error = CPG_ERR_BUSY;
> 
> The remainder of the patch is not.  The ipc layer ensures room is available
> in the totem queue to handle new totem messages.  If that part isn't
> working as expected (i.e. you see a failure in this part of the code) you
> should fix the totem pending queue rather than hack it here.
> 
>> @@ -1356,6 +1370,7 @@ static void message_handler_req_lib_cpg_leave (void 
>> *conn, const void *message)
>>  cs_error_t error = CPG_OK;
>>  struct req_lib_cpg_leave  *req_lib_cpg_leave = (struct 
>> req_lib_cpg_leave *)message;
>>  struct cpg_pd *cpd = (struct cpg_pd *)api->ipc_private_data_get (conn);
>> +int result;
>>  
>>  log_printf(LOGSYS_LEVEL_DEBUG, "got leave request on %p\n", conn);
>>  
>> @@ -1372,10 +1387,14 @@ static void message_handler_req_lib_cpg_leave (void 
>> *conn, const void *message)
>>  case CPD_STATE_JOIN_COMPLETED:
>>  error = CPG_OK;
>>  cpd->cpd_state = CPD_STATE_LEAVE_STARTED;
>> -cpg_node_joinleave_send (req_lib_cpg_leave->pid,
>> +result = cpg_node_joinleave_send (req_lib_cpg_leave->pid,
>>  &req_lib_cpg_leave->group_name,
>>  MESSAGE_REQ_EXEC_CPG_PROCLEAVE,
>>  CONFCHG_CPG_REASON_LEAVE);
>> +if (result == -1) {
>> +error = CPG_ERR_TRY_AGAIN;
>> +cpd->cpd_state = CPD_STATE_JOIN_COMPLETED;
>> +

Re: [Openais] [PATCH 2/2] cfg: Handle errors from totem_mcast

2011-08-08 Thread Steven Dake

Before accepting an IPC message, ipc checks that the totem queue has
available room for new messages.  As a result this patch is either not
necessary or fixes the wrong thing.

See coroipcs.c:697

send_ok = api->sending_allowed (conn_info->service,
header->id,
header,
conn_info->sending_allowed_private_data);
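
To illustrate the argument, here is a rough sketch of an admission check in front of a bounded queue. It is not the coroipcs code: struct pending_queue, demo_sending_allowed, demo_queue_mcast and demo_handle_request are hypothetical names, and the fixed capacity is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_CAPACITY 4

/* Hypothetical pending queue; only the fill level matters here. */
struct pending_queue {
	size_t used;
};

enum demo_result {
	DEMO_OK = 0,
	DEMO_TRY_AGAIN = 1,
	DEMO_QUEUE_FAILURE = 2
};

/* Admission check: is there room for 'needed' more messages? */
static bool demo_sending_allowed(const struct pending_queue *q, size_t needed)
{
	return needed <= DEMO_QUEUE_CAPACITY - q->used;
}

/* Low-level enqueue; returns -1 when the queue cannot take the messages. */
static int demo_queue_mcast(struct pending_queue *q, size_t count)
{
	if (count > DEMO_QUEUE_CAPACITY - q->used) {
		return -1;
	}
	q->used += count;
	return 0;
}

/*
 * Request handler: the request is refused up front when there is no
 * room, so the enqueue that follows is not expected to fail.
 */
static enum demo_result demo_handle_request(struct pending_queue *q, size_t count)
{
	if (!demo_sending_allowed(q, count)) {
		return DEMO_TRY_AGAIN;
	}
	if (demo_queue_mcast(q, count) == -1) {
		/* Only reachable if the admission check itself is broken. */
		return DEMO_QUEUE_FAILURE;
	}
	return DEMO_OK;
}
```

In this model a failure inside the handler points at the admission check or the queue accounting, which is why fixing the pending queue is preferable to handling the error at every call site.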


On 07/28/2011 07:20 AM, Jan Friesse wrote:
> totem_mcast function can return -1 if corosync is overloaded. Sadly,
> in many calls of this function the error code was either not handled
> at all, or handled by assert.
> 
> The commit changes the behaviour to either return CS_ERR_TRY_AGAIN or
> pass the error code to later layers to handle it.
> 
> Signed-off-by: Jan Friesse 
> ---
>  services/cfg.c |   77 ++-
>  1 files changed, 59 insertions(+), 18 deletions(-)
> 
> diff --git a/services/cfg.c b/services/cfg.c
> index b7aa63b..24f19f2 100644
> --- a/services/cfg.c
> +++ b/services/cfg.c
> @@ -379,6 +379,7 @@ static int send_shutdown(void)
>  {
>   struct req_exec_cfg_shutdown req_exec_cfg_shutdown;
>   struct iovec iovec;
> + int result;
>  
>   ENTER();
>   req_exec_cfg_shutdown.header.size =
> @@ -389,10 +390,10 @@ static int send_shutdown(void)
>   iovec.iov_base = (char *)&req_exec_cfg_shutdown;
>   iovec.iov_len = sizeof (struct req_exec_cfg_shutdown);
>  
> - assert (api->totem_mcast (&iovec, 1, TOTEM_SAFE) == 0);
> + result = api->totem_mcast (&iovec, 1, TOTEM_SAFE);
>  
>   LEAVE();
> - return 0;
> + return (result);
>  }
>  
>  static void send_test_shutdown(void *only_conn, void *exclude_conn, int 
> status)
> @@ -426,6 +427,9 @@ static void send_test_shutdown(void *only_conn, void 
> *exclude_conn, int status)
>  
>  static void check_shutdown_status(void)
>  {
> + int result;
> + cs_error_t error = CS_OK;
> +
>   ENTER();
>  
>   /*
> @@ -448,9 +452,17 @@ static void check_shutdown_status(void)
>   shutdown_flags == CFG_SHUTDOWN_FLAG_REGARDLESS) {
>   TRACE1("shutdown confirmed");
>  
> + /*
> +  * Tell other nodes we are going down
> +  */
> + result = send_shutdown();
> + if (result == -1) {
> + error = CS_ERR_TRY_AGAIN;
> + }
> +
>   res_lib_cfg_tryshutdown.header.size = sizeof(struct 
> res_lib_cfg_tryshutdown);
>   res_lib_cfg_tryshutdown.header.id = 
> MESSAGE_RES_CFG_TRYSHUTDOWN;
> - res_lib_cfg_tryshutdown.header.error = CS_OK;
> + res_lib_cfg_tryshutdown.header.error = error;
>  
>   /*
>* Tell originator that shutdown was confirmed
> @@ -459,10 +471,6 @@ static void check_shutdown_status(void)
>   
> sizeof(res_lib_cfg_tryshutdown));
>   shutdown_con = NULL;
>  
> - /*
> -  * Tell other nodes we are going down
> -  */
> - send_shutdown();
>  
>   }
>   else {
> @@ -698,7 +706,9 @@ static void message_handler_req_lib_cfg_ringreenable (
>   const void *msg)
>  {
>   struct req_exec_cfg_ringreenable req_exec_cfg_ringreenable;
> + struct res_lib_cfg_ringreenable res_lib_cfg_ringreenable;
>   struct iovec iovec;
> + int result;
>  
>   ENTER();
>   req_exec_cfg_ringreenable.header.size =
> @@ -711,7 +721,19 @@ static void message_handler_req_lib_cfg_ringreenable (
>   iovec.iov_base = (char *)&req_exec_cfg_ringreenable;
>   iovec.iov_len = sizeof (struct req_exec_cfg_ringreenable);
>  
> - assert (api->totem_mcast (&iovec, 1, TOTEM_SAFE) == 0);
> + result = api->totem_mcast (&iovec, 1, TOTEM_SAFE);
> +
> + if (result == -1) {
> + res_lib_cfg_ringreenable.header.id = 
> MESSAGE_RES_CFG_RINGREENABLE;
> + res_lib_cfg_ringreenable.header.size = sizeof (struct 
> res_lib_cfg_ringreenable);
> + res_lib_cfg_ringreenable.header.error = CS_ERR_TRY_AGAIN;
> + api->ipc_response_send (
> + conn,
> + &res_lib_cfg_ringreenable,
> + sizeof (struct res_lib_cfg_ringreenable));
> +
> + api->ipc_refcnt_dec(conn);
> + }
>  
>   LEAVE();
>  }
> @@ -836,6 +858,8 @@ static void message_handler_req_lib_cfg_killnode (
>   struct res_lib_cfg_killnode res_lib_cfg_killnode;
>   struct req_exec_cfg_killnode req_exec_cfg_killnode;
>   struct iovec iovec;
> + int result;
> + cs_error_t error = CS_OK;
>  
>   ENTER();
>   req_exec_cfg_killnode.header.size =
> @@ -848,11 +872,14 @@ static void message_handler_req_lib_cfg_killnode (
>   iovec.iov_base = (char

Re: [Openais] [PATCH 1/2] cpg: Handle errors from totem_mcast

2011-08-08 Thread Steven Dake

On 07/28/2011 07:20 AM, Jan Friesse wrote:
> totem_mcast function can return -1 if corosync is overloaded. Sadly, in
> many calls of this function the error code was either not handled at all,
> or handled by assert.
> 
> The commit changes the behaviour to either return CS_ERR_TRY_AGAIN or pass
> the error code to later layers to handle it.
> 
> Signed-off-by: Jan Friesse 
> ---
>  services/cpg.c |   31 ++-
>  1 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/services/cpg.c b/services/cpg.c
> index 6669fbd..18767bd 100644
> --- a/services/cpg.c
> +++ b/services/cpg.c
> @@ -865,12 +865,19 @@ static void cpg_pd_finalize (struct cpg_pd *cpd)
>  static int cpg_lib_exit_fn (void *conn)
>  {
>   struct cpg_pd *cpd = (struct cpg_pd *)api->ipc_private_data_get (conn);
> + int result;
>  
>   log_printf(LOGSYS_LEVEL_DEBUG, "exit_fn for conn=%p\n", conn);
>  
>   if (cpd->group_name.length > 0) {
> - cpg_node_joinleave_send (cpd->pid, &cpd->group_name,
> + result = cpg_node_joinleave_send (cpd->pid, &cpd->group_name,
>   MESSAGE_REQ_EXEC_CPG_PROCLEAVE, 
> CONFCHG_CPG_REASON_PROCDOWN);
> + if (result == -1) {
> + /*
> +  * Call this function again later
> +  */
> + return (result);
> + }
>   }
> 

this is correct
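
A simplified model of the hunk above, to show why returning -1 is safe there: the exit function leaves the connection untouched when the leave message cannot be sent, so it can simply be called again later. struct demo_conn, demo_send_leave and demo_lib_exit_fn are hypothetical names, and the queue_full flag stands in for an overloaded totem queue.

```c
/* Hypothetical per-connection state. */
struct demo_conn {
	int joined;	/* non-zero while the process is in a group */
	int finalized;	/* non-zero once private data has been released */
};

/* Stand-in for the join/leave send; fails while the queue is full. */
static int demo_send_leave(int queue_full)
{
	return queue_full ? -1 : 0;
}

/*
 * Exit handler: if the leave message cannot be sent, return -1 without
 * finalizing anything, so that a later call can retry from the same state.
 */
static int demo_lib_exit_fn(struct demo_conn *conn, int queue_full)
{
	if (conn->joined) {
		if (demo_send_leave(queue_full) == -1) {
			return -1;
		}
	}
	conn->finalized = 1;
	return 0;
}
```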

>   cpg_pd_finalize (cpd);
> @@ -1289,6 +1296,7 @@ static void message_handler_req_lib_cpg_join (void 
> *conn, const void *message)
>   struct res_lib_cpg_join res_lib_cpg_join;
>   cs_error_t error = CPG_OK;
>   struct list_head *iter;
> + int result;
>  
>   /* Test, if we don't have same pid and group name joined */
>   for (iter = cpg_pd_list_head.next; iter != &cpg_pd_list_head; iter = 
> iter->next) {
> @@ -1327,9 +1335,15 @@ static void message_handler_req_lib_cpg_join (void 
> *conn, const void *message)
>   memcpy (&cpd->group_name, &req_lib_cpg_join->group_name,
>   sizeof (cpd->group_name));
>  
> - cpg_node_joinleave_send (req_lib_cpg_join->pid,
> + result = cpg_node_joinleave_send (req_lib_cpg_join->pid,
>   &req_lib_cpg_join->group_name,
>   MESSAGE_REQ_EXEC_CPG_PROCJOIN, CONFCHG_CPG_REASON_JOIN);
> +
> + if (result == -1) {
> + error = CPG_ERR_TRY_AGAIN;
> + cpd->cpd_state = CPD_STATE_UNJOINED;
> + goto response_send;
> + }
>   break;
>   case CPD_STATE_LEAVE_STARTED:
>   error = CPG_ERR_BUSY;

The remainder of the patch is not.  The ipc layer ensures room is available
in the totem queue to handle new totem messages.  If that part isn't
working as expected (i.e. you see a failure in this part of the code) you
should fix the totem pending queue rather than hack it here.

> @@ -1356,6 +1370,7 @@ static void message_handler_req_lib_cpg_leave (void 
> *conn, const void *message)
>   cs_error_t error = CPG_OK;
>   struct req_lib_cpg_leave  *req_lib_cpg_leave = (struct 
> req_lib_cpg_leave *)message;
>   struct cpg_pd *cpd = (struct cpg_pd *)api->ipc_private_data_get (conn);
> + int result;
>  
>   log_printf(LOGSYS_LEVEL_DEBUG, "got leave request on %p\n", conn);
>  
> @@ -1372,10 +1387,14 @@ static void message_handler_req_lib_cpg_leave (void 
> *conn, const void *message)
>   case CPD_STATE_JOIN_COMPLETED:
>   error = CPG_OK;
>   cpd->cpd_state = CPD_STATE_LEAVE_STARTED;
> - cpg_node_joinleave_send (req_lib_cpg_leave->pid,
> + result = cpg_node_joinleave_send (req_lib_cpg_leave->pid,
>   &req_lib_cpg_leave->group_name,
>   MESSAGE_REQ_EXEC_CPG_PROCLEAVE,
>   CONFCHG_CPG_REASON_LEAVE);
> + if (result == -1) {
> + error = CPG_ERR_TRY_AGAIN;
> + cpd->cpd_state = CPD_STATE_JOIN_COMPLETED;
> + }
>   break;
>   }
>  
> @@ -1458,8 +1477,10 @@ static void message_handler_req_lib_cpg_mcast (void 
> *conn, const void *message)
>   req_exec_cpg_iovec[1].iov_base = (char 
> *)&req_lib_cpg_mcast->message;
>   req_exec_cpg_iovec[1].iov_len = msglen;
>  
> - result = api->totem_mcast (req_exec_cpg_iovec, 2, TOTEM_AGREED);
> - assert(result == 0);
> + result = api->totem_mcast (req_exec_cpg_iovec, 2, TOTEM_AGREED);
> + if (result == -1) {
> + error = CPG_ERR_TRY_AGAIN;
> + }
>   }
>  
>   res_lib_cpg_mcast.header.size = sizeof(res_lib_cpg_mcast);

Re: [Openais] [PATCH] coroipcc: use malloc for path in service_connect

2011-08-08 Thread Steven Dake

Reviewed-by: Steven Dake 

On 07/27/2011 08:31 AM, Jan Friesse wrote:
> Coroipcc appropriately uses PATH_MAX sized variables for various data
> structures handling files in the initialization of the client.  Due to
> the use of 12 of these structures declared as stack variables, the
> application stack balloons to over 12*4k. This is especially problematic
> if threads are used by long running daemons to restart the connection
> to corosync so as to be resilient in the face of system services
> restarting (service corosync restart).
> 
> A simple alternative is to allocate temporary memory to avoid
> requirements of large thread stacks.
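
The idea described above can be sketched in isolation: group the PATH_MAX-sized buffers in one struct, allocate it on the heap for the duration of the call, and free it on every exit path. This is an illustration only, not the coroipcc code; struct scratch_paths and demo_connect are hypothetical names, and the path layout is invented for the example.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef PATH_MAX
#define PATH_MAX 4096	/* fallback for platforms that do not define it */
#endif

/* Large temporary buffers grouped so they can live on the heap. */
struct scratch_paths {
	char control_map_path[PATH_MAX];
	char request_map_path[PATH_MAX];
};

/*
 * Builds a path under 'dir' using heap-allocated scratch space.
 * Returns 0 and copies the path into 'out' on success, -1 on failure.
 */
static int demo_connect(const char *dir, char *out, size_t out_len)
{
	struct scratch_paths *paths;
	int res = -1;
	int len;

	/* calloc zeroes the buffers, like malloc followed by memset. */
	paths = calloc(1, sizeof(*paths));
	if (paths == NULL) {
		return -1;
	}

	len = snprintf(paths->control_map_path,
		sizeof(paths->control_map_path), "%s/control_buffer", dir);
	if (len >= 0 &&
	    (size_t)len < sizeof(paths->control_map_path) &&
	    (size_t)len < out_len) {
		snprintf(out, out_len, "%s", paths->control_map_path);
		res = 0;
	}

	/* Single release point: the scratch space never outlives the call. */
	free(paths);
	return res;
}
```

The stack frame of demo_connect stays a few words in size regardless of PATH_MAX, which is the property that matters for threads with small stacks.
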
> 
> Original patch by Dan Clark <2cla...@gmail.com>
> 
> Signed-off-by: Jan Friesse 
> ---
>  lib/coroipcc.c |   67 +--
>  1 files changed, 40 insertions(+), 27 deletions(-)
> 
> diff --git a/lib/coroipcc.c b/lib/coroipcc.c
> index 14860e2..54d9aa7 100644
> --- a/lib/coroipcc.c
> +++ b/lib/coroipcc.c
> @@ -86,6 +86,15 @@ struct ipc_instance {
>   pthread_mutex_t mutex;
>  };
>  
> +struct ipc_path_data {
> + mar_req_setup_t req_setup;
> + mar_res_setup_t res_setup;
> + char control_map_path[PATH_MAX];
> + char request_map_path[PATH_MAX];
> + char response_map_path[PATH_MAX];
> + char dispatch_map_path[PATH_MAX];
> +};
> +
>  void ipc_hdb_destructor (void *context);
>  
>  DECLARE_HDB_DATABASE(ipc_hdb,ipc_hdb_destructor);
> @@ -579,12 +588,7 @@ coroipcc_service_connect (
>   union semun semun;
>  #endif
>   int sys_res;
> - mar_req_setup_t req_setup;
> - mar_res_setup_t res_setup;
> - char control_map_path[PATH_MAX];
> - char request_map_path[PATH_MAX];
> - char response_map_path[PATH_MAX];
> - char dispatch_map_path[PATH_MAX];
> + struct ipc_path_data *path_data;
>  
>   res = hdb_error_to_cs (hdb_handle_create (&ipc_hdb,
>   sizeof (struct ipc_instance), handle));
> @@ -597,8 +601,6 @@ coroipcc_service_connect (
>   return (res);
>   }
>  
> - res_setup.error = CS_ERR_LIBRARY;
> -
>  #if defined(COROSYNC_SOLARIS)
>   request_fd = socket (PF_UNIX, SOCK_STREAM, 0);
>  #else
> @@ -611,6 +613,14 @@ coroipcc_service_connect (
>   socket_nosigpipe (request_fd);
>  #endif
>  
> + path_data = malloc (sizeof(*path_data));
> + if (path_data == NULL) {
> + goto error_connect;
> + }
> + memset(path_data, 0, sizeof(*path_data));
> +
> + path_data->res_setup.error = CS_ERR_LIBRARY;
> +
>   memset (&address, 0, sizeof (struct sockaddr_un));
>   address.sun_family = AF_UNIX;
>  #if defined(COROSYNC_BSD) || defined(COROSYNC_DARWIN)
> @@ -630,7 +640,7 @@ coroipcc_service_connect (
>   }
>  
>   sys_res = memory_map (
> - control_map_path,
> + path_data->control_map_path,
>   "control_buffer-XX",
>   (void *)&ipc_instance->control_buffer,
>   8192);
> @@ -640,7 +650,7 @@ coroipcc_service_connect (
>   }
>  
>   sys_res = memory_map (
> - request_map_path,
> + path_data->request_map_path,
>   "request_buffer-XX",
>   (void *)&ipc_instance->request_buffer,
>   request_size);
> @@ -650,7 +660,7 @@ coroipcc_service_connect (
>   }
>  
>   sys_res = memory_map (
> - response_map_path,
> + path_data->response_map_path,
>   "response_buffer-XX",
>   (void *)&ipc_instance->response_buffer,
>   response_size);
> @@ -660,7 +670,7 @@ coroipcc_service_connect (
>   }
>  
>   sys_res = circular_memory_map (
> - dispatch_map_path,
> + path_data->dispatch_map_path,
>   "dispatch_buffer-XX",
>   (void *)&ipc_instance->dispatch_buffer,
>   dispatch_size);
> @@ -715,33 +725,33 @@ coroipcc_service_connect (
>   /*
>* Initialize IPC setup message
>*/
> - req_setup.service = service;
> - strcpy (req_setup.control_file, control_map_path);
> - strcpy (req_setup.request_file, request_map_path);
> - strcpy (req_setup.response_file, response_map_path);
> - strcpy (req_setup.dispatch_file, dispatch_map_path);
> - req_setup.control_size = 8192;
> - req_setup.request_size = request_size;
> - req_setup.response_size = response_size;
> - req_setup.dispatch_size = dispatch_size;
> + path_data->req_setup.service = service;
> + strcpy (path_data->req_setup.control_file, path_data->control_map_path);
> + strcpy (path_data->req_setup.request_file, path_data->request_map_path);
> + strcpy (path_data->req_setup.response_file, 
> path_data->response_map_path);
> + strcpy (path_data->req_setup.dispatch_file, 
> path_data->dispatch_map_path);
> + path_data->req_setup.control_size = 8192;
> + path_data->req_setup.request_size = request_size;
> + path_data->req_setup.re

Re: [Openais] [PATCH] Revert "totemsrp: Remove recv_flush code"

2011-08-08 Thread Steven Dake

Reviewed-by: Steven Dake 

On 07/27/2011 05:49 AM, Jan Friesse wrote:
> This reverts commit 2167
> 
> Reversion is needed to remove overflow of receive buffers and dropping
> messages.
> 
> Signed-off-by: Jan Friesse 
> ---
>  branches/whitetank/exec/totemnet.c |   45 -
>  branches/whitetank/exec/totemnet.h |2 +
>  branches/whitetank/exec/totemrrp.c |   65 
> 
>  branches/whitetank/exec/totemrrp.h |2 +
>  branches/whitetank/exec/totemsrp.c |2 +
>  5 files changed, 115 insertions(+), 1 deletions(-)
> 
> diff --git a/branches/whitetank/exec/totemnet.c 
> b/branches/whitetank/exec/totemnet.c
> index b5c4293..154aa4f 100644
> --- a/branches/whitetank/exec/totemnet.c
> +++ b/branches/whitetank/exec/totemnet.c
> @@ -148,6 +148,8 @@ struct totemnet_instance {
>  
>   struct iovec totemnet_iov_recv;
>  
> + struct iovec totemnet_iov_recv_flush;
> +
>   struct totemnet_socket totemnet_sockets;
>  
>   struct totem_ip_address mcast_address;
> @@ -215,6 +217,9 @@ static void totemnet_instance_initialize (struct 
> totemnet_instance *instance)
>   instance->totemnet_iov_recv.iov_base = instance->iov_buffer;
>  
>   instance->totemnet_iov_recv.iov_len = FRAME_SIZE_MAX; //sizeof 
> (instance->iov_buffer);
> + instance->totemnet_iov_recv_flush.iov_base = instance->iov_buffer_flush;
> +
> + instance->totemnet_iov_recv_flush.iov_len = FRAME_SIZE_MAX; //sizeof 
> (instance->iov_buffer);
>  
>   /*
>* There is always atleast 1 processor
> @@ -629,7 +634,11 @@ static int net_deliver_fn (
>   unsigned char *msg_offset;
>   unsigned int size_delv;
>  
> - iovec = &instance->totemnet_iov_recv;
> + if (instance->flushing == 1) {
> + iovec = &instance->totemnet_iov_recv_flush;
> + } else {
> + iovec = &instance->totemnet_iov_recv;
> + }
>  
>   /*
>* Receive datagram
> @@ -1310,6 +1319,40 @@ error_exit:
>   return (res);
>  }
>  
> +int totemnet_recv_flush (totemnet_handle handle)
> +{
> + struct totemnet_instance *instance;
> + struct pollfd ufd;
> + int nfds;
> + int res = 0;
> +
> + res = hdb_handle_get (&totemnet_instance_database, handle,
> + (void *)&instance);
> + if (res != 0) {
> + res = ENOENT;
> + goto error_exit;
> + }
> +
> + instance->flushing = 1;
> +
> + do {
> + ufd.fd = instance->totemnet_sockets.mcast_recv;
> + ufd.events = POLLIN;
> + nfds = poll (&ufd, 1, 0);
> + if (nfds == 1 && ufd.revents & POLLIN) {
> + net_deliver_fn (0, instance->totemnet_sockets.mcast_recv,
> + ufd.revents, instance);
> + }
> + } while (nfds == 1);
> +
> + instance->flushing = 0;
> +
> + hdb_handle_put (&totemnet_instance_database, handle);
> +
> +error_exit:
> + return (res);
> +}
> +
>  int totemnet_send_flush (totemnet_handle handle)
>  {
>   struct totemnet_instance *instance;
> diff --git a/branches/whitetank/exec/totemnet.h 
> b/branches/whitetank/exec/totemnet.h
> index 521743a..f4788ab 100644
> --- a/branches/whitetank/exec/totemnet.h
> +++ b/branches/whitetank/exec/totemnet.h
> @@ -88,6 +88,8 @@ extern int totemnet_mcast_noflush_send (
>   struct iovec *iovec,
>   unsigned int iov_len);
>  
> +extern int totemnet_recv_flush (totemnet_handle handle);
> +
>  extern int totemnet_send_flush (totemnet_handle handle);
>  
>  extern int totemnet_iface_check (totemnet_handle handle);
> diff --git a/branches/whitetank/exec/totemrrp.c 
> b/branches/whitetank/exec/totemrrp.c
> index 9864a88..f471c5b 100644
> --- a/branches/whitetank/exec/totemrrp.c
> +++ b/branches/whitetank/exec/totemrrp.c
> @@ -131,6 +131,9 @@ struct rrp_algo {
>   struct iovec *iovec,
>   unsigned int iov_len);  
>  
> + void (*recv_flush) (
> + struct totemrrp_instance *instance);
> +
>   void (*send_flush) (
>   struct totemrrp_instance *instance);
>  
> @@ -241,6 +244,9 @@ static void none_token_send (
>   struct iovec *iovec,
>   unsigned int iov_len);  
>  
> +static void none_recv_flush (
> + struct totemrrp_instance *instance);
> +
>  static void none_send_flush (
>   struct totemrrp_instance *instance);
>  
> @@ -296,6 +302,9 @@ static void passive_token_send (
>   struct iovec *iovec,
>   unsigned int iov_len);  
>

Re: [Openais] [PATCH] Add systemd unit files for corosync and corosync-notifyd

2011-08-08 Thread Steven Dake

Reviewed-by: Steven Dake 

Regards
-steve

On 08/08/2011 04:04 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  configure.ac |8 
>  corosync.spec.in |   12 
>  init/.gitignore  |2 ++
>  init/Makefile.am |   15 +++
>  init/corosync-notifyd.service.in |   11 +++
>  init/corosync.service.in |   12 
>  6 files changed, 56 insertions(+), 4 deletions(-)
>  create mode 100644 init/corosync-notifyd.service.in
>  create mode 100644 init/corosync.service.in
> 
> diff --git a/configure.ac b/configure.ac
> index e00edeb..563f799 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -279,6 +279,11 @@ AC_ARG_ENABLE([augeas],
>   [ enable_augeas="no" ])
>  AM_CONDITIONAL(INSTALL_AUGEAS, test x$enable_augeas = xyes)
>  
> +AC_ARG_ENABLE([systemd],
> +   [  --enable-systemd   : Install systemd service 
> files],,
> + [ enable_systemd="no" ])
> +AM_CONDITIONAL(INSTALL_SYSTEMD, test x$enable_systemd = xyes)
> +
>  AC_ARG_WITH([initddir],
>   [  --with-initddir=DIR : path to init script directory. ],
>   [ INITDDIR="$withval" ],
> @@ -448,6 +453,9 @@ fi
>  if test "x${enable_augeas}" = xyes; then
>   PACKAGE_FEATURES="$PACKAGE_FEATURES augeas"
>  fi
> +if test "x${enable_systemd}" = xyes; then
> + PACKAGE_FEATURES="$PACKAGE_FEATURES systemd"
> +fi
>  
>  if test "x${enable_snmp}" = xyes; then
> SNMPCONFIG=""
> diff --git a/corosync.spec.in b/corosync.spec.in
> index 5eba3bc..b864087 100644
> --- a/corosync.spec.in
> +++ b/corosync.spec.in
> @@ -11,6 +11,7 @@
>  %bcond_with snmp
>  %bcond_with dbus
>  %bcond_with rdma
> +%bcond_with systemd
>  
>  Name: corosync
>  Summary: The Corosync Cluster Engine and Application Programming Interfaces
> @@ -46,6 +47,9 @@ BuildRequires: net-snmp-devel
>  %if %{with dbus}
>  BuildRequires: dbus-devel
>  %endif
> +%if %{with systemd}
> +BuildRequires: systemd-units
> +%endif
>  
>  BuildRoot: %(mktemp -ud %{_tmppath}/%{name}-%{version}-%{release}-XX)
>  
> @@ -83,6 +87,9 @@ export rdmacm_LIBS=-lrdmacm \
>  %if %{with rdma}
>   --enable-rdma \
>  %endif
> +%if %{with systemd}
> + --enable-systemd \
> +%endif
>   --with-initddir=%{_initrddir}
>  
>  make %{_smp_mflags}
> @@ -146,8 +153,13 @@ fi
>  %if %{with snmp}
>  %{_datadir}/snmp/mibs/COROSYNC-MIB.txt
>  %endif
> +%if %{with systemd}
> +%{_unitdir}/corosync.service
> +%{_unitdir}/corosync-notifyd.service
> +%else
>  %{_initrddir}/corosync
>  %{_initrddir}/corosync-notifyd
> +%endif
>  %dir %{_libexecdir}/lcrso
>  %{_libexecdir}/lcrso/coroparse.lcrso
>  %{_libexecdir}/lcrso/objdb.lcrso
> diff --git a/init/.gitignore b/init/.gitignore
> index 0a75c32..34e4cb8 100644
> --- a/init/.gitignore
> +++ b/init/.gitignore
> @@ -1,2 +1,4 @@
>  generic
>  notifyd
> +corosync.service
> +corosync-notifyd.service
> diff --git a/init/Makefile.am b/init/Makefile.am
> index 0ca9ee9..90d49c4 100644
> --- a/init/Makefile.am
> +++ b/init/Makefile.am
> @@ -34,9 +34,14 @@
>  
>  MAINTAINERCLEANFILES = Makefile.in
>  
> -EXTRA_DIST   = generic.in notifyd.in
> +EXTRA_DIST   = generic.in notifyd.in corosync.service.in 
> corosync-notifyd.service.in
>  
> +if INSTALL_SYSTEMD
> +systemdconfdir = /lib/systemd/system
> +systemdconf_DATA = corosync.service corosync-notifyd.service
> +else
>  target_INIT  = generic notifyd
> +endif
>  
>  %: %.in Makefile
>   rm -f $@-t $@
> @@ -46,14 +51,15 @@ target_INIT   = generic notifyd
>   -e 's#@''INITDDIR@#$(INITDDIR)#g' \
>   -e 's#@''LOCALSTATEDIR@#$(localstatedir)#g' \
>   $< > $@-t
> - chmod 0755 $@-t
>   mv $@-t $@
>  
> -all-local: $(target_INIT)
> +all-local: $(target_INIT) $(systemdconf_DATA)
>  
>  clean-local:
> - rm -rf $(target_INIT)
> + rm -rf $(target_INIT) $(systemdconf_DATA)
>  
> +if INSTALL_SYSTEMD
> +else
>  install-exec-local:
>   $(INSTALL) -d $(DESTDIR)/$(INITDDIR)
>   $(INSTALL) -m 755 generic $(DESTDIR)/$(INITDDIR)/corosync
> @@ -62,3 +68,4 @@ install-exec-local:
>  uninstall-local:
>   cd $(DESTDIR)/$(INITDDIR) && \
>   rm -f corosync corosync-notifyd
> +endif
> diff --git a/init/corosync-notifyd.service.in 
> b/init/corosync-notifyd.service.in
> new file mode 100644
> index 000..26a278a

[Openais] feature proposal: take 2 of quorum

2011-08-08 Thread Steven Dake

On 08/08/2011 12:25 AM, Fabio M. Di Nitto wrote:
> On 8/7/2011 6:57 PM, Steven Dake wrote:
>> Believe many in community are on vacation during our proposal window.
>> As a result, I'm extending until Aug 30th.
>>
> 
> topic-quorum ? as we discussed recently on IRC, in order to replace cman.
> 

Can you write a full proposal that captures our conversation on IRC?

thanks
-steve

> Fabio
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] Problems forming cluster on corosync startup

2011-08-08 Thread Steven Dake

On 08/08/2011 12:10 AM, Tim Beale wrote:
> Hi Steve,
> 
> Thanks for your help. I tried out your patch but the problem still
> occurs. The problem looks to me due to the ring-IDs used when forming
> the transitional memb-list, rather than with the memb-list itself. The
> ring-ID of the nodes still in Recovery is older than the rest of the
> nodes who have already shifted to Operational.
> 
> Attached is my attempt at fixing the problem. The idea is to delay the
> nodes processing a Memb-Join immediately after shifting to
> Operational, until the token has rotated the ring once.
> 
> It doesn't quite work either though. The nodes are still re-entering
> gather before all have left recovery. This time it's due to processing
> a Merge-Detect message. One node has just started up and set itself to
> the rep, and sends out a Merge-Detect which triggers the other nodes
> to enter gather and reform the ring.
> 
> Let me know if you have any other advice.
> 

The problem is clear from the blackbox: 8 nodes enter operational while
1 node in recovery is interrupted by a join message.  This interrupted node
then proceeds with a transitional membership of 1 node (which is correct).

The joined and left lists use the transitional list to determine their
contents, which is not correct.  This results in incorrect data
delivered to clm.  Try the follow-up patch which should correctly
calculate the joined and left lists.
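
As an illustration only (this is not the code from the follow-up patch),
the general idea of deriving the joined and left lists from the old and
new memberships, rather than from the transitional list, can be sketched
as below.  The node_id_t type and the set_contains/set_subtract helpers
are hypothetical stand-ins; totemsrp.c works on struct srp_addr with its
own memb_set_* functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a node address (corosync uses struct srp_addr). */
typedef unsigned int node_id_t;

/* Return 1 if id is present in set[0..entries), otherwise 0. */
static int set_contains(const node_id_t *set, size_t entries, node_id_t id)
{
	size_t i;

	for (i = 0; i < entries; i++) {
		if (set[i] == id) {
			return 1;
		}
	}
	return 0;
}

/*
 * Write every member of a that is not in b into out and return how many
 * entries were written.  out must have room for a_entries elements.
 */
static size_t set_subtract(const node_id_t *a, size_t a_entries,
	const node_id_t *b, size_t b_entries, node_id_t *out)
{
	size_t i;
	size_t n = 0;

	assert(out != NULL);
	for (i = 0; i < a_entries; i++) {
		if (!set_contains(b, b_entries, a[i])) {
			out[n++] = a[i];
		}
	}
	return n;
}
```

With an old membership of {1, 2, 3} and a new membership of {2, 3, 4},
set_subtract(old, new) yields the left list {1} and set_subtract(new, old)
yields the joined list {4}.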


> Thanks,
> Tim
> 
> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake  wrote:
>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>> Hi,
>>>
>>> It looks to me that the way the transition from Recovery to Operational works,
>>> we can't guarantee that all nodes in the ring have entered Operational before
>>> a node processes another Memb-Join message from a new node. E.g. we can't
>>> guarantee the token has rotated right the way around the ring.
>>>
>>> When this happens, the nodes still in Recovery will still use the older ring
>>> ID. So they won't get added to the transitional membership, and CLM will report
>>> leave events for these nodes. (Plus there might be other side-effects, like the
>>> FAILED TO RECEIVE problem - I haven't quite worked out why that's happening).
>>>
>>
>> Thanks for the pointer here - patch on ml.
>>
>>> We are currently using CLM to check the health of a node, i.e. so we can detect
>>> if it locks up. My questions are:
>>> i) Are there config settings we could change to improve this, like increasing
>>> the 'join' timeout?
>>> ii) Should I try to make a code change to fix the problem? E.g. delay
>>> processing the Memb-Join message if the node's only just entered operational.
>>> iii) Should we not be using CLM like this? I.e. should we just learn to live
>>> with CLM/CPG sometimes reporting nodes as leaving when they're perfectly
>>> healthy.
>>>
>>> Thanks for your help.
>>> Tim
>>>
>>
>> Tim please try the patch I have recently posted:
>> [PATCH] Set my_new_memb_list in recovery enter
>>
>> First and foremost, let me know if it resolves your 10 node startup case
>> which fails 10% of the time.  Then let me know if it treats other symptoms.
>>
>> Regards
>> -steve
>>
>>
>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale  wrote:
>>>> Hi,
>>>>
>>>> We're booting up a 10-node cluster (with all nodes starting corosync at roughly
>>>> the same time) and approx 1 in 10 times we see some problems:
>>>> a) CLM is reporting nodes as leaving and then immediately rejoining (not sure
>>>> if this is valid behaviour?)
>>>> b) Probably an unrelated oddity, but we're getting flow control enabled on a
>>>> client daemon using CLM that's only sending one request (saClmClusterTrack()).
>>>> c) A node is hitting the FAILED TO RECEIVE case
>>>> d) After c) there seems to be a lot of churn as the cluster tries to reform
>>>> e) During the processing of node leave events, the CPG client can sometimes get
>>>> broken so it no longer processes *any* CPG events
>>>>
>>>> Corosync debug is attached (I commented out some of the noisier debug around
>>>> message delivery). We don't really know enough about corosync to tell what
>>>> exactly is incorr

[Openais] [PATCH] Make joined and left lists deliver correct results

2011-08-08 Thread Steven Dake

Signed-off-by: Steven Dake 
---
 exec/totemsrp.c |   47 ++-
 1 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 4a299a0..a97ed49 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1349,6 +1349,35 @@ static void memb_set_and_with_ring_id (
return;
 }
 
+static void memb_set_and (
+   struct srp_addr *set1,
+   int set1_entries,
+   struct srp_addr *set2,
+   int set2_entries,
+   struct srp_addr *and,
+   int *and_entries)
+{
+   int i;
+   int j;
+   int found = 0;
+
+   *and_entries = 0;
+
+   for (i = 0; i < set2_entries; i++) {
+   for (j = 0; j < set1_entries; j++) {
+   if (srp_addr_equal (&set1[j], &set2[i])) {
+   found = 1;
+   break;
+   }
+   }
+   if (found) {
+   srp_addr_copy (&and[*and_entries], &set1[j]);
+   *and_entries = *and_entries + 1;
+   }
+   found = 0;
+   }
+   return;
+}
 #ifdef CODE_COVERAGE
 static void memb_set_print (
char *string,
@@ -1718,6 +1747,8 @@ static void memb_state_operational_enter (struct totemsrp_instance *instance)
unsigned int trans_memb_list_totemip[PROCESSOR_COUNT_MAX];
unsigned int new_memb_list_totemip[PROCESSOR_COUNT_MAX];
unsigned int left_list[PROCESSOR_COUNT_MAX];
+   struct srp_addr difference_list[PROCESSOR_COUNT_MAX];
+   int difference_list_entries = 0;
unsigned int i;
unsigned int res;
 
@@ -1739,14 +1770,20 @@ static void memb_state_operational_enter (struct totemsrp_instance *instance)
/*
 * Calculate joined and left list
 */
-   memb_set_subtract (instance->my_left_memb_list,
-   &instance->my_left_memb_entries,
+   memb_set_subtract (difference_list,
+   &difference_list_entries,
+   instance->my_new_memb_list, instance->my_new_memb_entries,
+   instance->my_memb_list, instance->my_memb_entries);
+
+   memb_set_and (
+   difference_list, difference_list_entries,
instance->my_memb_list, instance->my_memb_entries,
-   instance->my_trans_memb_list, instance->my_trans_memb_entries);
+   instance->my_left_memb_list, &instance->my_left_memb_entries);
 
-   memb_set_subtract (joined_list, &joined_list_entries,
+   memb_set_and (
+   difference_list, difference_list_entries,
instance->my_new_memb_list, instance->my_new_memb_entries,
-   instance->my_trans_memb_list, instance->my_trans_memb_entries);
+   joined_list, &joined_list_entries);
 
/*
 * Install new membership
-- 
1.7.6


Re: [Openais] Problems forming cluster on corosync startup

2011-08-07 Thread Steven Dake

On 08/03/2011 10:32 PM, Tim Beale wrote:
> Hi,
> 
> It looks to me that the way the transition from Recovery to Operational works,
> we can't guarantee that all nodes in the ring have entered Operational before
> a node processes another Memb-Join message from a new node. E.g. we can't
> guarantee the token has rotated right the way around the ring.
> 
> When this happens, the nodes still in Recovery will still use the older ring
> ID. So they won't get added to the transitional membership, and CLM will report
> leave events for these nodes. (Plus there might be other side-effects, like the
> FAILED TO RECEIVE problem - I haven't quite worked out why that's happening).
> 

Thanks for the pointer here - patch on ml.

> We are currently using CLM to check the health of a node, i.e. so we can detect
> if it locks up. My questions are:
> i) Are there config settings we could change to improve this, like increasing
> the 'join' timeout?
> ii) Should I try to make a code change to fix the problem? E.g. delay
> processing the Memb-Join message if the node's only just entered operational.
> iii) Should we not be using CLM like this? I.e. should we just learn to live
> with CLM/CPG sometimes reporting nodes as leaving when they're perfectly
> healthy.
> 
> Thanks for your help.
> Tim
> 

Tim, please try the patch I recently posted:
[PATCH] Set my_new_memb_list in recovery enter

First and foremost, let me know if it resolves your 10 node startup case
which fails 10% of the time.  Then let me know if it treats other symptoms.

Regards
-steve


> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale  wrote:
>> Hi,
>>
>> We're booting up a 10-node cluster (with all nodes starting corosync at roughly
>> the same time) and approx 1 in 10 times we see some problems:
>> a) CLM is reporting nodes as leaving and then immediately rejoining (not sure
>> if this is valid behaviour?)
>> b) Probably an unrelated oddity, but we're getting flow control enabled on a
>> client daemon using CLM that's only sending one request (saClmClusterTrack()).
>> c) A node is hitting the FAILED TO RECEIVE case
>> d) After c) there seems to be a lot of churn as the cluster tries to reform
>> e) During the processing of node leave events, the CPG client can sometimes get
>> broken so it no longer processes *any* CPG events
>>
>> Corosync debug is attached (I commented out some of the noisier debug around
>> message delivery). We don't really know enough about corosync to tell what
>> exactly is incorrect behaviour and what should be fixed. But here's what we've
>> noticed:
>> 1) Node-4 joins soon after node-1. When this happens all nodes except node-12
>> have entered operational state (see node-12.txt line 235). It looks like maybe
>> node-12 hasn't received enough rotations of the token to enter operational yet.
>> Node-12's resulting transitional config consists of just itself. All nodes then
>> report node-1 and node-12 as leaving and immediately rejoining.
>> 2) After this config change, node-3 eventually hits the FAILED TO RECEIVE case
>> (node-3.txt line 380). At this point node-1 and node-12 have an ARU matching
>> the high_seq_received, all other nodes have an ARU of zero.
>> 3) Node-3 entering gather seems to result in a lot of config change churn
>> across the cluster.
>> 4) While processing the config changes on node-3, the CPG downlist it uses
>> contains itself. When node-3 sends leave events for the nodes in the downlist
>> (including itself), it sets its own cpd state to CPD_STATE_UNJOINED and clears
>> the cpd->group_name. This means it no longer sends any CPG events to the CPG
>> client.
>>
>> We tried cherry-picking this commit to fix the problem (#4) with the CPG client.
>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>> It helped a bit, but didn't fix it completely. We've made an interim change
>> (attached) to avoid this problem.
>>
>> We're using corosync v1.3.1 on an embedded linux system (with a low-spec CPU).
>> Corosync is running over a basic ethernet interface (no hubs/routers/etc).
>>
>> Any help would be appreciated. Let me know if there's any other debug I can
>> provide.
>>



>> Thanks,
>> Tim
>>

[Openais] [PATCH] Set my_new_memb_list in recovery enter

2011-08-07 Thread Steven Dake

Currently my_new_memb_list is set in commit_enter, resulting in join messages
being accepted during commit/recovery phases which are not appropriate to
maintain protocol guarantees.

Signed-off-by: Steven Dake 
---
 exec/totemsrp.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 4a299a0..44623d8 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1991,6 +1991,11 @@ static void memb_state_recovery_enter (
log_printf (instance->totemsrp_log_level_debug,
"entering RECOVERY state.\n");
 
+   memcpy (instance->my_new_memb_list, addr,
+   sizeof (struct srp_addr) * instance->commit_token->addr_entries);
+
+   instance->my_new_memb_entries = instance->commit_token->addr_entries;
+
instance->orf_token_discard = 0;
 
instance->my_high_ring_delivered = 0;
@@ -2766,11 +2771,6 @@ static void memb_state_commit_token_update (
addr = (struct srp_addr *)instance->commit_token->end_of_commit_token;
memb_list = (struct memb_commit_token_memb_entry *)(addr + instance->commit_token->addr_entries);
 
-   memcpy (instance->my_new_memb_list, addr,
-   sizeof (struct srp_addr) * instance->commit_token->addr_entries);
-
-   instance->my_new_memb_entries = instance->commit_token->addr_entries;
-
memcpy (&memb_list[instance->commit_token->memb_index].ring_id,
&instance->my_old_ring_id, sizeof (struct memb_ring_id));
 
-- 
1.7.6


[Openais] Corosync 2.0 Feature Request: Centralize the encryption/decryption into one file

2011-08-07 Thread Steven Dake

Each network driver has encryption code in it.  Centralize that
encryption code in one file so that it may be maintained in one place
rather than in 3 separate drivers.

This is the topic-onecrypt topic in the TODO file.

Regards
-steve

[Openais] Corosync 2.0 Feature Request: Use zero-copy operation with RDMA networks

2011-08-07 Thread Steven Dake

Totem currently copies each packet into the network layer.  This results
in an extra copy in RDMA networks.  To reduce cpu utilization and
improve performance, allocate these packets from the totem network layer
before sending the packet.  This removes an extra memory copy operation
in RDMA networks.

This is the topic-netmalloc topic in the TODO file.
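
A rough sketch of the difference between the two send paths follows.  It
is not corosync code: struct net_transport, net_send_copy, net_buf_alloc
and net_send_inplace are hypothetical names, and a plain array stands in
for transport-owned (for example RDMA-registered) memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NET_BUF_SIZE 1500

/* Hypothetical transport that owns its send buffer. */
struct net_transport {
	unsigned char send_buf[NET_BUF_SIZE];
	size_t last_sent_len;
	unsigned int copies;	/* memcpy calls made by the send path */
};

/* Copying path: the caller builds the packet elsewhere, the transport copies it. */
static void net_send_copy(struct net_transport *t, const void *pkt, size_t len)
{
	assert(len <= NET_BUF_SIZE);
	memcpy(t->send_buf, pkt, len);
	t->copies++;
	t->last_sent_len = len;
}

/* Zero-copy path, step 1: hand the caller a buffer inside the transport. */
static void *net_buf_alloc(struct net_transport *t, size_t len)
{
	return (len <= NET_BUF_SIZE) ? t->send_buf : NULL;
}

/* Zero-copy path, step 2: the packet is already in place, so nothing is copied. */
static void net_send_inplace(struct net_transport *t, size_t len)
{
	assert(len <= NET_BUF_SIZE);
	t->last_sent_len = len;
}
```

In the second path the protocol layer writes the packet directly into the
buffer returned by net_buf_alloc, so the send call itself performs no
memory copy.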

Regards
-steve

[Openais] Corosync 2.0 Feature Request: Experiment with rdma support without using librdmacm

2011-08-07 Thread Steven Dake

The librdmacm libs assume a connection-oriented mechanism, whereas totem
assumes connectionless operation.  The RDMA technology can instead be
exposed through ibverbs alone.

The advantage is improved reliability with RDMA networks.

This is the topic-rdmaud topic in the TODO file.


[Openais] Extending call for Corosync RFEs until Aug 30th

2011-08-07 Thread Steven Dake

I believe many in the community are on vacation during our proposal window.
As a result, I'm extending until Aug 30th.

Regards
-steve

Re: [Openais] [GIT PULL] Changes to example configuration file (corosync.conf.example)

2011-08-07 Thread Steven Dake

Florian

This has been processed.  Apologies for the delay - very busy week.

Regards
-steve

On 08/01/2011 07:17 AM, Florian Haas wrote:
> Steve,
> 
> please consider pulling the following changes since commit
> d4fb83e971b6fa9af0447ce0a70345fb20064dc1:
> 
>   main: let poll really stop before totempg_finalize (2011-07-26 10:07:08 +0200)
> 
> from the git repository at:
>   git://github.com/fghaas/corosync master
> 
> All changes have undergone review on the list. Thanks to Dan Frincu and
> Jan Friesse for their valuable feedback. A patch-by-patch summary and
> diffstat are below, as usual.
> 
> Cheers,
> Florian
> 
> 
> Florian Haas (4):
>   corosync.conf.example: change bindnetaddr
>   corosync.conf.example: change mcastaddr
>   corosync.conf.example: include comments
>   corosync.conf.example: add note about host addresses in bindnetaddr
> 
>  conf/corosync.conf.example |   54
> +--
>  1 files changed, 51 insertions(+), 3 deletions(-)
> 
> 
> 
> 
> 

Re: [Openais] [PATCH] Make realtime scheduling optional not the default.

2011-08-07 Thread Steven Dake

Good work

Reviewed-by: Steven Dake 

On 08/07/2011 05:40 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  configure.ac   |6 ++
>  exec/main.c|   21 +++--
>  man/corosync.8 |7 +--
>  3 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index 35e3cfb..e00edeb 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -73,6 +73,12 @@ AC_CHECK_LIB([socket], [socket])
>  AC_CHECK_LIB([nsl], [t_open])
>  AC_CHECK_LIB([rt], [sched_getscheduler])
>  PKG_CHECK_MODULES([LIBQB], [libqb])
> +AC_CHECK_LIB([qb], [qb_log_thread_priority_set], \
> +  have_qb_log_thread_priority_set="yes", \
> +  have_qb_log_thread_priority_set="no")
> +if test "x${have_qb_log_thread_priority_set}" = xyes; then
> + AC_DEFINE_UNQUOTED([HAVE_QB_LOG_THREAD_PRIORITY_SET], 1, [have 
> qb_log_thread_priority_set])
> +fi
>  
>  # Checks for header files.
>  AC_FUNC_ALLOCA
> diff --git a/exec/main.c b/exec/main.c
> index 9b2c941..a822120 100644
> --- a/exec/main.c
> +++ b/exec/main.c
> @@ -980,13 +980,19 @@ static void corosync_setscheduler (void)
>   global_sched_param.sched_priority);
>  
>   global_sched_param.sched_priority = 0;
> - logsys_thread_priority_set (SCHED_OTHER, NULL, 1);
> +#ifdef HAVE_QB_LOG_THREAD_PRIORITY_SET
> + qb_log_thread_priority_set (SCHED_OTHER, 0);
> +#endif
>   } else {
>  
>   /*
>* Turn on SCHED_RR in logsys system
>*/
> - res = logsys_thread_priority_set (SCHED_RR, 
> &global_sched_param, 10);
> +#ifdef HAVE_QB_LOG_THREAD_PRIORITY_SET
> + res = qb_log_thread_priority_set (SCHED_RR, 
> sched_priority);
> +#else
> + res = -1;
> +#endif
>   if (res == -1) {
>   log_printf (LOGSYS_LEVEL_ERROR,
>   "Could not set logsys thread 
> priority."
> @@ -1238,9 +1244,9 @@ int main (int argc, char **argv, char **envp)
>   /* default configuration
>*/
>   background = 1;
> - setprio = 1;
> + setprio = 0;
>  
> - while ((ch = getopt (argc, argv, "fpv")) != EOF) {
> + while ((ch = getopt (argc, argv, "fprv")) != EOF) {
>  
>   switch (ch) {
>   case 'f':
> @@ -1248,7 +1254,9 @@ int main (int argc, char **argv, char **envp)
>   logsys_config_mode_set (NULL, 
> LOGSYS_MODE_OUTPUT_STDERR|LOGSYS_MODE_THREADED|LOGSYS_MODE_FORK);
>   break;
>   case 'p':
> - setprio = 0;
> + break;
> + case 'r':
> + setprio = 1;
>   break;
>   case 'v':
>   printf ("Corosync Cluster Engine, version 
> '%s'\n", VERSION);
> @@ -1260,7 +1268,8 @@ int main (int argc, char **argv, char **envp)
>   fprintf(stderr, \
>   "usage:\n"\
>   "-f : Start application in 
> foreground.\n"\
> - "-p : Do not set process 
> priority.\n"\
> + "-p : Does nothing.\n"\
> + "-r : Set round robin 
> realtime scheduling \n"\
>   "-v : Display version and 
> SVN revision of Corosync and exit.\n");
>   return EXIT_FAILURE;
>   }
> diff --git a/man/corosync.8 b/man/corosync.8
> index c45cc56..016c053 100644
> --- a/man/corosync.8
> +++ b/man/corosync.8
> @@ -35,7 +35,7 @@
>  .SH NAME
>  corosync \- The Corosync Cluster Engine.
>  .SH SYNOPSIS
> -.B "corosync [\-f] [\-p] [\-v]"
> +.B "corosync [\-f] [\-p] [\-r] [\-v]"
>  .SH DESCRIPTION
>  .B corosync
>  Corosync provides clustering infracture such as membership, messaging and 
> quorum.
> @@ -45,7 +45,10 @@ Corosync provides clustering infracture such as 
> membership, messaging and quorum
>  Start application in foreground.
>  .TP
>  .B -p
> -Do not set process priority.
> +Does nothing (was: "Do not set process priority" - this is now the default).
> +.TP
> +.B -r
> +Set round robin realtime scheduling.
>  .TP
>  .B -v
>  Display version and SVN revision of Corosync and exit.


Re: [Openais] [PATCH 6/6] Update TODOs

2011-08-05 Thread Steven Dake

Reviewed-by: Steven Dake 

On 08/05/2011 12:09 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  TODO |   73 +
>  1 files changed, 19 insertions(+), 54 deletions(-)
> 
> diff --git a/TODO b/TODO
> index 9a2db8f..fa30e36 100644
> --- a/TODO
> +++ b/TODO
> @@ -3,69 +3,34 @@ The Corosync Cluster Engine Topic Branches
>  --
>  
>  --
> -Last Updated: October 2010
> +Last Updated: August 2011
>  --
>  
> -We use topic branches in our git repository to develop new disruptive 
> features
> -that define our future roadmap.  This file describes the topic branches
> -the developers have interest in investigating further.
> -
> -targets can be: whitetank, needle, or future (3.0+).
> -Finished can be: percentage or date merged to master.
> -
>  
> --
> -topic-libqb
> +master
>  
> --
> -Main Developer: Angus Salkeld
> -Started: September 2010
> -Finished: 60%
> -target: needle
> -Description:
> -The libqb project is our effort to remove the core infrastructure required 
> for
> -client server operations of corosync from the corosync code base and place
> -inside a separate project.
> +1) exec/totempg.c in check_q_level()
> +   Remove hardcoded values.
> +   Chat to Steve about correcting the queue length calculation.
>  
> -The main purpose of this topic is to investigate integrating corosync with 
> the 
> -libqb package that has been refactored.  Part of this effort also involves
> -investigation into single threaded operation of the IPC layer without
> -peformance penalties.
> +2) check max message size restrictions.
>  
> -------
> -topic-rr
> ---
> -Main Developer: Steven Dake
> -Started: Not Started
> -Finished: 0%
> -target: needle
> -Description:
> -Redundant ring may have quality problems near boundary conditions for 
> sequence
> -numbers.  This effort involves qualifying and hardening redundant ring around
> -these boundary numbers.  A further stretch goal of this topic is to
> -automatically reenable a redundant ring when it has been back in service.
> +3) is this https://github.com/asalkeld/libqb/issues/1 still an issue?
>  
> ---
> -topic-snmp
> ---
> -Main Developer: Angus Salkeld
> -Started: Not Started
> -Finished: 100%
> -target: needle
> -Description:
> -This topic involves investigation of adding SNMP support into Corosync.
> +4) remove "old" stuff from the man pages (logging/IPC).
>  
> +5) new blackbox size might be too small (exec/logsys.c:311)
>  
> ---
> -topic-udpu
> ---
> -Main Developer: Steven Dake
> -Started: October
> -Finished: 80%
> -target: needle
> -Description:
> -The UDPU transport mode offers a mechanism for Corosync to operate in network
> -environments where multicast or broadcast are prohibited.  The main mechanism
> -it uses to do this is to UDP unicast to each of the target node IP addresses
> -listed in the configuation.
> +6) extend the logging config to make better use of the tracing capabilities.
> +
> +
> +
> +We use topic branches in our git repository to develop new disruptive 
> features
> +that define our future roadmap.  This file describes the topic branches
> +the developers have interest in investigating further.
> +
> +targets can be: whitetank, needle, or future (3.0+).
> +Finished can be: percentage or date merged to master.
>  
>  
> --
>  topic-onecrypt


Re: [Openais] [PATCH 4/6] libqb: Add libqb dependency in the rpm & pc file

2011-08-05 Thread Steven Dake

Reviewed-by: Steven Dake 

On 08/05/2011 12:09 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  corosync.spec.in |2 +-
>  pkgconfig/corosync.pc.in |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/corosync.spec.in b/corosync.spec.in
> index 58c4b0d..d50b72c 100644
> --- a/corosync.spec.in
> +++ b/corosync.spec.in
> @@ -36,7 +36,7 @@ Conflicts: openais <= 0.89, openais-devel <= 0.89
>  %if %{buildtrunk}
>  BuildRequires: autoconf automake
>  %endif
> -BuildRequires: nss-devel
> +BuildRequires: nss-devel libqb-devel
>  %if %{with rdma}
>  BuildRequires: libibverbs-devel librdmacm-devel
>  %endif
> diff --git a/pkgconfig/corosync.pc.in b/pkgconfig/corosync.pc.in
> index 820c607..31b354a 100644
> --- a/pkgconfig/corosync.pc.in
> +++ b/pkgconfig/corosync.pc.in
> @@ -8,5 +8,5 @@ socketdir=@COROSOCKETDIR@
>  Name: corosync
>  Version: @LIBVERSION@
>  Description: corosync
> -Requires:
> +Requires: libqb
>  Cflags: -I${includedir}


Re: [Openais] [PATCH 3/6] Fix some compiler warnings

2011-08-05 Thread Steven Dake

Reviewed-by: Steven Dake 

On 08/05/2011 12:09 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  configure.ac  |4 +-
>  exec/crypto.c |2 -
>  exec/main.c   |3 --
>  exec/objdb.c  |   76 
>  lib/confdb.c  |3 ++
>  services/confdb.c |7 +++-
>  services/cpg.c|2 -
>  7 files changed, 51 insertions(+), 46 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index 92aed9e..35e3cfb 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -173,9 +173,9 @@ LIB_MSG_RESULT(m4_shift(m4_shift($@)))dnl
>  
>  ## helper for CC stuff
>  cc_supports_flag() {
> - local CFLAGS="$@"
> + local CPPFLAGS="$CPPFLAGS $@"
>   AC_MSG_CHECKING([whether $CC supports "$@"])
> - AC_COMPILE_IFELSE([int main(){return 0;}] ,
> + AC_PREPROC_IFELSE([AC_LANG_PROGRAM([])],
> [RC=0; AC_MSG_RESULT([yes])],
> [RC=1; AC_MSG_RESULT([no])])
>   return $RC
> diff --git a/exec/crypto.c b/exec/crypto.c
> index 901797a..14fb807 100644
> --- a/exec/crypto.c
> +++ b/exec/crypto.c
> @@ -1140,12 +1140,10 @@ int sha1_done(hash_state * md, unsigned char *hash)
>  int hmac_init(hmac_state *hmac, int hash, const unsigned char *key, unsigned 
> long keylen)
>  {
>  unsigned char buf[128];
> -unsigned long hashsize;
>  unsigned long i;
>  int err;
>  
>  hmac->hash = hash;
> -hashsize   = hash_descriptor[hash]->hashsize;
>  
>  /* valid key length? */
>   assert (keylen > 0);
> diff --git a/exec/main.c b/exec/main.c
> index 006f846..e33a397 100644
> --- a/exec/main.c
> +++ b/exec/main.c
> @@ -807,16 +807,13 @@ static void deliver_fn (
>   int32_t service;
>   int32_t fn_id;
>   uint32_t id;
> - uint32_t size;
>   uint32_t key_incr_dummy;
>  
>   header = msg;
>   if (endian_conversion_required) {
>   id = swab32 (header->id);
> - size = swab32 (header->size);
>   } else {
>   id = header->id;
> - size = header->size;
>   }
>  
>   /*
> diff --git a/exec/objdb.c b/exec/objdb.c
> index 99e20ec..999db61 100644
> --- a/exec/objdb.c
> +++ b/exec/objdb.c
> @@ -112,7 +112,7 @@ static int objdb_init (void)
>  {
>   hdb_handle_t handle;
>   struct object_instance *instance;
> - unsigned int res;
> + int res;
>  
>   res = hdb_handle_create (&object_instance_database,
>   sizeof (struct object_instance), &handle);
> @@ -192,11 +192,12 @@ static void object_created_notification(
>   struct object_instance * obj_pt;
>   struct object_tracker * tracker_pt;
>   hdb_handle_t obj_handle = object_handle;
> - unsigned int res;
>  
>   do {
> - res = hdb_handle_get (&object_instance_database,
> - obj_handle, (void *)&obj_pt);
> + if (hdb_handle_get (&object_instance_database,
> + obj_handle, (void *)&obj_pt) != 0) {
> + return;
> + }
>  
>   for (list = obj_pt->track_head.next;
>   list != &obj_pt->track_head; list = list->next) {
> @@ -226,11 +227,12 @@ static void 
> object_pre_deletion_notification(hdb_handle_t object_handle,
>   struct object_instance * obj_pt;
>   struct object_tracker * tracker_pt;
>   hdb_handle_t obj_handle = object_handle;
> - unsigned int res;
>  
>   do {
> - res = hdb_handle_get (&object_instance_database,
> - obj_handle, (void *)&obj_pt);
> + if (hdb_handle_get (&object_instance_database,
> + obj_handle, (void *)&obj_pt) != 0) {
> + return;
> + }
>  
>   for (list = obj_pt->track_head.next;
>   list != &obj_pt->track_head; list = list->next) {
> @@ -265,11 +267,12 @@ static void 
> object_key_changed_notification(hdb_handle_t object_handle,
>   struct object_instance * owner_pt = NULL;
>   struct object_tracker * tracker_pt;
>   hdb_handle_t obj_handle = object_handle;
> - unsigned int res;
>  
>   do {
> - res = hdb_handle_get (&object_instance_database,
> - obj_handle, (void *)&obj_pt);
> + if (hdb_handle_get (&object_instance_database,
> + obj_handle, (void *)&obj_pt) != 0) {
> + return;
> +

Re: [Openais] [PATCH 2/6] Use PATH_MAX for file path size

2011-08-05 Thread Steven Dake

Reviewed-by: Steven Dake

On 08/05/2011 12:09 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  lib/cpg.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/lib/cpg.c b/lib/cpg.c
> index caf3efe..71704c0 100644
> --- a/lib/cpg.c
> +++ b/lib/cpg.c
> @@ -777,7 +777,7 @@ cs_error_t cpg_zcb_alloc (
>   void **buffer)
>  {
>   void *buf = NULL;
> - char path[128];
> + char path[PATH_MAX];
>   mar_req_coroipcc_zc_alloc_t req_coroipcc_zc_alloc;
>   struct qb_ipc_response_header res_coroipcs_zc_alloc;
>   size_t map_size;


Re: [Openais] [PATCH 1/6] Remove scheduling

2011-08-05 Thread Steven Dake

I believe a better approach would be to default to standard scheduling
and add a new flag "--realtime" which enables realtime scheduling.
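
A minimal sketch of that option handling, assuming getopt_long (a GNU
extension, also available on the BSDs) is acceptable.  The helper name
parse_realtime_flag is hypothetical; the short options f, p and v are
the ones main() already accepts.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if --realtime was given, 0 if it was not
 * (standard scheduling stays the default), or -1 on an unknown option.
 */
static int parse_realtime_flag(int argc, char **argv)
{
	static const struct option long_opts[] = {
		{ "realtime", no_argument, NULL, 'r' },
		{ NULL, 0, NULL, 0 }
	};
	int realtime = 0;	/* default: do not request SCHED_RR */
	int ch;

	assert(argv != NULL);
	optind = 1;		/* restart scanning so the helper can be called again */
	while ((ch = getopt_long(argc, argv, "fpv", long_opts, NULL)) != -1) {
		switch (ch) {
		case 'r':
			realtime = 1;
			break;
		case 'f':
		case 'p':
		case 'v':
			/* existing short options, not handled in this sketch */
			break;
		default:
			return -1;
		}
	}
	return realtime;
}
```

The caller would then set the realtime scheduler only when the helper
returns 1, leaving a plain "corosync" invocation on standard scheduling.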

Regards
-steve

On 08/05/2011 12:09 AM, Angus Salkeld wrote:
> Signed-off-by: Angus Salkeld 
> ---
>  exec/main.c |   55 +--
>  1 files changed, 1 insertions(+), 54 deletions(-)
> 
> diff --git a/exec/main.c b/exec/main.c
> index b03d33e..006f846 100644
> --- a/exec/main.c
> +++ b/exec/main.c
> @@ -144,8 +144,6 @@ LOGSYS_DECLARE_SUBSYS ("MAIN");
>  
>  #define SERVER_BACKLOG 5
>  
> -static int sched_priority = 0;
> -
>  static unsigned int service_count = 32;
>  
>  static struct totem_logging_configuration totem_logging_configuration;
> @@ -972,46 +970,6 @@ void message_source_set (
>   source->conn = conn;
>  }
>  
> -static void corosync_setscheduler (void)
> -{
> -#if defined(HAVE_PTHREAD_SETSCHEDPARAM) && 
> defined(HAVE_SCHED_GET_PRIORITY_MAX) && defined(HAVE_SCHED_SETSCHEDULER)
> - int res;
> -
> - sched_priority = sched_get_priority_max (SCHED_RR);
> - if (sched_priority != -1) {
> - global_sched_param.sched_priority = sched_priority;
> - res = sched_setscheduler (0, SCHED_RR, &global_sched_param);
> - if (res == -1) {
> - LOGSYS_PERROR(errno, LOGSYS_LEVEL_WARNING,
> - "Could not set SCHED_RR at priority %d",
> - global_sched_param.sched_priority);
> -
> - global_sched_param.sched_priority = 0;
> - logsys_thread_priority_set (SCHED_OTHER, NULL, 1);
> - } else {
> -
> - /*
> -  * Turn on SCHED_RR in logsys system
> -  */
> - res = logsys_thread_priority_set (SCHED_RR, 
> &global_sched_param, 10);
> - if (res == -1) {
> - log_printf (LOGSYS_LEVEL_ERROR,
> - "Could not set logsys thread 
> priority."
> - " Can't continue because of 
> priority inversions.");
> - corosync_exit_error (AIS_DONE_LOGSETUP);
> - }
> - }
> - } else {
> - LOGSYS_PERROR (errno, LOGSYS_LEVEL_WARNING,
> - "Could not get maximum scheduler priority");
> - sched_priority = 0;
> - }
> -#else
> - log_printf(LOGSYS_LEVEL_WARNING,
> - "The Platform is missing process priority setting features.  
> Leaving at default.");
> -#endif
> -}
> -
>  static void fplay_key_change_notify_fn (
>   object_change_type_t change_type,
>   hdb_handle_t parent_object_handle,
> @@ -1203,7 +1161,7 @@ int main (int argc, char **argv, char **envp)
>   char *iface;
>   char *strtok_save_pt;
>   int res, ch;
> - int background, setprio;
> + int background;
>   struct stat stat_out;
>   char corosync_lib_dir[PATH_MAX];
>   hdb_handle_t object_runtime_handle;
> @@ -1212,7 +1170,6 @@ int main (int argc, char **argv, char **envp)
>   /* default configuration
>*/
>   background = 1;
> - setprio = 1;
>  
>   while ((ch = getopt (argc, argv, "fpv")) != EOF) {
>  
> @@ -1222,7 +1179,6 @@ int main (int argc, char **argv, char **envp)
>   logsys_config_mode_set (NULL, 
> LOGSYS_MODE_OUTPUT_STDERR|LOGSYS_MODE_THREADED|LOGSYS_MODE_FORK);
>   break;
>   case 'p':
> - setprio = 0;
>   break;
>   case 'v':
>   printf ("Corosync Cluster Engine, version 
> '%s'\n", VERSION);
> @@ -1240,15 +1196,6 @@ int main (int argc, char **argv, char **envp)
>   }
>   }
>  
> - /*
> -  * Set round robin realtime scheduling with priority 99
> -  * Lock all memory to avoid page faults which may interrupt
> -  * application healthchecking
> -  */
> - if (setprio) {
> - corosync_setscheduler ();
> - }
> -
>   corosync_mlockall ();
>  
>   log_printf (LOGSYS_LEVEL_NOTICE, "Corosync Cluster Engine ('%s'): 
> started and ready to provide service.\n", VERSION);

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] Corosync fails to start under cman

2011-08-03 Thread Steven Dake

On 08/03/2011 04:06 PM, David wrote:
> I have a 3 node RHCS cluster and prior to an VLAN change (moved the 
> cluster communications into its own VLAN) all three nodes were working.
> 
> Post VLAN migration 2 of the 3 nodes joined the cluster but a third is 
> failing when I start cman:
> 
> Starting cluster:
> Checking Network Manager... [  OK  ]
> Global setup... [  OK  ]
> Loading kernel modules...   [  OK  ]
> Mounting configfs...[  OK  ]
> Starting cman... Aug 03 22:58:26 corosync [MAIN  ] Corosync Cluster 
> Engine ('1.2.3'): started and ready to provide service.
> Aug 03 22:58:26 corosync [MAIN  ] Corosync built-in features: nss rdma
> Aug 03 22:58:26 corosync [MAIN  ] Successfully read config from 
> /etc/cluster/cluster.conf
> Aug 03 22:58:26 corosync [MAIN  ] Successfully parsed cman config
> Aug 03 22:58:26 corosync [TOTEM ] Token Timeout (10000 ms) retransmit 
> timeout (2380 ms)
> Aug 03 22:58:26 corosync [TOTEM ] token hold (1894 ms) retransmits 
> before loss (4 retrans)
> Aug 03 22:58:26 corosync [TOTEM ] join (60 ms) send_join (0 ms) 
> consensus (12000 ms) merge (200 ms)
> Aug 03 22:58:26 corosync [TOTEM ] downcheck (1000 ms) fail to recv const 
> (2500 msgs)
> Aug 03 22:58:26 corosync [TOTEM ] seqno unchanged const (30 rotations) 
> Maximum network MTU 1402
> Aug 03 22:58:26 corosync [TOTEM ] window size per rotation (50 messages) 
> maximum messages per rotation (17 messages)
> Aug 03 22:58:26 corosync [TOTEM ] missed count const (5 messages)
> Aug 03 22:58:26 corosync [TOTEM ] send threads (0 threads)
> Aug 03 22:58:26 corosync [TOTEM ] RRP token expired timeout (2380 ms)
> Aug 03 22:58:26 corosync [TOTEM ] RRP token problem counter (2000 ms)
> Aug 03 22:58:26 corosync [TOTEM ] RRP threshold (10 problem count)
> Aug 03 22:58:26 corosync [TOTEM ] RRP mode set to none.
> Aug 03 22:58:26 corosync [TOTEM ] heartbeat_failures_allowed (0)
> Aug 03 22:58:26 corosync [TOTEM ] max_network_delay (50 ms)
> Aug 03 22:58:26 corosync [TOTEM ] HeartBeat is Disabled. To enable set 
> heartbeat_failures_allowed > 0
> Aug 03 22:58:26 corosync [TOTEM ] Initializing transport (UDP/IP).
> Aug 03 22:58:26 corosync [TOTEM ] Initializing transmit/receive 
> security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Aug 03 22:58:26 corosync [IPC   ] you are using ipc api v2
> Aug 03 22:58:26 corosync [TOTEM ] Receive multicast socket recv buffer 
> size (262142 bytes).
> Aug 03 22:58:26 corosync [TOTEM ] Transmit multicast socket send buffer 
> size (262142 bytes).
> corosync: totemsrp.c:3091: memb_ring_id_create_or_load: Assertion `res 
> == sizeof (unsigned long long)' failed.
> Aug 03 22:58:26 corosync [TOTEM ] The network interface [10.50.3.70] is 
> now up.
> corosync died with signal: 6 Check cluster logs for details
> [FAILED]
> 
> 
> I haven't been able to find information that identifies the issue or how 
> to correct it.  I am hoping someone from this group may be able to shed 
> some light.
> 

This happens because the ring id file is 0 bytes.  We have fixed this
problem in later versions of corosync.  To rectify this problem, run
rm -f /var/lib/corosync/ringid*
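
Before deleting anything, it can help to confirm that an empty ring id
file really is present.  A minimal sketch, assuming only what the reply
says (a ringid* file that is 0 bytes long); the helper name
find_empty_ringid_files is made up for illustration and is not part of
corosync:

```python
from pathlib import Path


def find_empty_ringid_files(directory):
    """Return the ringid* files in `directory` that are 0 bytes long, sorted by name.

    Hypothetical helper: it only reports files, it never deletes them.
    """
    return sorted(
        p
        for p in Path(directory).glob("ringid*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling find_empty_ringid_files("/var/lib/corosync") on the failing node
should list the file that trips the assertion; the rm -f from the reply
then removes it.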

Regards
-steve
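
The derived values in the quoted TOTEM log lines (retransmit timeout
2380 ms, token hold 1894 ms, consensus 12000 ms) are consistent with a
token timeout of 10000 ms and 4 retransmits before loss.  The sketch
below reproduces that arithmetic.  The formulas are an assumption about
how corosync 1.x derives its defaults, checked here only against the
logged numbers, and hz=100 is likewise assumed:

```python
def derived_totem_timeouts(token_ms, retransmits_before_loss, hz=100):
    """Return (retransmit_ms, hold_ms, consensus_ms) for a given token timeout.

    Assumed relations (not confirmed by the thread):
      retransmit = token / (retransmits_before_loss + 0.2)
      hold       = retransmit * 0.8 - 1000 / hz
      consensus  = 1.2 * token
    Integer arithmetic is used for hold and consensus to avoid
    floating-point rounding surprises.
    """
    retransmit_ms = int(token_ms / (retransmits_before_loss + 0.2))
    hold_ms = retransmit_ms * 8 // 10 - 1000 // hz
    consensus_ms = token_ms * 12 // 10
    return retransmit_ms, hold_ms, consensus_ms
```

With token_ms=10000 and retransmits_before_loss=4 this gives
(2380, 1894, 12000), matching the three values in the log.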

> Thanks!
> David

[Openais] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

2011-08-03 Thread Steven Dake

I am extending a general invitation to the high availability communities
and other cloud community contributors to participate in a live demo I am
giving on Friday August 5th 8am PST (GMT-7).  The demo portion of the
session is 15 minutes and will come first, followed by more details of
our approach to high availability.

I will use elluminate to show the demo on my desktop machine.  To make
elluminate work, you will need icedtea-web installed on your system
which is not typically installed by default.

You will also need a conference # and bridge code.  Please contact me
offlist with your location and I'll provide you with a hopefully toll
free conference # and bridge code.

Elluminate link:
https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F

Bridge Code:  Please contact me off list with your location and I'll
respond back with dial-in information.

Re: [Openais] Corosync (version 1.23 on rhel6) crashes when packets are dropped

2011-08-02 Thread Steven Dake

On 08/02/2011 04:47 PM, Stanley, Ephrim wrote:
> Hi,
>  
> I’m evaluating the Qpid messaging broker which uses Corosync for
> clustering. As part of my cluster break tests, I ran into a problem
> where Corosync dies without producing any core files or error messages.
>  
> Is this expected ? Also, what are some best practices for testing packet
> loss with Corosync ?
>  
> Steps to reproduce :
> 
>  1. Compile Corosync 1.2.3 after enabling the #defines for packet loss
> (in totemsrp.c  line 129). I did not change the drop percentages..
> left them as is 
> 
>  
>   #define TEST_DROP_ORF_TOKEN_PERCENTAGE 30
>   #define TEST_DROP_COMMIT_TOKEN_PERCENTAGE 30
>   #define TEST_DROP_MCAST_PERCENTAGE 50
>   #define TEST_RECOVERY_MSG_COUNT 300
>  
> 
>  2. Start a qpid cluster with three nodes NODE1, NODE2, NODE3
>  3. Nodes NODE2 and NODE3 are run with the Corosync that does not drop
> packets
>  4. Start the qpid process on nodes NODE2 and NODE3
>  5. After both proceses are up, corosync-cpgtool reports the cluster
> membership correctly
>  6. On NODE1, start Corosync (that drops packets)
>  7. Corosync starts and packet drops can be observed in the Corosync log
> (I added some debug log statements)
>  8. Start a qpid process on NODE1
>  9. Now, Corosync crashes on NODE1. No core files are produced. 
> 
>  
> I have attached the output of corosync-fplay on NODE1 and a diff of the
> changes I made to totemsrp.c.
>  
>  
> Thanks, Ephrim.
>  
>  
>  

Ephrim

Could you be more specific about which version of Red Hat's build of
corosync you are using?  Redundant ring is not supported in 1.2.3 by
either upstream or Red Hat.

Looking at existing bugs that have not hit z streams yet, this may be the
issue:
https://bugzilla.redhat.com/show_bug.cgi?id=722522

To get a core file, set ulimit -c unlimited before running corosync.  A
core file would verify whether this is a known fixed problem or a new
issue.

Thanks
-steve
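
If corosync is launched from a test harness rather than an interactive
shell, the same core-file limit can be raised programmatically.  A
minimal sketch, assuming a Unix host; raise_core_limit is a made-up
helper, and it only lifts the soft limit up to the existing hard limit
(raising the hard limit itself needs privileges):

```python
import resource


def raise_core_limit():
    """Raise the soft core-file size limit to the current hard limit.

    Returns the resulting (soft, hard) pair.  Applies to the calling
    process and to children started after the call.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

One way to use it would be to pass it as preexec_fn to subprocess.Popen
when starting the daemon, so the limit is applied in the child just
before exec.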



Re: [Openais] [PATCH] corosync.conf.example: add note about host addresses in bindnetaddr

2011-08-01 Thread Steven Dake

These patches look good.

Reviewed-by: Steven Dake 

Regards
-steve

On 07/31/2011 11:56 PM, Florian Haas wrote:
> https://lists.linux-foundation.org/pipermail/openais/2011-July/016563.html
> 
> Jan Friesse pointed out that bindnetaddr should be set to a host
> address (as opposed to a network address) on hosts where multiple
> NICs live on the same subnet. Add a comment to that effect to
> the example configuration file.
> 
> Signed-off-by: Florian Haas 
> ---
>  conf/corosync.conf.example |   16 
>  1 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/conf/corosync.conf.example b/conf/corosync.conf.example
> index c849dba..ac1718f 100644
> --- a/conf/corosync.conf.example
> +++ b/conf/corosync.conf.example
> @@ -17,11 +17,19 @@ totem {
>   interface {
>  # Rings must be consecutively numbered, starting at 0.
>   ringnumber: 0
> - # This is the *network* address of the interface to
> - # bind to. This ensures that you can use identical
> - # instances of this configuration file across all your
> - # cluster nodes, without having to modify this option.
> + # This is normally the *network* address of the
> + # interface to bind to. This ensures that you can use
> + # identical instances of this configuration file
> + # across all your cluster nodes, without having to
> + # modify this option.
>   bindnetaddr: 192.168.1.0
> + # However, if you have multiple physical network
> + # interfaces configured for the same subnet, then the
> + # network address alone is not sufficient to identify
> + # the interface Corosync should bind to. In that case,
> + # configure the *host* address of the interface
> + # instead:
> + # bindnetaddr: 192.168.1.1
>   # When selecting a multicast address, consider RFC
>   # 2365 (which, among other things, specifies that
>   # 239.255.x.x addresses are left to the discretion of


Re: [Openais] Corosync Compatability

2011-07-29 Thread Steven Dake

On 07/26/2011 08:28 PM, manish.gu...@ionidea.com wrote:
> 
> Thank you Steave,
> We are currentely using corosync-1.2.1 and pacemaker 1.0.10
> Can we use the same version of pacemaker with corosync-1.4
> 

Yes, although redundant ring is not on-wire compatible, meaning you will
have to restart your cluster.

Regards
-steve

> 
> On Tue, July 26, 2011 7:12 pm, Steven Dake wrote:
>> On 07/26/2011 01:52 AM, manish.gu...@ionidea.com wrote:
>>
>>> Hi,
>>>
>>>
>>> I am facing problem with redundent Communication Channel.
>>> I am using Coroync 1.2 In this auto failback of redundent
>>> channel is not Supported. But 1.4 provide support.
>>>
>>> Corosync-1.4 id compatiable with which version of pacemaker
>>>
> 
>>>
>>>
>>
>> corosync 1.4 should work with all versions of pacemaker.  What version of
>> pm are you using?
>>
>> Regards
>> -steve
>>
>>>
>>>
>>
>>
> 
> 


Re: [Openais] corosync didn't do what I expected

2011-07-29 Thread Steven Dake

On 07/29/2011 12:36 PM, Keith Stevens wrote:
> I have the following configuration on two servers netbox1 and netbox2:
> 
> crm(live)configure# show
> node netbox1 \
>  attributes standby="off"
> node netbox2
> primitive failover-ip ocf:heartbeat:IPaddr \
>  params ip="216.105.20.43" \
>  op monitor interval="10s"
> location cli-prefer-failover-ip failover-ip \
>  rule $id="cli-prefer-rule-failover-ip" inf: #uname eq netbox1
> property $id="cib-bootstrap-options" \
>  dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>  cluster-infrastructure="openais" \
>  expected-quorum-votes="2" \
>  stonith-enabled="false"
> 
> If I put netbox1 on standby the ip address migrates to netbox2 and back 
> to netbox1 when
> I bring it back online.
> The ip address was on netbox1 when I powered down netbox2 to move it 
> into a cabinet.
> To my surprise, netbox1 lost the ip address and didn't get it back until 
> I booted netbox2.
> Apparently I have huge conceptual hole in my understanding, I expected 
> netbox1 to keep the ip address.
> Why didn't it?
> 
> Thanks,
> -Keith
> 

Keith,

Your email is better suited for the pacemaker list.

Regards
-steve

Re: [Openais] vsftype - which one?

2011-07-26 Thread Steven Dake

On 07/26/2011 04:07 AM, Proskurin Kirill wrote:
> Hello all.
> 
> I not fully understand that vsftype is really is. Could someone explain it?
> 
> I plan to make a ~50 nodes cluster with about ~50 resources via 
> pacemaker. All nodes are in out local network with 1Gbis\s NIC
> 
> What type should I chose?
> Do I need recompile corosync with something special? (eg with 
> HAVE_SMALL_MEMORY_FOOTPRINT=0 ?)
> 
> All runs on corosync-1.4.1 and pacemaker-1.1.5
> 

Don't use vsftype, i.e. set "vsftype: none".

Regards
-steve

Re: [Openais] Corosync Compatability

2011-07-26 Thread Steven Dake

On 07/26/2011 01:52 AM, manish.gu...@ionidea.com wrote:
> Hi,
> 
>   I am facing problem with redundent Communication Channel.
>   I am using Coroync 1.2 In this auto failback of redundent
>   channel is not Supported. But 1.4 provide support.
> 
>   Corosync-1.4 id compatiable with which version of pacemaker
> 
> 

corosync 1.4 should work with all versions of pacemaker.  What version
of pm are you using?

Regards
-steve

Re: [Openais] [PATCH] main: let poll really stop before totempg_finalize

2011-07-25 Thread Steven Dake

Reviewed-by: Steven Dake 

On 07/25/2011 06:23 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  exec/main.c |   24 +++-
>  1 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/exec/main.c b/exec/main.c
> index be9e118..1c4fb37 100644
> --- a/exec/main.c
> +++ b/exec/main.c
> @@ -184,6 +184,8 @@ static int32_t corosync_not_enough_fds_left = 0;
>  
>  static void serialize_unlock (void);
>  
> +static void serialize_lock (void);
> +
>  hdb_handle_t corosync_poll_handle_get (void)
>  {
>   return (corosync_poll_handle);
> @@ -211,14 +213,7 @@ static void unlink_all_completed (void)
>   serialize_unlock ();
>   api->timer_delete (corosync_stats_timer_handle);
>   poll_stop (corosync_poll_handle);
> - totempg_finalize ();
> -
> - /*
> -  * Remove pid lock file
> -  */
> - unlink (corosync_lock_file);
> -
> - corosync_exit_error (AIS_DONE_EXIT);
> + serialize_lock ();
>  }
>  
>  void corosync_shutdown_request (void)
> @@ -1887,6 +1882,17 @@ int main (int argc, char **argv, char **envp)
>*/
>   poll_run (corosync_poll_handle);
>  
> + /*
> +  * Exit was requested
> +  */
> + totempg_finalize ();
> +
> + /*
> +  * Remove pid lock file
> +  */
> + unlink (corosync_lock_file);
> +
> + corosync_exit_error (AIS_DONE_EXIT);
> +
>   return EXIT_SUCCESS;
>  }
> -


Re: [Openais] [PATCH] totemsrp: fix buffer overflows for large clusters (> 100 nodes)

2011-07-24 Thread Steven Dake

Thanks for the submission.

Reviewed-by: Steven Dake 

On 07/24/2011 02:58 AM, MORITA Kazutaka wrote:
> Signed-off-by: MORITA Kazutaka 
> ---
>  exec/totemsrp.c |6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
> index 16de74d..e34da1a 100644
> --- a/exec/totemsrp.c
> +++ b/exec/totemsrp.c
> @@ -508,7 +508,7 @@ struct totemsrp_instance {
>   
>   void * token_recv_event_handle;
>   void * token_sent_event_handle;
> - char commit_token_storage[9000];
> + char commit_token_storage[40000];
>  };
>  
>  struct message_handlers {
> @@ -2976,7 +2976,7 @@ static void memb_state_commit_token_create (
>  
>  static void memb_join_message_send (struct totemsrp_instance *instance)
>  {
> - char memb_join_data[10000];
> + char memb_join_data[40000];
>   struct memb_join *memb_join = (struct memb_join *)memb_join_data;
>   char *addr;
>   unsigned int addr_idx;
> @@ -3028,7 +3028,7 @@ static void memb_join_message_send (struct 
> totemsrp_instance *instance)
>  
>  static void memb_leave_message_send (struct totemsrp_instance *instance)
>  {
> - char memb_join_data[10000];
> + char memb_join_data[40000];
>   struct memb_join *memb_join = (struct memb_join *)memb_join_data;
>   char *addr;
>   unsigned int addr_idx;


Re: [Openais] Corosync sends unicast but should multycast

2011-07-22 Thread Steven Dake

On 07/22/2011 08:01 AM, Proskurin Kirill wrote:
> On 07/22/2011 06:46 PM, Steven Dake wrote:
>> Tokens are always sent unicast - this is how the protocol works.
> 
> Thanks for reply.
> One more thing - then and for what multycast is send?
> We make some test with network team and try to understand all
> communication logic of corosync.
> 

Read
http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBUQFjAA&url=http%3A%2F%2Fciteseer.ist.psu.edu%2Fviewdoc%2Fdownload%3Bjsessionid%3D863760AB04B004AF5DF7285D032E6595%3Fdoi%3D10.1.1.37.767%26rep%3Drep1%26type%3Dps&rct=j&q=totem%20single%20ring%20protocol&ei=Z5EpTo6sMsmbtwfe36nXAg&usg=AFQjCNFSyIM94w0Xm2VCfOGJS4kKyaMjmg

That is "The Totem Single Ring Protocol", in case the link didn't come
through.

Multicast carries the actual message data, which is transmitted to all
nodes at the same time.

Regards
-steve
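
To sort captured packets into token traffic (unicast) and message data
(multicast), it is enough to look at the destination address.  A small
sketch using the destinations from the tcpdump output quoted in this
thread; classify_destination is a made-up helper name:

```python
import ipaddress


def classify_destination(addr):
    """Return "multicast" or "unicast" for an IP destination address."""
    return "multicast" if ipaddress.ip_address(addr).is_multicast else "unicast"
```

Here 239.255.1.1 classifies as multicast, while 10.6.1.155 and
10.6.1.156 classify as unicast, which fits tokens going node to node and
data going to the group.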

Re: [Openais] Corosync sends unicast but should multycast

2011-07-22 Thread Steven Dake

Tokens are always sent unicast - this is how the protocol works.

thanks
-steve

On 07/22/2011 07:22 AM, Proskurin Kirill wrote:
> Hi all.
> 
> Found odd thing - some of my node send unicast while other send
> muiltycast and other unicast and multycast... with same configuration
> and they all work.
> 
> Sound little confusing, I know.
> 
> corosync-1.4.0
> Config attached.
> 
> I have 3 node:
> my108.i has address 10.3.1.108
> my107.i has address 10.3.1.107
> my105.i has address 10.6.1.155
> 
> I use tcpdump to look at the traffic and see thing like this:
> IP (tos 0x0, ttl  62, id 0, offset 0, flags [DF], proto: UDP (17),
> length: 98) 10.3.1.108.5404 > 10.6.1.155.5405: UDP, length 70
> IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto: UDP (17),
> length: 98) 10.6.1.155.5404 > 10.6.1.156.5405: UDP, length 70
> IP (tos 0x0, ttl  29, id 0, offset 0, flags [DF], proto: UDP (17),
> length: 110) 10.3.1.107.5404 > 239.255.1.1.5405: UDP, length 82
> IP (tos 0x0, ttl  62, id 0, offset 0, flags [DF], proto: UDP (17),
> length: 98) 10.3.1.108.5404 > 10.6.1.155.5405: UDP, length 70
> IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto: UDP (17),
> length: 98) 10.6.1.155.5404 > 10.6.1.156.5405: UDP, length 70
> 
> Node see each other and all seems to work but as I understand they
> should communicate by multycast. Or not?
> 
> 
> 

Re: [Openais] FAILED TO RECEIVE followed by cluster failure

2011-07-21 Thread Steven Dake

On 07/21/2011 12:19 PM, Jed Smith wrote:
> Steve,
> 
> Thank you again for all of the information.
> 
> I labbed an in-place upgrade and the Corosync 1.4.0 compile brought
> down the 1.2.1-4ubuntu1 box. All I did was deploy from scratch, create
> a cluster with 1.2.1-4ubuntu1 and Pacemaker 1.0.10-4ubuntu3, then
> compiled Corosync 1.4.0 and Pacemaker 1.0.11 and introduced them to
> the cluster, and Corosync disappeared with no output.
> 
> I don't mind building a new oblivious cluster and failing my resources
> over the hard way -- I did that many times, including a transition
> from Heartbeat to Corosync during development -- I'm just curious if
> there's something I'm doing that's preventing the 1.2.1 box from
> staying up. I restarted Corosync on the 1.2.1 side, and it crashed
> immediately.
> 
> Logs: http://pastie.org/private/e9ktdolkdesf3eeq5d5gnq
> 
> Again, I don't mind doing an oblivious cluster rebuild. It's not
> ideal, but it's also not a big deal -- you just mentioned that, in
> theory, 1.2.1 should talk to 1.4.0 fine.
> 

A correction is in order.  We test rolling upgrades from 1.2.latest z to
1.3.0 and 1.3.latest z to 1.4.0.  Updating from 1.2.1 may not roll
properly.

I expect rolling upgrades of redundant ring don't work well with 1.4.0
because of protocol changes to support automatic redundant ring
recovery, which hopefully nobody was using until 1.4.0 where we added it
to the list of things we really want to work well :)

Regards
-steve

[Openais] Corosync 2.0 (needle) Call for RFEs

2011-07-21 Thread Steven Dake

The Corosync flatiron 1.y series had many more features added than I
would have liked, but the development team feels the 1.y series
addresses any major gaps users of the software have had.  As a result,
we are freezing any future feature development of the flatiron branch
permanently.  We will continue to maintain the z streams (1.4.z) with bug
fixes for many years to come in a robust and aggressive fashion.

Now that the flatiron chapter of Corosync is finished, we can move on to
new r&d work around Corosync 2.0.  There are a few RFEs floating around
in bugzilla and the TODO list.  This is your chance to provide feedback
about feature development you would like to see in Corosync.

The overall theme for Corosync 2.0 is focused around trimming the fat
and simplifying the implementation without major performance regressions.

The developers will take feature submission suggestions until Aug 31, at
which point we will prioritize features for 2.0 and close feature
submission requests.

Regards
-steve

[Openais] New bugzilla method

2011-07-21 Thread Steven Dake

Hi,

We have new bugzilla tracking in place via bugzilla.redhat.com.  When
filing bugs, please file under "Community->Corosync Cluster Engine"
rather than rawhide or a specific fedora version.  If the issue is
fedora specific, continue to file under fedora.  For other distro
specific problems (such as a defect caused by a distro shipping software
older than the latest supported z stream), please file bugs with the
respective distribution's bug tracking system.

Re: [Openais] Multycast & unicast as fall back

2011-07-21 Thread Steven Dake

On 07/21/2011 06:27 AM, Proskurin Kirill wrote:
> On 07/21/2011 05:11 PM, Steven Dake wrote:
>> On 07/21/2011 02:30 AM, Proskurin Kirill wrote:
>>> Hello all.
>>>
>>> Is this possible to use multycast as primary way to communication in
>>> cluster but fall back to unicast transports if multycast is fail?
>>> Different rings with different transports?
>>>
>>> We have some problems in network switches and multycast just stop
>>> working and I start to think about this feature.
> 
>> Just use udpu entirely.  This feature is supported n 1.3.2+.
> 
> I`m on 1.4.0 now but I not wish to use unicast as production base - only
> if some problems with multycast occur.
> 
> 

There is no fallback.  You can specify one transport or the other.
Having thought for a moment about how to implement this type of feature,
I don't think it could be reasonably implemented.

What type of app are you running on top of corosync?  The advantages of
multicast are automatic growth (you don't have to know the node addresses
ahead of time) and more throughput with less cpu utilization under high
cpg message load.  The disadvantage is that multicast is generally poorly
implemented by switch vendors.

Regards
-steve

Re: [Openais] About TODO file

2011-07-21 Thread Steven Dake

On 07/21/2011 04:15 AM, Yingliang Yang wrote:
> Hi,
> I have downloaded corosync-1.4.0 package.
> There is a TODO file in the release.But it's updated in October 2010
> I would like to know is there any plan in the future.
>  
> And also, there is an option(enable_watchdog) in the configure file.
> Will this feature be released in  future version?
>  

We have a fairly concrete 2.0 plan which is called "Needle".  Most
features are described in the TODO in our master branch.

Regards
-steve

>  
> Best Regards,
> Yingliang Yang
>  
>  

Re: [Openais] Ip addr auto detection

2011-07-21 Thread Steven Dake

On 07/21/2011 02:42 AM, Proskurin Kirill wrote:
> Hello all.
> 
> In man for corosync.conf suggest to add not current IP addr of a node 
> but her network:
> 
> "For example, if the local interface is 192.168.5.92 with netmask 
> 255.255.255.0, set bindnetaddr to 192.168.5.0."
> 
> Ok - that`s cool. But If i have a bunch of alias on same network on same 
> node? How it will determine what ip to use?
> 
> Or if I have two NIC with two IP on them on the with the same network?
> 

Specify the exact ip address in this case.

Regards
-steve
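
The rule of thumb in this exchange (network address normally, exact host
address when several local addresses share the subnet) can be sketched
as follows.  The example address and netmask come from the man page text
quoted above; the function names are made up for illustration and are
not part of corosync:

```python
import ipaddress


def network_address(host_ip, netmask):
    """Return the network address for a host IP and netmask, as a string."""
    return str(ipaddress.ip_interface(f"{host_ip}/{netmask}").network.network_address)


def choose_bindnetaddr(local_interfaces, wanted_ip):
    """Pick a bindnetaddr value for the interface that owns `wanted_ip`.

    `local_interfaces` is a list of (ip, netmask) pairs configured on this
    node.  If `wanted_ip` is the only local address in its subnet, the
    network address identifies it; otherwise the exact host address is
    returned.
    """
    networks = {
        ip: ipaddress.ip_interface(f"{ip}/{mask}").network
        for ip, mask in local_interfaces
    }
    if wanted_ip not in networks:
        raise ValueError(f"{wanted_ip} is not a local address")
    wanted_net = networks[wanted_ip]
    sharing = [ip for ip, net in networks.items() if net == wanted_net]
    if len(sharing) > 1:
        return wanted_ip
    return str(wanted_net.network_address)
```

For 192.168.5.92 with netmask 255.255.255.0 as the only address in its
subnet this yields 192.168.5.0; add a second local address in the same
subnet and it yields 192.168.5.92 instead.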

Re: [Openais] Multycast & unicast as fall back

2011-07-21 Thread Steven Dake

On 07/21/2011 02:30 AM, Proskurin Kirill wrote:
> Hello all.
> 
> Is this possible to use multycast as primary way to communication in 
> cluster but fall back to unicast transports if multycast is fail? 
> Different rings with different transports?
> 
> We have some problems in network switches and multycast just stop 
> working and I start to think about this feature.
> 
> 

Just use udpu entirely.  This feature is supported in 1.3.2+.

Regards
-steve
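
For reference, a udpu setup in corosync 1.x looks roughly like the
fragment below.  This is a sketch, not a tested configuration: the
addresses are placeholders borrowed from another thread on this list,
and the layout (member blocks inside the interface section) should be
checked against corosync.conf(5) and the udpu example file shipped with
the installed version.

```
totem {
	version: 2
	# Use UDP unicast instead of multicast.
	transport: udpu
	interface {
		ringnumber: 0
		bindnetaddr: 10.3.1.0
		mcastport: 5405
		# With udpu every node has to be listed explicitly.
		member {
			memberaddr: 10.3.1.107
		}
		member {
			memberaddr: 10.3.1.108
		}
	}
}
```

The explicit member list is the trade-off for not depending on
multicast: nodes no longer join automatically, so every node's
configuration has to be updated when the cluster grows.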

Re: [Openais] [PATCH 2/3] specfile: use _datadir as var expansion not exec

2011-07-20 Thread Steven Dake

On 07/20/2011 12:48 AM, Jan Friesse wrote:
> Steven Dake wrote:
>> On 07/19/2011 08:01 AM, Jan Friesse wrote:
>>> Signed-off-by: Jan Friesse 
>>> ---
>>>  corosync.spec.in |2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/corosync.spec.in b/corosync.spec.in
>>> index 37e53ed..823ad3d 100644
>>> --- a/corosync.spec.in
>>> +++ b/corosync.spec.in
>>> @@ -138,7 +138,7 @@ fi
>>>  %{_sysconfdir}/dbus-1/system.d/corosync-signals.conf
>>>  %endif
>>>  %if %{with snmp}
>>> -%(_datadir)/snmp/mibs/COROSYNC-MIB.txt
>>> +%{_datadir}/snmp/mibs/COROSYNC-MIB.txt
>>>  %endif
>>
>> does this patch change anything?
> 
> Ya, but it's very hard to spot (especially with small/bad fonts, it took
> me a while to notice it too). It changes round brackets ( ) to curly
> bracket { }. First means "execute in shell" (we really don't want to
> execute _datadir" command) and second "expand variable value" (this is
> what we want).
> 
>>
>>>  %{_initrddir}/corosync
>>>  %{_initrddir}/corosync-notifyd
>>
> 


Reviewed-by: Steven Dake 

Re: [Openais] Add a few more stats for debugging

2011-07-19 Thread Steven Dake

On 07/18/2011 09:14 PM, Tim Beale wrote:
> Hi,
> 
> Attached is a patch that adds a few more more stats (the code was actually
> written by Angus). We find these stats useful - hopefully others will too.
> 
> Cheers,
> Tim
> 

Great work

Reviewed-by: Steven Dake 

Re: [Openais] Some messages still leaked in recovery code

2011-07-19 Thread Steven Dake

On 07/18/2011 07:55 PM, Tim Beale wrote:
> Hi,
> 
> I think there is still a slight memory-leak when recovery is entered
> repeatedly. The recovery messages usually get freed when the operational state
> is entered. However if recovery is entered several times, without entering the
> operational state, then some messages can be leaked.
> 
> Attached is a patch that fixes the problem for me. I tested it on v1.3.1, but
> the patch should apply to trunk.
> 
> Let me know if I've misunderstood anything, or if any of the patch needs 
> fixing
> up.
> 
> Cheers,
> Tim
> 
> 

Tim,

Thanks for the patch.  I have briefly looked over it, and it is a big
change.  I want to give it due review but I am swamped at least until the
end of the month.  I'll provide review then.

Thanks
-steve


Re: [Openais] [PATCH 3/3] specfile: Install corosync-signals.conf for dbus

2011-07-19 Thread Steven Dake

On 07/19/2011 08:01 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  corosync.spec.in |5 +
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/corosync.spec.in b/corosync.spec.in
> index 823ad3d..74ab851 100644
> --- a/corosync.spec.in
> +++ b/corosync.spec.in
> @@ -92,6 +92,11 @@ rm -rf %{buildroot}
>  
>  make install DESTDIR=%{buildroot}
>  
> +%if %{with dbus}
> +mkdir -p -m 0700 %{buildroot}/%{_sysconfdir}/dbus-1/system.d
> +install -m 644 %{_builddir}/%{name}-%{version}/conf/corosync-signals.conf 
> %{buildroot}/%{_sysconfdir}/dbus-1/system.d/corosync-signals.conf
> +%endif
> +
>  ## tree fixup
>  # drop static libs
>  rm -f %{buildroot}%{_libdir}/*.a

Reviewed-by: Steven Dake 

Re: [Openais] [PATCH 2/3] specfile: use _datadir as var expansion not exec

2011-07-19 Thread Steven Dake

On 07/19/2011 08:01 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  corosync.spec.in |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/corosync.spec.in b/corosync.spec.in
> index 37e53ed..823ad3d 100644
> --- a/corosync.spec.in
> +++ b/corosync.spec.in
> @@ -138,7 +138,7 @@ fi
>  %{_sysconfdir}/dbus-1/system.d/corosync-signals.conf
>  %endif
>  %if %{with snmp}
> -%(_datadir)/snmp/mibs/COROSYNC-MIB.txt
> +%{_datadir}/snmp/mibs/COROSYNC-MIB.txt
>  %endif

does this patch change anything?

>  %{_initrddir}/corosync
>  %{_initrddir}/corosync-notifyd


Re: [Openais] [PATCH 1/3] specfile: Correct URL and source0

2011-07-19 Thread Steven Dake

On 07/19/2011 08:01 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  corosync.spec.in |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/corosync.spec.in b/corosync.spec.in
> index e1dcf19..37e53ed 100644
> --- a/corosync.spec.in
> +++ b/corosync.spec.in
> @@ -18,8 +18,8 @@ Version: @version@
>  Release: 1%{?numcomm:.%{numcomm}}%{?alphatag:.%{alphatag}}%{?dirty:.%{dirty}}%{?dist}
>  License: BSD
>  Group: System Environment/Base
> -URL: http://www.openais.org
> -Source0: http://developer.osdl.org/dev/openais/downloads/%{name}-%{version}/%{name}-%{version}%{?numcomm:.%{numcomm}}%{?alphatag:-%{alphatag}}%{?dirty:-%{dirty}}.tar.gz
> +URL: http://ftp.corosync.org
> +Source0: ftp://ftp:u...@ftp.corosync.org/downloads/%{name}-%{version}/%{name}-%{version}%{?numcomm:.%{numcomm}}%{?alphatag:-%{alphatag}}%{?dirty:-%{dirty}}.tar.gz
>  
>  # Runtime bits
>  Requires: corosynclib = %{version}-%{release}

Reviewed-by: Steven Dake 

Re: [Openais] Multi State active resource each instance start\Stop

2011-07-19 Thread Steven Dake

On 07/19/2011 03:21 AM, manish.gu...@ionidea.com wrote:
> Hi,
> 
> I have configured a multi-state (clone) floating IP resource named IP.
> It is running on all the configured nodes.
> 
> I am trying to stop it using the crm_resource command:
> 
> crm_resource -r IP:0 -p target-role -v stopped
> 
> I am getting this error:
> 
> Error performing operation : The object/attribute does not exist.
> 
> Can anybody help me? How can I stop a single instance using
> any command?
> 
> If I manually bring down a single instance on one node and then clean up
> that instance, it comes back up, meaning it starts again.
> 
>  ifconfig eth0:1 down
>  crm_resource -C -r IP:0 -H NodeName
> 
>  It is working properly.
> 
>  Cluster stack
>  corosync-1.2
>  pacemaker-1.10
> 
> 

Wrong mailing list.  Try the pacemaker mailing list.

> 
> Regards
> Manish
> 
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] FAILED TO RECEIVE followed by cluster failure

2011-07-18 Thread Steven Dake

On 07/18/2011 07:55 PM, Keisuke MORI wrote:
> Hi,
> 
> 2011/7/19 Steven Dake :
>> On 07/18/2011 10:38 AM, Jed Smith wrote:
>>> Thank you for your reply.
>>>
>>> On Mon, Jul 18, 2011 at 1:18 PM, Digimer  wrote:
>>>> Is it possible that the switch dropped the multicast group, and didn't
>>>> reform it fast enough to prevent the cluster from partitioning?
>>>
>>> Our network guy says that the switches do not look at multicast
>>> traffic, they merely broadcast it in our environment.
>>>
>>
>> Unlikely.  I expect what is happening is that your switch is delaying
>> multicast packets relative to the unicast token.  This causes
>> retransmits.  There is a bug in older versions of our totem
>> implementation that increases the fail to recv counter incorrectly.  In
>> newer versions we have worked around this flaw in the original totem
>> specification (which expects that multicast can be flushed before a token
>> receipt, which is an invalid assertion).
>>
>> My recommendation is to update to the 1.3 or 1.4 series.  Both of
>> these have very tight maintenance rules around what goes in (i.e., it's not
>> tip development work).
>>
>> Once you have a version that doesn't have known bugs, I'd recommend
>> increasing fail_recv_const to some large value, such as 5000.  See:
>>
>> http://www.mail-archive.com/openais@lists.linux-foundation.org/msg05924.html
> 
> We discovered that the issue in that report was caused by a misbehavior
> of the IGMP snooping feature in the bridge interface:
> http://www.spinics.net/lists/netdev/msg166960.html
> 
> Because of this, the bridge interface sometimes fails to handle IGMP
> packets properly, and multicast traffic may not be forwarded for a while
> even though unicast traffic goes through fine, which confuses corosync.
> 
> RHEL6.0 at least is affected, but RHEL5 is not, because the RHEL5 kernel
> does not implement IGMP snooping yet.
> 
> 
> You can work around it by either:
> 1) disabling the IGMP snooping feature
>  ex. echo 0 > /sys/class/net/br0/bridge/multicast_snooping
> 2) not using a bridge interface for corosync multicast traffic
> 
> 
> When we encountered this issue, we had assigned a multicast address to
> a bridge interface on top of a bonding interface.
> Assigning the IP address to the bonding interface instead solved it.
> Increasing fail_recv_const did not actually solve it; it just
> delayed the occurrence.
> 
> Hope it helps.
> 

Thanks for the report.  I believe our workarounds for delayed multicast
packets will mask that kernel oddness, but can't guarantee it.  I'm
certain someone will find that information of value.

Regards
-steve

Re: [Openais] FAILED TO RECEIVE followed by cluster failure

2011-07-18 Thread Steven Dake

On 07/18/2011 04:21 PM, Jed Smith wrote:
> Steven,
> 
> Thank you very much for the reply and information.
> 
> On Mon, Jul 18, 2011 at 6:58 PM, Steven Dake  wrote:
>> My recommendation is to update to the 1.3 or 1.4 series.  Both of
>> these have very tight maintenance rules around what goes in (i.e., it's not
>> tip development work).
> 
> I will indeed. Can I upgrade in place in the same cluster, or will 1.4
> not talk to 1.2 clusters? I apologize if this information is readily
> available.
> 

Any 1.y version will talk to any other 1.y version (i.e., 1.0 will
talk to 1.4).

Regards
-steve

>> Once you have a version that doesn't have known bugs, I'd recommend
>> increasing fail_recv_const to some large value, such as 5000.  See:
>>
>> http://www.mail-archive.com/openais@lists.linux-foundation.org/msg05924.html
> 
> I will also do this. Thank you for the advice.
> 
>> It would be nice if the Debian maintainers would update their packages
>> to the latest upstream.
> 
> I agree, and I'm running the absolute latest Ubuntu non-LTS for this reason.
> 


Re: [Openais] FAILED TO RECEIVE followed by cluster failure

2011-07-18 Thread Steven Dake

On 07/18/2011 10:38 AM, Jed Smith wrote:
> Thank you for your reply.
> 
> On Mon, Jul 18, 2011 at 1:18 PM, Digimer  wrote:
>> Is it possible that the switch dropped the multicast group, and didn't
>> reform it fast enough to prevent the cluster from partitioning?
> 
> Our network guy says that the switches do not look at multicast
> traffic, they merely broadcast it in our environment.
> 

Unlikely.  I expect what is happening is that your switch is delaying
multicast packets relative to the unicast token.  This causes
retransmits.  There is a bug in older versions of our totem
implementation that increases the fail to recv counter incorrectly.  In
newer versions we have worked around this flaw in the original totem
specification (which expects that multicast can be flushed before a token
receipt, which is an invalid assertion).

My recommendation is to update to the 1.3 or 1.4 series.  Both of
these have very tight maintenance rules around what goes in (i.e., it's not
tip development work).

Once you have a version that doesn't have known bugs, I'd recommend
increasing fail_recv_const to some large value, such as 5000.  See:

http://www.mail-archive.com/openais@lists.linux-foundation.org/msg05924.html

It would be nice if the Debian maintainers would update their packages
to the latest upstream.  We release z streams for a reason (usually the
reason being that someone has had a field failure resulting in a complete
cluster outage).  Y stream releases are a bit more liberal in terms of
additional features.

File a bug with your distro and ask them to use an upstream release
that is recent and supported upstream (1.2.y upstream support fell off
once we released 1.4.y; we support 2 y streams).

Thanks
-steve

> Thanks,
> 


Re: [Openais] Announcing Corosync 1.4.0

2011-07-18 Thread Steven Dake

On 07/18/2011 07:37 AM, Jan Friesse wrote:
> Corosync 1.4.0 is available for immediate download from our website.
> This version brings many enhancements to the software but most visible 
> change is redundant ring auto recovery functionality.
> 
> Please retrieve the latest sources from our website:
> 
>  http://www.corosync.org
> 
> Regards
>Honza
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais

The other nice feature we have spent a lot of time on is SNMP support and
integration with foghorn (a DBUS to SNMP connector).

Regards
-steve

Re: [Openais] Announcing Corosync 1.4.0

2011-07-18 Thread Steven Dake

On 07/18/2011 08:29 AM, Digimer wrote:
> On 07/18/2011 10:37 AM, Jan Friesse wrote:
>> Corosync 1.4.0 is available for immediate download from our website.
>> This version brings many enhancements to the software but most visible 
>> change is redundant ring auto recovery functionality.
>>
>> Please retrieve the latest sources from our website:
>>
>>  http://www.corosync.org
>>
>> Regards
>>Honza
> 
> This is a question I think I already know the answer to, but what the
> heck, I'll ask anyway.
> 
> Will the RRP recovery feature be back-ported to EL5? Having this option
> on existing RHCS2 clusters would be fantastic!
> 

We are providing bug fixes for RHEL5 only - no new feature development.

Re: [Openais] [PATCH] rrp: handle rollover in active rrp properly

2011-07-15 Thread Steven Dake

Reviewed-by: Steven Dake 

On 07/15/2011 09:31 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  exec/totemrrp.c |   24 +++-
>  1 files changed, 23 insertions(+), 1 deletions(-)
> 
> diff --git a/exec/totemrrp.c b/exec/totemrrp.c
> index 6fb5772..eb9b788 100644
> --- a/exec/totemrrp.c
> +++ b/exec/totemrrp.c
> @@ -468,6 +468,22 @@ static void active_timer_problem_decrementer_cancel (
>  
>  #define ENDIAN_LOCAL 0xff22
>  
> +/*
> + * Rollover handling:
> + *
> + * ARR_SEQNO_START_TOKEN is the starting sequence number of last seen sequence
> + * for a token for active redundand ring.  This should remain zero, unless testing
> + * overflow in which case 07f00 or 0xff00 are good starting values.
> + * It should be same as on defined in totemsrp.c
> + */
> +
> +#define ARR_SEQNO_START_TOKEN 0x0
> +
> +/*
> + * These can be used ot test different rollover points
> + * #define ARR_SEQNO_START_MSG 0xfe00
> + */
> +
>  struct message_header {
>   char type;
>   char encapsulated;
> @@ -1154,6 +1170,8 @@ void *active_instance_initialize (
>  
>   instance->rrp_instance = rrp_instance;
>  
> + instance->last_token_seq = ARR_SEQNO_START_TOKEN - 1;
> +
>  error_exit:
>   return ((void *)instance);
>  }
> @@ -1342,7 +1360,7 @@ static void active_token_recv (
>   struct active_instance *active_instance = (struct active_instance 
> *)rrp_instance->rrp_algo_instance;
>  
>   active_instance->totemrrp_context = context;
> - if (token_seq > active_instance->last_token_seq) {
> + if (sq_lt_compare (active_instance->last_token_seq, token_seq)) {
>   memcpy (active_instance->token, msg, msg_len);
>   active_instance->token_len = msg_len;
>   for (i = 0; i < rrp_instance->interface_count; i++) {
> @@ -1353,6 +1371,10 @@ static void active_token_recv (
>   active_timer_expired_token_start (active_instance);
>   }
>  
> + /*
> +  * This doesn't follow spec because the spec assumes we will know
> +  * when token resets occur.
> +  */
>   active_instance->last_token_seq = token_seq;
>  
>   if (token_seq == active_instance->last_token_seq) {


Re: [Openais] [PATCH] totemconfig: Change default FAIL_TO_RECV_CONST

2011-07-15 Thread Steven Dake

Reviewed-by: Steven Dake 

On 07/15/2011 09:21 AM, Jan Friesse wrote:
> Previous default (50) was too low for most modern switch hardware. This
> may trigger abort because the aru doesn't increase for 50 token
> rotations combined with a defect in how failed to recv conditions are
> handled.  By increasing this tunable, the condition should no longer
> trigger the errant code.
> 
> Signed-off-by: Jan Friesse 
> ---
>  exec/totemconfig.c  |2 +-
>  man/corosync.conf.5 |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/exec/totemconfig.c b/exec/totemconfig.c
> index 5135672..80ca182 100644
> --- a/exec/totemconfig.c
> +++ b/exec/totemconfig.c
> @@ -73,7 +73,7 @@
>  #define JOIN_TIMEOUT 50
>  #define MERGE_TIMEOUT200
>  #define DOWNCHECK_TIMEOUT1000
> -#define FAIL_TO_RECV_CONST   50
> +#define FAIL_TO_RECV_CONST   2500
>  #define  SEQNO_UNCHANGED_CONST   30
>  #define MINIMUM_TIMEOUT  (int)(1000/HZ)*3
>  #define MAX_NETWORK_DELAY50
> diff --git a/man/corosync.conf.5 b/man/corosync.conf.5
> index d092064..3f8e90e 100644
> --- a/man/corosync.conf.5
> +++ b/man/corosync.conf.5
> @@ -380,7 +380,7 @@ This constant specifies how many rotations of the token 
> without receiving any
>  of the messages when messages should be received may occur before a new
>  configuration is formed.
>  
> -The default is 50 failures to receive a message.
> +The default is 2500 failures to receive a message.
>  
>  .TP
>  seqno_unchanged_const


Re: [Openais] [PATCH 2/2] rrp: Handle rollower in passive rrp properly

2011-07-15 Thread Steven Dake

Great work

Reviewed-by: Steven Dake 

On 07/15/2011 06:31 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  exec/totemrrp.c |  175 
> +++
>  1 files changed, 112 insertions(+), 63 deletions(-)
> 
> diff --git a/exec/totemrrp.c b/exec/totemrrp.c
> index 0445be2..6bfacd9 100644
> --- a/exec/totemrrp.c
> +++ b/exec/totemrrp.c
> @@ -335,6 +335,11 @@ static void passive_mcast_flush_send (
>   const void *msg,
>   unsigned int msg_len);
>  
> +static void passive_monitor (
> + struct totemrrp_instance *rrp_instance,
> + unsigned int iface_no,
> + int is_token_recv_count);
> +
>  static void passive_token_recv (
>   struct totemrrp_instance *instance,
>   unsigned int iface_no,
> @@ -484,6 +489,14 @@ static void active_timer_problem_decrementer_cancel (
>   * #define ARR_SEQNO_START_MSG 0xfe00
>   */
>  
> +/*
> + * Threshold value when recv_count for passive rrp should be adjusted.
> + * Set this value to some smaller for testing of adjusting proper
> + * functionality. Also keep in mind that this value must be smaller
> + * then rrp_problem_count_threshold
> + */
> +#define PASSIVE_RECV_COUNT_THRESHOLD (INT_MAX / 2)
> +
>  struct message_header {
>   char type;
>   char encapsulated;
> @@ -841,50 +854,92 @@ static void passive_timer_problem_decrementer_cancel (
>  }
>  */
>  
> -
> -static void passive_mcast_recv (
> +/*
> + * Monitor function implementation from rrp paper.
> + * rrp_instance is passive rrp instance, iface_no is interface with received 
> messgae/token and
> + * is_token_recv_count is boolean variable which donates if message is token 
> (>1) or regular
> + * message (= 0)
> + */
> +static void passive_monitor (
>   struct totemrrp_instance *rrp_instance,
>   unsigned int iface_no,
> - void *context,
> - const void *msg,
> - unsigned int msg_len)
> + int is_token_recv_count)
>  {
>   struct passive_instance *passive_instance = (struct passive_instance 
> *)rrp_instance->rrp_algo_instance;
> + unsigned int *recv_count;
>   unsigned int max;
>   unsigned int i;
> -
> - rrp_instance->totemrrp_deliver_fn (
> - context,
> - msg,
> - msg_len);
> -
> - if (rrp_instance->totemrrp_msgs_missing() == 0 &&
> - passive_instance->timer_expired_token) {
> - /*
> -  * Delivers the last token
> -  */
> - rrp_instance->totemrrp_deliver_fn (
> - passive_instance->totemrrp_context,
> - passive_instance->token,
> - passive_instance->token_len);
> - passive_timer_expired_token_cancel (passive_instance);
> - }
> + unsigned int min_all, min_active;
>  
>   /*
>* Monitor for failures
> -  * TODO doesn't handle wrap-around of the mcast recv count
>*/
> - passive_instance->mcast_recv_count[iface_no] += 1;
> + if (is_token_recv_count) {
> + recv_count = passive_instance->token_recv_count;
> + } else {
> + recv_count = passive_instance->mcast_recv_count;
> + }
> +
> + recv_count[iface_no] += 1;
> +
>   max = 0;
>   for (i = 0; i < rrp_instance->interface_count; i++) {
> - if (max < passive_instance->mcast_recv_count[i]) {
> - max = passive_instance->mcast_recv_count[i];
> + if (max < recv_count[i]) {
> + max = recv_count[i];
> + }
> + }
> +
> + /*
> +  * Max is larger then threshold -> start adjusting process
> +  */
> + if (max > PASSIVE_RECV_COUNT_THRESHOLD) {
> + min_all = min_active = recv_count[iface_no];
> +
> + for (i = 0; i < rrp_instance->interface_count; i++) {
> + if (recv_count[i] < min_all) {
> + min_all = recv_count[i];
> + }
> +
> + if (passive_instance->faulty[i] == 0 &&
> + recv_count[i] < min_active) {
> + min_active = recv_count[i];
> + }
> + }
> +
> + if (min_all > 0) {
> + /*
> +  * There is one or more faulty device with recv_count > 0
> +  */
> + for (i = 0; i < rrp_instance->interface_count; i++) {
> +

Re: [Openais] [PATCH] totemconfig: Change default FAIL_TO_RECV_CONST

2011-07-15 Thread Steven Dake

manpage needs changing too

regards
-steve

On 07/15/2011 08:13 AM, Jan Friesse wrote:
> Previous default (50) was too low for most modern switch hardware. This
> may trigger abort because the aru doesn't increase for 50 token
> rotations combined with a defect in how failed to recv conditions are
> handled.  By increasing this tunable, the condition should no longer
> trigger the errant code.
> 
> Signed-off-by: Jan Friesse 
> ---
>  exec/totemconfig.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/exec/totemconfig.c b/exec/totemconfig.c
> index 5135672..80ca182 100644
> --- a/exec/totemconfig.c
> +++ b/exec/totemconfig.c
> @@ -73,7 +73,7 @@
>  #define JOIN_TIMEOUT 50
>  #define MERGE_TIMEOUT200
>  #define DOWNCHECK_TIMEOUT1000
> -#define FAIL_TO_RECV_CONST   50
> +#define FAIL_TO_RECV_CONST   2500
>  #define  SEQNO_UNCHANGED_CONST   30
>  #define MINIMUM_TIMEOUT  (int)(1000/HZ)*3
>  #define MAX_NETWORK_DELAY50


[Openais] [PATCH] Fix problem where corosync will segfault if there are gaps in recovery queue

2011-07-15 Thread Steven Dake

Fixes a problem where there are gaps in the recovery queue.  For example, my_aru = 5,
but there are messages at 7,8.  8 = my_high_seq_received, which results
in data slots being taken up in the new message queue.  What should really happen
is that these last messages should be delivered after a transitional
configuration to maintain SAFE agreement.  We don't have support for
SAFE atm, so it is probably safe just to throw these messages away.  Without
this change, the new message queue on a new configuration change is out of sync.

Signed-off-by: Steven Dake 
Tested-by: Tim Beale 
---
 exec/totemsrp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 3dcc05e..16de74d 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1809,7 +1809,7 @@ static void memb_state_operational_enter (struct 
totemsrp_instance *instance)
sizeof (struct srp_addr) * instance->my_memb_entries);
 
instance->my_failed_list_entries = 0;
-   instance->my_high_delivered = instance->my_aru;
+   instance->my_high_delivered = instance->my_high_seq_received;
 
for (i = 0; i <= instance->my_high_delivered; i++) {
void *ptr;
-- 
1.7.4.4


Re: [Openais] [TOTEM ] Process pause detected for XXX ms, flushing membership messages.

2011-07-08 Thread Steven Dake

On 07/08/2011 02:03 AM, Vladislav Bogdanov wrote:
>>> I checked the archives and found a patch from some time ago that was
>>> never merged.  It wasn't verified to resolve the "pause timeout" problem
>>> but it could indeed solve the problem.  It wasn't merged because we
>>> lacked verification it resolved the problem.
>>
>> Great, I'll try it in next few days, good news is that problem should be
>> easily reproducible.
> 
> Hmm...
> Not so easily...
> 
> I applied that patch to all physical hosts, and do not see that message
> any more for two days, independently of number of RX buffers in adapter.
> 
> But, I do not see it if I downgrade to previous image (without that
> patch) :( Although I did not test it again for a long time, only several
> hours.
> 
> I didn't apply patch to VM, and do not see that message either.
> What I did also:
> * Rescheduled VM to higher CPU priority (actually real-time)
> * Assigned higher blkio priority to that VM
> * Assigned low blkio priority to bulk resources on node where that VM runs.
> So, original problem seems to have different causes for bare-metal and
> VM cases.
> 
> For former case patch seems to be helpful.
> It should help for VM case too.
> 
> There were lots of '[TOTEM ] Retransmit List:' messages on bare-metal
> hosts until I returned eth RX ring size back to 256 buffers (from 4096).
> After some thinking, this is probably correct, because more buffers add
> some latency, which is bad for corosync. Not sure why that may affect
> NAPI polling rate although.
> 
> I'll try to upgrade igb driver (newer version has tuning param
> InterruptThrottleRate) and play again with ring buffers and that rate.
> 
> Again, that driver version I currently have may have some bugs when
> operating with big buffer rings which lead to 500ms blocking under high
> load.
> 
> BTW, are those "Retransmit List:" messages harmful?
> 

These are only warning messages.  They result in a duplicate message being
retransmitted that may not have needed to be.  We are working to sort out how
to remove these in some hardware environments.

Regards
-steve

> 
> Best,
> Vladislav
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais


[Openais] [PATCH] take 3 Speculatory patch that may correct tlbe...@gmail.com's reported problem

2011-07-07 Thread Steven Dake

May not work at all or correct problem - would appreciate feedback

Signed-off-by: Steven Dake 
---
 exec/totemsrp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 3dcc05e..16de74d 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1809,7 +1809,7 @@ static void memb_state_operational_enter (struct 
totemsrp_instance *instance)
sizeof (struct srp_addr) * instance->my_memb_entries);
 
instance->my_failed_list_entries = 0;
-   instance->my_high_delivered = instance->my_aru;
+   instance->my_high_delivered = instance->my_high_seq_received;
 
for (i = 0; i <= instance->my_high_delivered; i++) {
void *ptr;
-- 
1.7.4.4


Re: [Openais] Question about recovery code

2011-07-07 Thread Steven Dake

On 07/07/2011 03:07 PM, Tim Beale wrote:
> Hi Steve,
> 
> Thanks for your help. When we upgraded to v1.3.1 we picked up commit
> 8603ff6e9a270ecec194f4e13780927ebeb9f5b2:
>>> totemsrp: free messages originated in recovery rather then rely on 
>>> messages_free
> 
> Which is why I was retesting this issue. But I still see the problem even with
> the above change.
> 
> The recovery code seems to work most of the time. But occasionally it doesn't
> free all of the recovery messages on the queue. It seems there are recovery
> messages left with seq numbers higher than instance->my_high_delivered/
> instance->my_aru.
> 
> In the last crash I saw there were 12 messages on the recovery queue but only
> 5 of them got freed by the above patch/code. I think usually a node leave 
> event
> seems to occur at the same time.
> 

I speculate there are gaps in the recovery queue.  For example, my_aru = 5,
but there are messages at 7,8.  8 = my_high_seq_received, which results
in data slots being taken up in the new message queue.  What should really happen
is that these last messages should be delivered after a transitional
configuration to maintain SAFE agreement.  We don't have support for
SAFE atm, so it is probably safe just to throw these messages away.
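
The gap scenario can be modelled in a few lines.  This is a toy model, not
corosync code: the slot array, release_through(), slots_in_use() and
fill_example() are hypothetical stand-ins for the sort queue and the release
loop in memb_state_operational_enter(), and it assumes messages 0 through 5
are all present below the gap.

```c
#define QUEUE_SLOTS 16

/* Toy stand-in for a sort queue: in_use[seq] is 1 when a message holds the slot. */
struct toy_queue {
	int in_use[QUEUE_SLOTS];
};

/* Release every slot from 0 through high_delivered, like the loop in the patch. */
void release_through (struct toy_queue *q, unsigned int high_delivered)
{
	unsigned int i;

	for (i = 0; i <= high_delivered && i < QUEUE_SLOTS; i++) {
		q->in_use[i] = 0;
	}
}

/* Count the slots that are still occupied. */
unsigned int slots_in_use (const struct toy_queue *q)
{
	unsigned int i;
	unsigned int count = 0;

	for (i = 0; i < QUEUE_SLOTS; i++) {
		count += (q->in_use[i] != 0);
	}
	return (count);
}

/* Build the example: messages 0..5 contiguous, a gap at 6, then 7 and 8. */
void fill_example (struct toy_queue *q)
{
	unsigned int i;

	for (i = 0; i < QUEUE_SLOTS; i++) {
		q->in_use[i] = (i <= 5 || i == 7 || i == 8);
	}
}
```

In this model, releasing through my_aru (5) leaves the two messages at 7 and
8 behind, while releasing through my_high_seq_received (8) clears them.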

Could you test my speculatory patch against your test case?

Thanks!
-steve

> I can reproduce the problem reasonably reliably in a 2-node cluster with:
> #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
> #define TEST_DROP_MCAST_PERCENTAGE 20
> But I suspect it's reliant on timing/messaging specific to my system. Let me
> know if there's any debug or anything you want me to try out.
> 
> Thanks,
> Tim
> 
> On Thu, Jul 7, 2011 at 3:47 PM, Steven Dake  wrote:
>> On 07/06/2011 05:24 PM, Tim Beale wrote:
>>> Hi,
>>>
>>> We've hit a problem in the recovery code and I'm struggling to understand 
>>> why
>>> we do the following:
>>>
>>>   /*
>>>* The recovery sort queue now becomes the regular
>>>* sort queue.  It is necessary to copy the state
>>>* into the regular sort queue.
>>>*/
>>>   sq_copy (&instance->regular_sort_queue, 
>>> &instance->recovery_sort_queue);
>>>
>>> The problem we're seeing is sometimes we get an encapsulated message from 
>>> the
>>> recovery queue copied onto the regular queue, and corosync then crashes 
>>> trying
>>> to process the message. (When it strips off the totemsrp header it gets 
>>> another
>>> totemsrp header rather than the totempg header it expects).
>>>
>>> The problem seems to happen when we only do the sq_items_release() for a 
>>> subset
>>> of the recovery messages, e.g. there are 12 messages on the recovery queue 
>>> and
>>> we only free/release 5 of them. The remaining encapsulated recovery messages
>>> get left on the regular queue and corosync crashes trying to deliver them.
>>>
>>> It looks to me like deliver_messages_from_recovery_to_regular() handles the
>>> encapsulation correctly, stripping the extra header and adding the recovery
>>> messages to the regular queue. But then the sq_copy() just seems to 
>>> overwrite
>>> the regular queue.
>>>
>>> We've avoided the crash in the past by just reiniting both queues, but I 
>>> don't
>>> think this is the best solution.
>>>
>>
>> I would expect this solution would lead to message loss or lockup of the
>> protocol.
>>
>>> Any advice would be appreciated.
>>>
>>> Thanks,
>>> Tim
>>
>> A proper fix should be in commit
>> master:
>> 7d5e588931e4393c06790995a995ea69e6724c54
>> flatiron-1.3:
>> 8603ff6e9a270ecec194f4e13780927ebeb9f5b2
>>
>> A new flatiron-1.3 release is in the works.  There are other totem bugs
>> you may wish to backport in the meantime.
>>
>> Let us know if that commit fixes the problem you encountered.
>>
>> Regards
>> -steve
>>
>>> ___
>>> Openais mailing list
>>> Openais@lists.linux-foundation.org
>>> https://lists.linux-foundation.org/mailman/listinfo/openais
>>
>>


[Openais] [PATCH] take 2 Speculatory patch that may correct tlbe...@gmail.com's reported problem

2011-07-07 Thread Steven Dake

May not work at all or correct problem - would appreciate feedback

Signed-off-by: Steven Dake 
---
 exec/totemsrp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 3dcc05e..5a3bfaa 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1809,7 +1809,7 @@ static void memb_state_operational_enter (struct 
totemsrp_instance *instance)
sizeof (struct srp_addr) * instance->my_memb_entries);
 
instance->my_failed_list_entries = 0;
-   instance->my_high_delivered = instance->my_aru;
+   instance->my_high_delivered = instance->my_high_received;
 
for (i = 0; i <= instance->my_high_delivered; i++) {
void *ptr;
-- 
1.7.4.4


[Openais] [PATCH] Speculatory patch that may correct tlbe...@gmail.com's reported problem

2011-07-07 Thread Steven Dake

May not work at all or correct problem - would appreciate feedback

Signed-off-by: Steven Dake 
---
 exec/totemsrp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 3dcc05e..5a3bfaa 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1809,7 +1809,7 @@ static void memb_state_operational_enter (struct 
totemsrp_instance *instance)
sizeof (struct srp_addr) * instance->my_memb_entries);
 
instance->my_failed_list_entries = 0;
-   instance->my_high_delivered = instance->my_aru;
+   instance->my_high_delivered = instance->my_high_received;
 
for (i = 0; i <= instance->my_high_delivered; i++) {
void *ptr;
-- 
1.7.4.4


Re: [Openais] [PATCH] totemiba: free send_buf on ibv_reg_mr failure

2011-07-07 Thread Steven Dake

On 07/07/2011 02:06 AM, Jan Friesse wrote:
> Signed-off-by: Jan Friesse 
> ---
>  exec/totemiba.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/exec/totemiba.c b/exec/totemiba.c
> index ec4ccfc..0b2d2ca 100644
> --- a/exec/totemiba.c
> +++ b/exec/totemiba.c
> @@ -271,6 +271,7 @@ static inline struct send_buf *mcast_send_buf_get (
>   2048, IBV_ACCESS_LOCAL_WRITE);
>   if (send_buf->mr == NULL) {
>   log_printf (LOGSYS_LEVEL_ERROR, "couldn't register memory 
> range\n");
> + free (send_buf);
>   return (NULL);
>   }
>   list_init (&send_buf->list_all);
> @@ -307,6 +308,7 @@ static inline struct send_buf *token_send_buf_get (
>   2048, IBV_ACCESS_LOCAL_WRITE);
>   if (send_buf->mr == NULL) {
>   log_printf (LOGSYS_LEVEL_ERROR, "couldn't register memory 
> range\n");
> +     free (send_buf);
>   return (NULL);
>   }
>   list_init (&send_buf->list_all);

Reviewed-by: Steven Dake 

Thanks!
-steve
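
For readers skimming the archive, the pattern being fixed is the usual one:
a buffer is allocated, a second step fails, and the early return forgets the
first allocation.  The sketch below reproduces it without libibverbs.
register_region() is a made-up stand-in for ibv_reg_mr(), send_buf_get() is
a simplified hypothetical version of the two *_send_buf_get functions, and
the live_bufs counter exists only so the example can be checked.

```c
#include <stdlib.h>

static int live_bufs;	/* send_buf allocations not yet freed (example only) */

struct send_buf {
	char buffer[2048];
	void *mr;	/* registration handle; NULL when registration failed */
};

/* Made-up stand-in for ibv_reg_mr(): fails when should_fail is non-zero. */
static void *register_region (struct send_buf *buf, int should_fail)
{
	return (should_fail ? NULL : (void *)buf->buffer);
}

/* Returns a ready send_buf, or NULL with nothing leaked. */
struct send_buf *send_buf_get (int should_fail)
{
	struct send_buf *send_buf = malloc (sizeof (struct send_buf));

	if (send_buf == NULL) {
		return (NULL);
	}
	live_bufs += 1;

	send_buf->mr = register_region (send_buf, should_fail);
	if (send_buf->mr == NULL) {
		free (send_buf);	/* the line the patch adds */
		live_bufs -= 1;
		return (NULL);
	}
	return (send_buf);
}
```

Without the free() on the failure path, every failed registration would
leave one send_buf allocated with no pointer to it.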

Re: [Openais] Question about recovery code

2011-07-06 Thread Steven Dake

On 07/06/2011 05:24 PM, Tim Beale wrote:
> Hi,
> 
> We've hit a problem in the recovery code and I'm struggling to understand why
> we do the following:
> 
>   /*
>* The recovery sort queue now becomes the regular
>* sort queue.  It is necessary to copy the state
>* into the regular sort queue.
>*/
>   sq_copy (&instance->regular_sort_queue, &instance->recovery_sort_queue);
> 
> The problem we're seeing is sometimes we get an encapsulated message from the
> recovery queue copied onto the regular queue, and corosync then crashes trying
> to process the message. (When it strips off the totemsrp header it gets 
> another
> totemsrp header rather than the totempg header it expects).
> 
> The problem seems to happen when we only do the sq_items_release() for a
> subset of the recovery messages, e.g. there are 12 messages on the recovery queue and
> we only free/release 5 of them. The remaining encapsulated recovery messages
> get left on the regular queue and corosync crashes trying to deliver them.
> 
> It looks to me like deliver_messages_from_recovery_to_regular() handles the
> encapsulation correctly, stripping the extra header and adding the recovery
> messages to the regular queue. But then the sq_copy() just seems to overwrite
> the regular queue.
> 
> We've avoided the crash in the past by just reiniting both queues, but I don't
> think this is the best solution.
> 

I would expect this solution to lead to message loss or lockup of the
protocol.

> Any advice would be appreciated.
> 
> Thanks,
> Tim

A proper fix should be in these commits:
master: 7d5e588931e4393c06790995a995ea69e6724c54
flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2

A new flatiron-1.3 release is in the works.  There are other totem bugs
you may wish to backport in the meantime.

Let us know if that commit fixes the problem you encountered.

Regards
-steve


Re: [Openais] [PATCH 2/4] build: make RDMA support an RPM build conditional

2011-07-06 Thread Steven Dake

On 07/06/2011 01:02 PM, Florian Haas wrote:
> On 07/06/2011 03:52 PM, Steven Dake wrote:
>> From: Florian Haas 
>>
>> Enable RDMA in RPM builds by default to maintain the previous behavior
>> (which always included --enable-rdma in the %configure invocation).
> 
> Steve, seeing that you acked all the others, any objections to this one?
> I didn't get your Reviewed-by here. Should I leave this one out when I
> fix up my tree for you to pull?
> 
> Cheers,
> Florian
> 

hmm.

I did push it; I have a lot of email open in the morning :)

Reviewed-by: Steven Dake 

Thanks
-steve


Re: [Openais] [PATCH 1/4] build: force LC_ALL=C correctly for dates

2011-07-06 Thread Steven Dake

Thanks for the patch

Reviewed-by: Steven Dake 

On 07/06/2011 06:52 AM, Steven Dake wrote:
> From: Florian Haas 
> 
> Failure to force "C" dates will have RPM et al. complain about invalid
> dates and timestamps.
> ---
>  Makefile.am |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Makefile.am b/Makefile.am
> index 0929ca7..252caf1 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -123,7 +123,7 @@ clean-generic:
>  
>  $(SPEC): $(SPEC).in
>   rm -f $@-t $@
> - LC_ALL=C date="$(shell date "+%a %b %d %Y")" && \
> + date="$(shell LC_ALL=C date "+%a %b %d %Y")" && \
>   if [ -f .tarball-version ]; then \
>   gitver="$(shell cat .tarball-version)" && \
>   rpmver=$$gitver && \
> @@ -190,7 +190,7 @@ gen_start_date = 2000-01-01
>  .PHONY: gen-ChangeLog
>  gen-ChangeLog:
>   if test -d .git; then   \
> - $(top_srcdir)/build-aux/gitlog-to-changelog \
> + LC_ALL=C $(top_srcdir)/build-aux/gitlog-to-changelog \
>   --since=$(gen_start_date) > $(distdir)/cl-t;\
>   rm -f $(distdir)/ChangeLog; \
>   mv $(distdir)/cl-t $(distdir)/ChangeLog;\
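
A note on why the first hunk moves LC_ALL=C inside the parentheses: make runs $(shell ...) itself when it expands the recipe, so a variable assignment placed in front of it on the recipe line never reaches that date command. The sketch below only illustrates the locale side of this, in C: the same "+%a %b %d %Y" format produces the English abbreviations RPM expects once the "C" locale is selected. The function name format_changelog_date is made up for this example.

```c
#include <locale.h>
#include <stddef.h>
#include <time.h>

/*
 * Format a date with the same "%a %b %d %Y" pattern the Makefile passes
 * to date(1), after selecting the "C" locale so that day and month names
 * are the English abbreviations.
 * Returns the number of characters written, or 0 on failure.
 */
size_t format_changelog_date (char *out, size_t out_len, const struct tm *when)
{
	if (setlocale (LC_ALL, "C") == NULL) {
		return (0);
	}
	return (strftime (out, out_len, "%a %b %d %Y", when));
}
```

With a struct tm describing 6 July 2011 (tm_year 111, tm_mon 6, tm_mday 6, tm_wday 3), the buffer holds "Wed Jul 06 2011". Under a non-English locale the %a and %b fields would be localized, which is the situation the commit message says RPM complains about.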


Re: [Openais] [GIT PULL] Minor fixes for RPM builds

2011-07-06 Thread Steven Dake

On 07/06/2011 07:08 AM, Florian Haas wrote:
> On 2011-07-06 15:59, Steven Dake wrote:
>> On 07/06/2011 06:56 AM, Florian Haas wrote:
>>> On 2011-07-06 15:49, Steven Dake wrote:
>>>> Florian,
>>>>
>>>> I'll take improvements however I can get them, but sending patches to
>>>> the list is preferred; that way multiple people can look at them.
>>>
>>> Arguably that counts for github too, as my repo happens to be quite
>>> public. :)
>>>
>>>> The way I generally do this is
>>>>
>>>> git send-email --to=open...@lists.osdl.org --smtp-server=server -3
>>>>
>>>> where -3 means the last 3 patches
>>>>
>>>> The --to address and SMTP server can be set in gitconfig as well.
>>>
>>> Fair enough, but do you actually prefer to "git am" each patch by hand?
>>> Wouldn't it make more sense to post the patches first, when reviewed and
>>> acknowledged fix up the git tree so you can merge easily, and then send
>>> a pull request?
>>>
>>
>> I do like git am; however, I am open to changes.
>>
>> I am not sure how to amend a commit in a patch set to include a
>> Reviewed-by line.  Git am lets me amend per patch.  Any tips here?
> 
> Hmmm. You can merge from my repo into yours, then use "git rebase -i
> " to edit commit messages and add your Reviewed-By lines. But
> the downside of this is that in place of my changesets it creates new
> ones, and then I have to reset my tree to match yours after
> you've pushed your changes.
> 
> I think normally what's most often done is the contributor posts patches
> first, gets review and testing feedback, then the _contributor_ adds
> Reviewed-By, Tested-By, etc., issues a pull request, and then the
> maintainer pulls, and no further changes to the commits are necessary.
> Does that sound workable?
> 
> Florian
> 
> 

Yup, that works for me if you prefer to work that way.

Regards
-steve


Re: [Openais] [PATCH 4/4] build: disable RDMA support in RPMs by default

2011-07-06 Thread Steven Dake

On 07/06/2011 06:52 AM, Steven Dake wrote:
> From: Florian Haas 
> 
> Rather than curiously disable RDMA support by default in configure and
> enable it by default in RPM builds, streamline the default
> configuration to always turn RDMA support off. It can be enabled in
> RPM builds with "--with rdma".
> ---
>  corosync.spec.in |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 

Reviewed-by: Steven Dake 

> diff --git a/corosync.spec.in b/corosync.spec.in
> index 34e1658..9585831 100644
> --- a/corosync.spec.in
> +++ b/corosync.spec.in
> @@ -10,7 +10,7 @@
>  %bcond_with monitoring
>  %bcond_with snmp
>  %bcond_with dbus
> -%bcond_without rdma
> +%bcond_with rdma
>  
>  Name: corosync
>  Summary: The Corosync Cluster Engine and Application Programming Interfaces

