[Linux-HA] Problem with function send_ordered_nodemsg

Audet, Jean-Michel Wed, 18 Jun 2008 08:18:13 -0700

Hi,
        I already sent this message and never got any feedback.  Here is my 
problem.


I have heartbeat 2.1.3 (same problem with 2.1.2).
I am using a Master/Slave model.

I am using the communication link of heartbeat to transfer data between 2 
nodes.  The payload is state and data.  Since, with Ethernet, I am limited in 
message size, I am transferring multiple chunks of 8K data for up to 1MB 
(120 * 8KB approx).  
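
For illustration, the chunking described above could look roughly like the following minimal sketch. It is not heartbeat code: the header layout and the names split_into_chunks and reassemble are hypothetical, and the actual send (send_ordered_nodemsg in this post) is left out entirely. It only shows one way to split a payload into 8 KB pieces that the receiver can put back together.

```python
import struct

CHUNK_SIZE = 8 * 1024  # 8 KB of payload per message, as described in the post

# Hypothetical header (an assumption, not a heartbeat format):
# transfer id (uint32), chunk index (uint16), total chunk count (uint16).
HEADER = struct.Struct("!IHH")


def split_into_chunks(transfer_id, payload, chunk_size=CHUNK_SIZE):
    """Split payload into header-prefixed chunks of at most chunk_size bytes."""
    # Ceiling division; an empty payload still produces one (empty) chunk.
    total = max(1, -(-len(payload) // chunk_size))
    chunks = []
    for index in range(total):
        body = payload[index * chunk_size:(index + 1) * chunk_size]
        chunks.append(HEADER.pack(transfer_id, index, total) + body)
    return chunks


def reassemble(chunks):
    """Rebuild the payload from chunks, which may arrive in any order."""
    parts = {}
    expected_total = None
    for chunk in chunks:
        _transfer_id, index, total = HEADER.unpack_from(chunk)
        expected_total = total
        parts[index] = chunk[HEADER.size:]
    if expected_total is None or len(parts) != expected_total:
        raise ValueError("missing chunks")
    return b"".join(parts[i] for i in range(expected_total))
```

With these assumptions, a full 1 MiB payload at 8 KiB per chunk comes out as 128 messages, each of which would then be handed to the cluster messaging call one at a time.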

The problem is that after a number of data sets (maybe 300 or 400, sometimes 
more, sometimes less, but it always happens), the function 
send_ordered_nodemsg hangs and I am not able to transfer data anymore.  From 
the debug information, it looks like it hangs in the function 
socket_resume_io_read. 

I have tried Unicast and Broadcast.

According to Dejan, it may be that I am pushing the heartbeat communication 
layer to the limit.  I am a little bit surprised that 1MB of data can be a 
problem.

I am stuck now and I need a solution because my application is not usable and 
I may have to look at other HA packages (I really don't want to).

Any input, suggestions, whatever will be greatly appreciated.
Would it be good to consider creating a new communication link 
(client/server)?

Jean-Michel Audet
Kontron Canada 


-----Original Message-----
From: Audet, Jean-Michel 
Sent: Thursday, June 05, 2008 11:13 AM
To: 'General Linux-HA mailing list'
Subject: Problem with function send_ordered_nodemsg 


Hi, 
        I currently have a problem with my software, which hangs when I call 
the function send_ordered_nodemsg (sendnodemsg exhibits the same problem).  I 
am able to send many messages (many dozens) and then it hangs.  With extra 
debug output, I found that it hangs somewhere in the function 
socket_resume_io_read. I based my code on the CIB implementation. 


I would welcome any help in finding the problem.  I know that the CIB uses 
this function, so I think the problem is on my side or that I don't know 
exactly how to use it, but I have been trying to find this problem for many 
days now. 

Maybe somebody has some experience with this function and has hit the same 
problem before.

Any help will be more than appreciated.

Jean-Michel Audet

<cib generated="false" admin_epoch="0" epoch="0" num_updates="0" have_quorum="true" ignore_dtd="false" ccm_transition="1" num_peers="1" cib-last-written="Wed Apr 30 12:27:23 2008">
   <configuration>
      <crm_config>
         <cluster_property_set id="idCluseterPropertySet">
            <attributes>
              <nvpair id="election_timeout" name="election_timeout" value="5sec"/>
              <nvpair id="crmd-integration-timeout" name="crmd-integration-timeout" value="30sec"/>
              <nvpair id="crmd-finalization_timeout" name="crmd-finalization_timeout" value="30sec"/>
              <nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="500"/>
              <nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-100"/>
            </attributes>
         </cluster_property_set>
      </crm_config>
      <nodes>
         <node id="d6c4c454-1c0e-4d77-9fbc-740801d7b36a" uname="node1" type="normal"/>
         <node id="226d6a05-8fa7-4956-ace3-9c7f5855ab86" uname="node2" type="normal"/>
      </nodes>
      <resources>		   
         <primitive id="openhpid_1" class="heartbeat" type="openhpid" provider="heartbeat"/>
         <primitive id="openhpid_2" class="heartbeat" type="openhpid" provider="heartbeat"/>
         <primitive id="openhpi_redundancy" class="ocf" type="IPaddr" provider="heartbeat" restart_type="ignore" is_managed="default">
            <instance_attributes id="b7de1d31-b108-4e13-9613-8284a85d3db0">
               <attributes>
                  <nvpair id="5c570f6b-924e-43c8-b8ae-5d03dd68b5f2" name="ip" value="192.168.0.100"/>
               </attributes>
            </instance_attributes>
         </primitive>
      </resources>
      <constraints>
        <rsc_location id="run_openhpid_1" rsc="openhpid_1">
            <rule id="pref_run_openhpid_1" score="INFINITY">
                <expression id="expr_run_openhpid_1" attribute="#uname" operation="eq" value="node1"/>
            </rule>
            <rule id="pref_norun_openhpid_1" score="-INFINITY">
                <expression id="expr_norun_openhpid_1" attribute="#uname" operation="ne" value="node1"/>
            </rule>
        </rsc_location>
        <rsc_location id="run_openhpid_2" rsc="openhpid_2">
            <rule id="pref_run_openhpid_2" score="INFINITY">
                <expression id="expr_run_openhpid_2" attribute="#uname" operation="eq" value="node2"/>
            </rule>
            <rule id="pref_norun_openhpid_2" score="-INFINITY">
                <expression id="expr_norun_openhpid_2" attribute="#uname" operation="ne" value="node2"/>
            </rule>
        </rsc_location>	
      </constraints>
   </configuration>
</cib>

ha.cf
Description: ha.cf

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
