On 19/09/11 14:53, Guillaume Bettayeb wrote:
> Hi Andrew,
>
> Yes, I tried with both "/etc/init.d/apache2 start" and "service apache2
> start".
>
> I suppose I have to test with the path defined in crm configure show, which
> might be something like /usr/bin/apache2 or similar (I am not at home just
> now, so I can't check).
Try running something like:

  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_configfile=/etc/apache2/apache2.conf \
  OCF_RESKEY_httpd=/usr/sbin/apache2 \
  /usr/lib/ocf/resource.d/heartbeat/apache start

Or, if you really want to see what the RA is doing:

  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_configfile=/etc/apache2/apache2.conf \
  OCF_RESKEY_httpd=/usr/sbin/apache2 \
  sh -x /usr/lib/ocf/resource.d/heartbeat/apache start

Note that those OCF_RESKEY_* vars need to match what you set for the
resource parameters in the crm config.

See also: http://www.clusterlabs.org/wiki/Debugging_Resource_Failures

Regards,

Tim

> I will have a look. Thanks very much for your help.
>
> G
>
> On 19 September 2011 00:41, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> On Fri, Sep 16, 2011 at 11:22 PM, Guillaume Bettayeb
>> <guillaume1...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I have been through my Apache configuration again and I confirm Apache
>>> works fine.
>>
>> I assume you're testing by running "/etc/init.d/apache2 start" or
>> something similar?
>> This is not what the cluster executes to start apache, and therefore the
>> test doesn't help much.
>>
>>> I have changed the corosync config file to dump all the corosync log into
>>> /var/log/corosync/corosync.log
>>>
>>> Then I restarted corosync, and the log file has the following:
>>>
>>> http://pastebin.com/BFVVfxCh
>>>
>>> Could the following lines be the consequence of the error?
>>
>> A consequence yes, but not the cause.
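As an aside on the manual invocation above: OCF resource agents report their outcome purely through the exit code, so after running the RA by hand it helps to translate `$?`. The mapping below follows the standard OCF return codes; the helper function itself is just my own convenience sketch, not something shipped with resource-agents:

```shell
#!/bin/sh
# Sketch: translate an OCF resource agent exit code into its symbolic name.
# The code-to-name mapping is the standard OCF one; the wrapper is mine.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "unknown rc $1" ;;
    esac
}

# Usage after a manual RA run:
#   OCF_ROOT=/usr/lib/ocf ... /usr/lib/ocf/resource.d/heartbeat/apache start
#   ocf_rc_name $?
ocf_rc_name 1    # prints OCF_ERR_GENERIC (the rc=1 seen in the logs below)
```

The rc=1 in the failed-actions output later in this thread is exactly OCF_ERR_GENERIC, i.e. "the agent failed, no more specific reason given", which is why the cluster status alone says little and a hand run of the RA is the right next step.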
>>
>>> root@node1:/var/log/corosync# cat corosync-apache.log | grep "INFINITY times"
>>> Sep 16 14:11:13 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node1
>>> Sep 16 14:11:15 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node1
>>> Sep 16 14:11:16 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node1
>>> Sep 16 14:11:18 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node1
>>> Sep 16 14:11:18 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node2
>>> Sep 16 14:11:18 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node1
>>> Sep 16 14:11:18 node1 pengine: [7103]: info: get_failcount: apache has failed INFINITY times on node2
>>>
>>> Thanks again,
>>>
>>> G
>>>
>>> On 16 September 2011 11:00, Guillaume Bettayeb <guillaume1...@gmail.com> wrote:
>>>
>>>> Hi Dejan,
>>>>
>>>> I am not sure, because Apache runs like a charm when not started via
>>>> Corosync, but I don't know.
>>>>
>>>> Thanks,
>>>>
>>>> Guillaume
>>>>
>>>> On 16 September 2011 09:01, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On Fri, Sep 16, 2011 at 03:01:12AM +0100, Guillaume Bettayeb wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I am still struggling to run apache under corosync. My Apache service
>>>>>> is OK and runs fine if I start it manually, and I have mod_status
>>>>>> enabled on both nodes. Ulrich made a good point earlier by asking if
>>>>>> the cgisock in /var/run/apache2 was used by another process, but
>>>>>> that's not the case.
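A side note on the "failed INFINITY times" lines above: once the failcount hits INFINITY on a node, the policy engine bans the resource from that node, so even after the underlying start failure is fixed the resource stays Stopped until the failcount is cleared. A sketch of the reset step ("apache" matches the primitive name in the crm config quoted further down; the fallback branch just keeps this runnable on a machine without the crm shell):

```shell
#!/bin/sh
# Sketch: clear a resource's failcount after fixing the underlying failure.
# "apache" is the resource name from the crm config in this thread.

cleanup_cmd() {
    # crm shell (Pacemaker 1.0-era) syntax; resets the failcount and the
    # failed-operation history for the resource on all nodes
    echo "crm resource cleanup $1"
}

if command -v crm >/dev/null 2>&1; then
    $(cleanup_cmd apache)
else
    echo "crm shell not installed here; would run: $(cleanup_cmd apache)"
fi
```

Until that cleanup is done, retesting the cluster can look like the fix "did nothing", which is easy to confuse with the original problem.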
>>>>>>
>>>>>> I've restarted corosync after midnight and have added the full syslog here:
>>>>>>
>>>>>> http://pastebin.com/LXmLUu3W
>>>>>>
>>>>>> I keep digging on Google to find out why it is not working, but any
>>>>>> help would be greatly appreciated. Does anyone else here run an
>>>>>> Ubuntu cluster with Apache, by any chance?
>>>>>
>>>>> Perhaps, but since this seems to be an issue with apache on
>>>>> Ubuntu, I guess it's best to enquire in some Ubuntu forum.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guillaume
>>>>>>
>>>>>> On 15 September 2011 16:04, Guillaume Bettayeb <guillaume1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ulrich,
>>>>>>>
>>>>>>> Nope, there's nothing in there at the moment:
>>>>>>> root@node1:/home/user# ls -l /var/run/apache2
>>>>>>> ls: cannot access /var/run/apache2: No such file or directory
>>>>>>>
>>>>>>> It looks like that error comes up when the cluster starts apache.
>>>>>>>
>>>>>>> Guillaume
>>>>>>>
>>>>>>> On 15 September 2011 16:00, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> What about "ls -l /var/run/apache2"? Any cgisock* there? Permissions
>>>>>>>> of the directory OK? Who is using that cgisock?
>>>>>>>>
>>>>>>>> Ulrich
>>>>>>>>
>>>>>>>>>>> Guillaume Bettayeb <guillaume1...@gmail.com> wrote on 15.09.2011 at 16:06 in message
>>>>>>>> <CAG6QY=LP6S1t=+jsq7qv+rmp5qqq02crnbea38nzbgvbd...@mail.gmail.com>:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Thanks for your advice. I have double-checked mod_status in Apache
>>>>>>>>> and it's definitely enabled on both nodes:
>>>>>>>>>
>>>>>>>>> ls /etc/apache2/mods-enabled
>>>>>>>>> alias.conf            authz_user.load  dir.conf          reqtimeout.conf
>>>>>>>>> alias.load            autoindex.conf   dir.load          reqtimeout.load
>>>>>>>>> auth_basic.load       autoindex.load   env.load          setenvif.conf
>>>>>>>>> authn_file.load       cgid.conf        mime.conf         setenvif.load
>>>>>>>>> authz_default.load    cgid.load        mime.load         status.conf
>>>>>>>>> authz_groupfile.load  deflate.conf     negotiation.conf  status.load
>>>>>>>>> authz_host.load       deflate.load     negotiation.load
>>>>>>>>>
>>>>>>>>> I have checked the status page at http://node/server-status and I
>>>>>>>>> can see the status page OK. mod_status is enabled on my node and
>>>>>>>>> runs fine.
>>>>>>>>>
>>>>>>>>> I had a look at my Apache log as you advised, but I can't see Apache
>>>>>>>>> complaining about a specific error, apart from multiple stops and
>>>>>>>>> restarts due to my tests:
>>>>>>>>>
>>>>>>>>> [Thu Sep 15 14:20:37 2011] [notice] caught SIGTERM, shutting down
>>>>>>>>> [Thu Sep 15 14:20:38 2011] [notice] Apache/2.2.17 (Ubuntu) configured -- resuming normal operations
>>>>>>>>> [Thu Sep 15 14:20:38 2011] [error] (2)No such file or directory: Couldn't bind unix domain socket /var/run/apache2/cgisock.4278
>>>>>>>>> [Thu Sep 15 14:20:39 2011] [notice] caught SIGTERM, shutting down
>>>>>>>>>
>>>>>>>>> That's for the primary node. It looks like Corosync shuts down Apache.
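About the "Couldn't bind unix domain socket /var/run/apache2/cgisock.*" errors above: on Ubuntu, /var/run is a tmpfs, and /var/run/apache2 is normally (re)created by the init-script path (via envvars/apache2ctl), not by the httpd binary itself. As far as I can tell, when the OCF RA launches apache2 directly after a reboot the directory can simply be missing, which also matches the "ls: cannot access /var/run/apache2" result earlier in the thread. A minimal sketch of pre-creating it, assuming that is indeed the cause (the /tmp path in the demo call and the www-data owner are placeholders for a real /var/run/apache2 setup):

```shell
#!/bin/sh
# Sketch (assumption: the cgid failure is only the missing runtime dir).
# Ensure a daemon's runtime directory exists before the daemon starts;
# on a tmpfs /var/run this has to be redone after every reboot.
ensure_run_dir() {
    dir=$1; owner=$2
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir"
        # chown needs root for the real /var/run/apache2; harmless to
        # skip in this demo if the user doesn't exist here
        chown "$owner" "$dir" 2>/dev/null || true
        chmod 0755 "$dir"
    fi
}

# Real use would be something like:
#   ensure_run_dir /var/run/apache2 www-data
# run before the apache RA starts (e.g. from a small wrapper script).
ensure_run_dir /tmp/demo-apache2-run www-data
```

Whether the missing directory is also what makes the RA's start/monitor fail, or just a side effect, is worth confirming with the manual RA run suggested at the top of this thread.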
>>>>>>>>> In the Apache log file of the second node, I see the following:
>>>>>>>>>
>>>>>>>>> [Thu Sep 15 14:18:27 2011] [notice] Apache/2.2.17 (Ubuntu) configured -- resuming normal operations
>>>>>>>>> [Thu Sep 15 14:18:27 2011] [error] (2)No such file or directory: Couldn't bind unix domain socket /var/run/apache2/cgisock.1338
>>>>>>>>> [Thu Sep 15 14:18:28 2011] [crit] cgid daemon failed to initialize
>>>>>>>>>
>>>>>>>>> I still have errors, but the http service keeps on running, no SIGTERM.
>>>>>>>>>
>>>>>>>>> And then my node status is:
>>>>>>>>>
>>>>>>>>> Online: [ node1 node2 ]
>>>>>>>>>
>>>>>>>>> Resource Group: group1
>>>>>>>>>     failover-ip (ocf::heartbeat:IPaddr): Started node1
>>>>>>>>>     apache      (ocf::heartbeat:apache): Stopped
>>>>>>>>>
>>>>>>>>> Failed actions:
>>>>>>>>>     apache_start_0 (node=node2, call=6, rc=1, status=complete): unknown error
>>>>>>>>>     apache_monitor_0 (node=node1, call=3, rc=1, status=complete): unknown error
>>>>>>>>>     apache_start_0 (node=node1, call=7, rc=1, status=complete): unknown error
>>>>>>>>>
>>>>>>>>> Of interest, some information found in /var/log/syslog on node1:
>>>>>>>>>
>>>>>>>>> Sep 15 02:32:56 node1 crmd: [710]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
>>>>>>>>> Sep 15 02:32:56 node1 apache[928]: INFO: apache not running
>>>>>>>>> Sep 15 02:32:56 node1 apache[928]: INFO: waiting for apache /etc/apache2/apache2.conf to come up
>>>>>>>>> Sep 15 02:32:58 node1 apache[928]: INFO: Killing apache PID 995
>>>>>>>>> Sep 15 02:32:59 node1 lrmd: [707]: info: RA output: (apache:start:stderr) kill: 833:
>>>>>>>>> Sep 15 02:32:59 node1 lrmd: [707]: info: RA output: (apache:start:stderr) No such process
>>>>>>>>> Sep 15 02:32:59 node1 lrmd: [707]: info: RA output: (apache:start:stderr)
>>>>>>>>> Sep 15 02:32:59 node1 apache[928]: INFO: Killing apache PID 995
>>>>>>>>> Sep 15 02:32:59 node1 apache[928]: INFO: apache stopped.
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: process_lrm_event: LRM operation apache_start_0 (call=6, rc=1, cib-update=37, confirmed=true) unknown error
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: WARN: status_from_rc: Action 8 (apache_start_0) on node1 failed (target: 0 vs. rc: 1): Error
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: WARN: update_failcount: Updating failcount for apache on node1 after failed start: rc=1 (update=INFINITY, time=1316050379)
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=apache_start_0, magic=0:1;8:3:0:a4e41810-3e8f-439a-9b92-489edf657291, cib=0.172.10) : Event failed
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: update_abort_priority: Abort priority upgraded from 0 to 1
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: update_abort_priority: Abort action done superceeded by restart
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: match_graph_event: Action apache_start_0 (8) confirmed on node1 (rc=4)
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: run_graph: ====================================================
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: notice: run_graph: Transition 3 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=/var/lib/pengine/pe-input-247.bz2): Stopped
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: te_graph_trigger: Transition 3 is now complete
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
>>>>>>>>> Sep 15 02:32:59 node1 crmd: [710]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
>>>>>>>>>
>>>>>>>>> Here is my crm configure show -- is there anything I can change there?
>>>>>>>>>
>>>>>>>>> root@node1:/home/user# crm configure show
>>>>>>>>> node node1 \
>>>>>>>>>     attributes standby="off"
>>>>>>>>> node node2 \
>>>>>>>>>     attributes standby="off"
>>>>>>>>> primitive apache ocf:heartbeat:apache \
>>>>>>>>>     params configfile="/etc/apache2/apache2.conf" httpd="/usr/sbin/apache2" \
>>>>>>>>>     op start interval="10" timeout="40s" \
>>>>>>>>>     op stop interval="10" timeout="60s" \
>>>>>>>>>     op monitor interval="5s"
>>>>>>>>> primitive failover-ip ocf:heartbeat:IPaddr \
>>>>>>>>>     params ip="192.168.0.105" \
>>>>>>>>>     op monitor interval="5s"
>>>>>>>>> group group1 failover-ip apache
>>>>>>>>> location cli-prefer-failover-ip failover-ip \
>>>>>>>>>     rule $id="cli-prefer-rule-failover-ip" inf: #uname eq node1
>>>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>>>     dc-version="1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
>>>>>>>>>     cluster-infrastructure="openais" \
>>>>>>>>>     expected-quorum-votes="2" \
>>>>>>>>>     stonith-enabled="false" \
>>>>>>>>>     no-quorum-policy="ignore"
>>>>>>>>>
>>>>>>>>> Thank you for your help,
>>>>>>>>>
>>>>>>>>> Guillaume
>>>>>>>>>
>>>>>>>>> On 13 September 2011 08:29, Tim Serong <tser...@suse.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 13/09/11 00:39, Guillaume Bettayeb wrote:
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> This is my first post on this list, so hello everybody :)
>>>>>>>>>>>
>>>>>>>>>>> I am currently testing the fun of Linux HA clustering (just for
>>>>>>>>>>> personal interest), and I have successfully set up a tiny Ubuntu
>>>>>>>>>>> VirtualBox 2-node cluster with IP failover and Apache running as
>>>>>>>>>>> resources.
>>>>>>>>>>>
>>>>>>>>>>> Right after the install, I tried to move the resources from one
>>>>>>>>>>> node to the other (standby command), and everything worked like a
>>>>>>>>>>> charm.
>>>>>>>>>>> Then I tried some failure tests, and started with a simple
>>>>>>>>>>> /etc/init.d/networking stop on one node; I noticed that the other
>>>>>>>>>>> node took ownership of the resources automatically, and all was
>>>>>>>>>>> fine.
>>>>>>>>>>>
>>>>>>>>>>> Then I rebooted the nodes just to see how they would restart the
>>>>>>>>>>> cluster, and since then I have the following error:
>>>>>>>>>>>
>>>>>>>>>>> apache_start_0 (node=node1, call=8, rc=1, status=complete): unknown error
>>>>>>>>>>>
>>>>>>>>>>> For reading convenience, my outputs are available at
>>>>>>>>>>> http://pastebin.com/w1J4TWaG
>>>>>>>>>>> Just to clarify, that's:
>>>>>>>>>>> - the crm configure show output
>>>>>>>>>>> - the crm_mon status
>>>>>>>>>>> - all relevant information from my /var/log/syslog (although I was
>>>>>>>>>>>   not sure what to look at; I have never used corosync before)
>>>>>>>>>>>
>>>>>>>>>>> I have read in an older post that this apache error usually has
>>>>>>>>>>> something to do with either the timeout or mod_status.
>>>>>>>>>>> As you can see on my pastebin, my timeout values are OK:
>>>>>>>>>>> op stop interval="60s" timeout="120" \
>>>>>>>>>>> op start interval="60s" timeout="120" \
>>>>>>>>>>>
>>>>>>>>>>> As for mod_status, it's already enabled in Apache:
>>>>>>>>>>> root@node1:/etc/apache2# a2enmod status
>>>>>>>>>>> Module status already enabled
>>>>>>>>>>>
>>>>>>>>>>> Have I done anything wrong, or is there anything else I should
>>>>>>>>>>> check/configure?
>>>>>>>>>>>
>>>>>>>>>>> Any help with this matter would be greatly appreciated :)
>>>>>>>>>>
>>>>>>>>>> On a punt, it's probably mod_status. Check your Apache logs at the
>>>>>>>>>> time the start failed. If it's whining about a 403 or 404 for
>>>>>>>>>> /server-status (or similar), you need to fix that in your Apache
>>>>>>>>>> config.
>>>>>>>>>>
>>>>>>>>>> HTH,
>>>>>>>>>>
>>>>>>>>>> Tim
>>>>>>>>>> --
>>>>>>>>>> Tim Serong
>>>>>>>>>> Senior Clustering Engineer
>>>>>>>>>> SUSE
>>>>>>>>>> tser...@suse.com
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Linux-HA mailing list
>>>>>>>>>> Linux-HA@lists.linux-ha.org
>>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>> See also: http://linux-ha.org/ReportingProblems

--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com