Re: [Toolserver-l] Status of the toolserver

2013-05-13 Thread Alex Brollo
Just some quick feedback: a few hours ago I could log in again, but qstat gave
an error message while crontab was running normally; now qstat works again.

At present only an IRC script is running under the Alebot account, and it can be
considered a test routine; should I stop it to make the server update easier?

Alex


___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette

[Toolserver-l] Status of the toolserver

2013-05-13 Thread DaB.
Hello all,

as you have surely noticed, the toolserver is even more unstable and unreliable
than normal at the moment. The reason is that our ha-nodes are no longer
working as intended, and neither Nosy nor I can fix this.

A quick word on what ha-nodes are: the "ha" stands for "high availability", and
we have 2 servers for that. Some services at the toolserver are so important
that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason
these services live on the ha-nodes. If one server goes down or crashes, then
the other can continue to operate all services with little or no interruption
and without any work by a root. That worked great as long as River was
here and not so well in the last months, but now it is totally broken.
The problem is that both ha-nodes run Solaris and none of the roots is a Solaris
expert, which makes it hard for us to find errors, or in this case impossible.
We have set up a very ugly workaround, but it is not stable, and so downtime of
important services causes downtime for the whole toolserver, and more work for
the roots.
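
To illustrate the failover idea described above, here is a minimal Python
sketch. It is not the actual cluster setup on the ha-nodes, which this mail
does not describe; the `Node` class, the `fail_over` function and the node
names are hypothetical and only model the "one node takes over from the
other" behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One of the two ha-nodes (names used with this class are hypothetical)."""
    name: str
    healthy: bool = True
    services: set = field(default_factory=set)


def fail_over(active: Node, standby: Node) -> Node:
    """Return the node that should host the services after a health check.

    If the active node is healthy, nothing changes. If it is down and the
    standby is healthy, the services move to the standby without any manual
    step. If both nodes are down, a RuntimeError is raised.
    """
    if active.healthy:
        return active
    if not standby.healthy:
        raise RuntimeError("both ha-nodes are down; services unavailable")
    # Hand every service over to the standby node.
    standby.services |= active.services
    active.services = set()
    return standby


if __name__ == "__main__":
    ha1 = Node("ha-1", services={"/home", "LDAP", "DNS"})
    ha2 = Node("ha-2")
    ha1.healthy = False  # simulate a crash of the active node
    current = fail_over(ha1, ha2)
    print(current.name, sorted(current.services))
```

In this model the takeover needs no work by a root, which is the property
that is currently broken on the real ha-nodes.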

We can only think of one solution: replacing Solaris on the ha-nodes with
Linux. But this cannot start before Friday, and it will take some time until
everything is moved over. It will also cause some hours of complete downtime
while /home is copied (we will announce this separately). In the best case
everything will be working again when Whitsun is over; in the worst case it
will take 2 weeks (I will be away between 21 and 26 for the general meeting of
WMDE). Repairing the ha-nodes has top priority, so everything else will be
delayed (linux-update, database-reimports, account-creation (for VERY
important ones send me a mail), etc.).

If you have questions, please send them to the ML.

Sincerely,
DaB.

-- 
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885



Re: [Toolserver-l] Cronie error

2013-05-13 Thread Tim Landscheidt
Lars Aronsson  wrote:

>>> I have a cronietab job that now gives this error message:

>>> error: commlib error: can't connect to service (Connection refused)
>>> Unable to run job: unable to send message to qmaster
>>> using port 444 on host "damiana": got send error.
>>> Exiting.

>> Now it has changed to:

>> error: commlib error: can't connect to service (Connection refused)
>> Unable to run job: unable to send message to qmaster using
>> port 444 on host "turnera-bge0": got send error.
>> Exiting.

> Here is a third variant, that I got today:

> error: commlib error: can't connect to service (Connection refused)
> Unable to run job: unable to send message to qmaster using port 444 on host 
> "clematis.toolserver.org": got send error.
> Exiting.

> Can someone please explain how I should submit a cron/cronie job?

You shouldn't change anything.  There have been some transient
errors in connection with the outage (NFS/SGE/LDAP failure),
and these are artifacts of those.  At the moment, SGE is up
and running.
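
Nothing has to change for existing jobs. Still, for scripts that should
tolerate transient submission failures like the ones quoted above, a retry
with a growing delay is one common approach. The following is a minimal
sketch under that assumption; `run_with_retry` and its parameters are
hypothetical names, not part of SGE or of the toolserver setup.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=1.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or the attempts are used up.

    The delay between attempts doubles after each failure. The exit code
    of the last attempt is returned. `runner` and `sleep` can be replaced
    for testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return 0
        if attempt < attempts:
            sleep(delay)  # wait before the next try
            delay *= 2    # back off: double the waiting time
    return result.returncode
```

A call could look like `run_with_retry(["qsub", "myjob.sh"])`, where
`myjob.sh` stands in for your own job script.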

Tim



Re: [Toolserver-l] Cronie error

2013-05-13 Thread Lars Aronsson

On 05/03/2013 11:16 PM, Lars Aronsson wrote:
> On 05/02/2013 04:48 PM, Lars Aronsson wrote:
>> I have a cronietab job that now gives this error message:
>>
>> error: commlib error: can't connect to service (Connection refused)
>> Unable to run job: unable to send message to qmaster using port 444
>> on host "damiana": got send error.
>> Exiting.
>
> Now it has changed to:
>
> error: commlib error: can't connect to service (Connection refused)
> Unable to run job: unable to send message to qmaster using port 444 on
> host "turnera-bge0": got send error.
> Exiting.

Here is a third variant that I got today:

error: commlib error: can't connect to service (Connection refused)
Unable to run job: unable to send message to qmaster using port 444 on host
"clematis.toolserver.org": got send error.
Exiting.


Can someone please explain how I should submit a cron/cronie job?


--
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se


