Re: [Toolserver-l] Status of the toolserver
Just some quick feedback: a few hours ago I could log in again, but qstat gave an error message while crontab was running normally; now qstat works again. At present only an IRC script is running under the Alebot account, and it can be considered a test routine. Should I stop it to make the server update easier?

Alex

2013/5/13 DaB.:
> Hello all,
>
> as you have surely noticed the toolserver is even more unstable and
> unreliable than normal at the moment. [...]

___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette
[Toolserver-l] Status of the toolserver
Hello all,

as you have surely noticed, the toolserver is even more unstable and unreliable than normal at the moment. The reason is that our ha-nodes are no longer working as intended, and neither Nosy nor I are able to fix this.

A quick word on what ha-nodes are: the "ha" stands for "highly available", and we have 2 servers for that. Some services at the toolserver are so important that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason these services live on the ha-nodes. If one server goes down or crashes, the other can continue to operate all services with little or no interruption and without any work by a root. That worked great as long as River was here and not so well in the last months, but now it is totally broken.

The problem is that both ha-nodes run Solaris and none of the roots are Solaris experts, which makes it hard for us to find errors, or in this case impossible. We have set up a very ugly workaround, but it is not stable, so downtime of important services causes downtime for the whole toolserver, and more work for the roots.

We can only think of one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday, and it will take some time until everything is moved over. It will also cause some hours of complete downtime while /home is copied (we will announce this separately). In the best case everything will be working again when Whitsun is over; in the worst case it will take 2 weeks (I will be away between 21 and 26 for the general meeting of WMDE).

Repairing the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.).

If you have questions, please send them to the ML.

Sincerely,
DaB.

-- 
Userpage: [[:w:de:User:DaB.]] - PGP: 0x2d3ee2d42b255885
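The failover behaviour described above (one ha-node takes over when the other goes down) can be illustrated with a minimal sketch. This is not how the toolserver's ha-nodes are actually implemented; the function `choose_active`, the node names `ha1`/`ha2`, and the health-check callback are all hypothetical and exist only for illustration.

```python
def choose_active(nodes, is_healthy, current=None):
    """Return the node that should run the critical services.

    Keeps the current node while it is healthy; otherwise falls back to
    the first healthy node in `nodes`. Returns None if no node is healthy.
    """
    if current is not None and is_healthy(current):
        return current
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical usage: "ha1" has crashed, so the services move to "ha2".
nodes = ["ha1", "ha2"]
active = choose_active(nodes, lambda n: n != "ha1", current="ha1")
```

The point of the sketch is only that the takeover needs no manual step as long as the health check itself is reliable.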
Re: [Toolserver-l] Cronie error
Lars Aronsson wrote:

>>> I have a cronietab job that now gives this error message:
>>> error: commlib error: can't connect to service (Connection refused)
>>> Unable to run job: unable to send message to qmaster
>>> using port 444 on host "damiana": got send error.
>>> Exiting.

>> Now it has changed to:
>> error: commlib error: can't connect to service (Connection refused)
>> Unable to run job: unable to send message to qmaster using
>> port 444 on host "turnera-bge0": got send error.
>> Exiting.

> Here is a third variant, that I got today:
> error: commlib error: can't connect to service (Connection refused)
> Unable to run job: unable to send message to qmaster using port 444 on host
> "clematis.toolserver.org": got send error.
> Exiting.

> Can someone please explain how I should submit a cron/cronie job?

You shouldn't change anything. There have been some transient errors in connection with the outage (NFS/SGE/LDAP failure), and these are artifacts of those. At the moment, SGE is up and running.

Tim
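Nothing in the job itself has to change, but a cron-driven submission can be made more tolerant of transient failures like the ones quoted above by retrying the submit command. The following is a minimal sketch, not an official toolserver recommendation; `run_with_retry` and its parameters are hypothetical, and it only assumes that the submit command exits with a non-zero status when it cannot reach the qmaster.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=1.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run `cmd` (a list of arguments), retrying on a non-zero exit status.

    Waits delay * attempt_number seconds between attempts (linear backoff)
    and returns the exit status of the last attempt.
    """
    status = 1
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        status = result.returncode
        if status == 0:
            return 0
        if attempt < attempts:
            sleep(delay * attempt)
    return status
```

A crontab entry could then call a small script that passes its usual submit command to `run_with_retry` and exits with the returned status, so that cron still reports a failure if every attempt fails.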
Re: [Toolserver-l] Cronie error
On 05/03/2013 11:16 PM, Lars Aronsson wrote:
> On 05/02/2013 04:48 PM, Lars Aronsson wrote:
>> I have a cronietab job that now gives this error message:
>> error: commlib error: can't connect to service (Connection refused)
>> Unable to run job: unable to send message to qmaster using port 444 on host "damiana": got send error.
>> Exiting.
>
> Now it has changed to:
> error: commlib error: can't connect to service (Connection refused)
> Unable to run job: unable to send message to qmaster using port 444 on host "turnera-bge0": got send error.
> Exiting.

Here is a third variant, that I got today:

error: commlib error: can't connect to service (Connection refused)
Unable to run job: unable to send message to qmaster using port 444 on host "clematis.toolserver.org": got send error.
Exiting.

Can someone please explain how I should submit a cron/cronie job?

-- 
Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se