Re: [Toolserver-l] Status of the toolserver
Just some quick feedback: a few hours ago I could log in again, but qstat gave an error message while crontab was running normally; now qstat works again. At present only an IRC script is running under the Alebot account, and it can be considered a test routine. Should I stop it to make the server update easier?

Alex

2013/5/13 DaB. w...@daniel.baur4.info

Hello all,

as you have surely noticed, the toolserver is even more unstable and unreliable than normal at the moment. The reason is that our ha-nodes are no longer working as intended, and neither Nosy nor I am able to fix this.

A quick word on what ha-nodes are: "ha" stands for "high availability", and we have 2 servers for that. Some services at the toolserver are so important that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason these services live on the ha-nodes. If one server goes down or crashes, the other can continue to operate all services with little or no interruption and without intervention by a root. That worked great as long as River was here and not so well in the last months, but now it is totally broken.

The problem is that both ha-nodes run Solaris and none of the roots is a Solaris expert, which makes it hard, or in this case impossible, for us to find errors. We have set up a very ugly workaround, but it is not stable, so downtime of important services causes downtime for the whole toolserver, and more work for the roots.

We can think of only one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday, and it will take some time until everything is moved over. It will also cause some hours of complete downtime while /home is copied (we will announce this separately). In the best case everything will be working again when Whitsun is over; in the worst case it will take 2 weeks (I will be away between the 21st and the 26th for the general meeting of WMDE).

The repair of the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.). If you have questions, please send them to the ML.

Sincerely,
DaB.

--
Userpage: [[:w:de:User:DaB.]]
PGP: 0x2d3ee2d42b255885

___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
[Toolserver-l] weird qcronsub errors (was: Output from cron command)
Hi,

For a few days now I have been getting weird errors when submitting tasks. My cron job calls /home/mazder/public_html/replicate-sequences/update-submit.sh, which contains the following command:

    qcronsub -l h_rt=0:05:00 -l virtual_free=100M -l arch=* -l sql-user-m=1 -N mazder-replicate-sequences -m as -o '/home/mazder/public_html/replicate-sequences/sge' '/home/mazder/public_html/replicate-sequences/update-run.sh'

Most of these calls produce the error below, which does not seem to be an error in my code, as I use neither XML nor Python. Do you have any idea what's going wrong?

Peter

Original Message
Subject: Output from cron command
Date: Tue, 14 May 2013 08:40:00 + (UTC)
From: maz...@toolserver.org (mazder)
To: maz...@toolserver.org

Your cron job on clematis
/home/mazder/public_html/replicate-sequences/update-submit.sh

produced the following output:

error: JSV stderr: Traceback (most recent call last):
error: JSV stderr:   File "/sge/GE/bin/sol-amd64/qjobtest", line 108, in <module>
error: JSV stderr:     dom = minidom.parse(child_stdout)
error: JSV stderr:   File "/opt/ts/python/2.7/lib/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
error: JSV stderr:     return expatbuilder.parse(file)
error: JSV stderr:   File "/opt/ts/python/2.7/lib/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 930, in parse
error: JSV stderr:     result = builder.parseFile(file)
error: JSV stderr:   File "/opt/ts/python/2.7/lib/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
error: JSV stderr:     parser.Parse(buffer, 0)
error: JSV stderr: xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Unable to run job: JSV stderr: Traceback (most recent call last):
JSV stderr:   File "/sge/GE/bin/sol-amd64/qjobtest", line 108, in <module>
JSV stderr:     dom = minidom.parse(child_stdout)
JSV stderr:   File "/opt/ts/python/2.7/lib/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
JSV stderr:     return expatbuilder.parse(file)
JSV stderr:   File "/opt/ts/python/2.7/lib/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 930, in parse
JSV stderr:     result = builder.parseFile(file)
JSV stderr:   File "/opt/ts/python/2.7/lib/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
JSV stderr:     parser.Parse(buffer, 0)
JSV stderr: xml.parsers.expat.ExpatError: syntax error: line 1, column 0
JSV stderr is - xml.parsers.expat.ExpatError: syntax error: line 1, column 0.
Exiting.
Re: [Toolserver-l] weird qcronsub errors
Peter Körner osm-li...@mazdermind.de wrote:

For a few days now I have been getting weird errors when submitting tasks. My cron job calls /home/mazder/public_html/replicate-sequences/update-submit.sh, which contains the following command:

    qcronsub -l h_rt=0:05:00 -l virtual_free=100M -l arch=* -l sql-user-m=1 -N mazder-replicate-sequences -m as -o '/home/mazder/public_html/replicate-sequences/sge' '/home/mazder/public_html/replicate-sequences/update-run.sh'

Most of these calls produce the error below, which does not seem to be an error in my code, as I use neither XML nor Python. Do you have any idea what's going wrong?

[...]

An educated guess: the Python errors come from the script /sge/GE/bin/sol-amd64/qjobtest, which is called as part of qcronsub to test whether a job with that name is already running. qjobtest parses the output of qstat -xml ..., which in normal operation returns a valid XML document. My assumption is that when SGE is down, qstat returns its error messages (error: commlib error: can't connect to service (Connection refused), etc.) as plain text. That text cannot be parsed as XML, which in turn causes qjobtest to barf.

In short: this is another artefact of SGE being down at that moment. You can't do anything about it; just ignore it.

Tim
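
The failure mode Tim describes can be reproduced with a small sketch. This is not the code of qjobtest (its source is not shown in this thread); the function names are hypothetical, and the assumption that each job's name sits in a JB_name element of the qstat -xml output is labeled as such in the comments. The sketch only shows that handing a plain-text error message to minidom raises ExpatError, and one way a caller might catch it.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def job_names_from_qstat(output):
    """Return the job names found in qstat -xml style output.

    Returns None when the output is not well-formed XML, for example
    when qstat printed a plain-text error because SGE was unreachable.
    """
    try:
        dom = minidom.parseString(output)
    except ExpatError:
        # Same exception class as in the traceback quoted in this thread.
        return None
    names = []
    # Assumption: each job's name is stored in a <JB_name> element.
    for node in dom.getElementsByTagName("JB_name"):
        child = node.firstChild
        if child is not None and child.nodeType == child.TEXT_NODE:
            names.append(child.data)
    return names


def job_is_listed(output, name):
    """Hypothetical helper: True if a job called `name` appears in the output."""
    names = job_names_from_qstat(output)
    if names is None:
        raise RuntimeError("qstat did not return XML; is SGE reachable?")
    return name in names
```

With a check like this, a wrapper could report "SGE not reachable" in one line instead of a full traceback. Whether it should then retry, skip the submission, or fail is a design choice this sketch leaves open.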
Re: [Toolserver-l] Status of the toolserver
On Mon, May 13, 2013, at 05:01 PM, DaB. wrote:

The repairing of the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.). If you have questions, please send them to the ML.

Is the current outage of replication on sql-s1-user (now approaching 48 hours) related to this ha-node problem? At least some other dbs seem to still have replication working.

--
Russell Blau
russb...@imapmail.org
Re: [Toolserver-l] Status of the toolserver
Linux is your best bet. Also, errors 404 and 401 are non-responsive. I can connect to all servers, but on 2 of them msg/nickserver/password gives the 401/404 error stub. See if this information helps you; if not, write back to me.

Best regards,
[MILASTARX]:[TS]

On May 13, 2013 6:02 PM, DaB. w...@daniel.baur4.info wrote:

[...]