Re: [Labs-l] Permissions issue

2016-06-04 Thread Marc A. Pelletier
On 2016-06-04 12:06 PM, Huji Lee wrote: Is the source code of this "take" script somewhere public? I wonder how it works. It's in git: https://phabricator.wikimedia.org/diffusion/LTOL/browse/master/src/take.cc It works because it does the chown as root; but most of the code is security

Re: [Labs-l] Signpost

2016-05-15 Thread Marc A. Pelletier
On 2016-05-12 5:54 PM, Huji Lee wrote: I know with WordPress the issue is not about the spread of vulnerabilities from one server instance to another, but I wonder how Labs is secured against the latter specifically. The security domain for labs is the project, not the instance; but they are

Re: [Labs-l] Labs privacy policy questions

2016-03-08 Thread Marc A. Pelletier
On 16-03-08 07:13 AM, Maximilian Doerr wrote: > No tool labs tool or project has any business collecting usernames and > passwords unless it's a local tool login completely separate login from > WMF wikis. That is exactly what this provision is about (credentials for the tool itself, if any).

Re: [Labs-l] /usr/bin/jsub -once echo "Goodbye!"

2015-12-29 Thread Marc A. Pelletier
On 15-12-29 09:06 AM, Maximilian Doerr wrote: > I'm confused, are you retiring, or will you be working for other areas of the > Foundation? I'm sorry if that wasn't clear - I'm leaving the foundation; my future involvement will be as a volunteer only. :-) -- Marc

[Labs-l] [Tools] Default release for gridengine to change to Trusty

2015-12-09 Thread Marc A. Pelletier
Hello Tool Labs, For quite some time, now, running jobs on Trusty has been supported by requesting so explicitly (with '-l release=trusty') on gridengine. It has been working fine, and given the need to gradually phase out Precise from tool labs, we will be changing the default release jobs are
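
As a rough illustration of the explicit request mentioned above, the sketch below assembles a gridengine submission command that carries the '-l release=trusty' resource request. Only that request comes from the announcement; the job name, the script path, and the choice to call qsub directly are assumptions made for the example.

```python
import shlex


def trusty_submit_command(script, job_name, release="trusty"):
    """Return the argv for a gridengine submission pinned to one release.

    The '-l release=...' resource request is the part quoted in the
    announcement; the script path and job name are hypothetical.
    """
    return ["qsub", "-l", f"release={release}", "-N", job_name, script]


# Hypothetical tool path and job name, for illustration only.
argv = trusty_submit_command("/data/project/mytool/run.sh", "mytool-job")
print(shlex.join(argv))
```

Building the command as a list and only joining it for display avoids quoting mistakes if the script path ever contains spaces.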

[Labs-l] PAM cleanup on Labs instances

2015-12-02 Thread Marc A. Pelletier
Hello Labs, Today, I've cleaned up and made sane the PAM configuration puppet places on labs instances (by relying on the debian-provided facilities rather than manual overrides). However, instances using a self-hosted puppet master will not have picked the change up. This is essentially

[Labs-l] [tools] Some crontab entries reenabled

2015-11-24 Thread Marc A. Pelletier
Hello Labs, As part of the recovery process of the labstore filesystem crash, crontab for tools had been disabled to prevent partially-restored tools from firing. Those tools that still had entries commented out from that intervention had the cron entries reenabled earlier today. Only entries

[Labs-l] [Labs-announce] Very brief maintenance window for NFS server

2015-11-19 Thread Marc-André Pelletier
Hello Labs, There will be a very brief maintenance window on Monday Nov 23 at 15:00 UTC, to restart the NFS daemon with new settings. There is no expected interruption of service beyond the 90 second NFS grace period, but if something goes wrong with the restart it may take a few minutes to

Re: [Labs-l] [Labs-announce] [Maintenance] NFS Maintenance 2015-10-28 13:30 UTC

2015-10-28 Thread Marc A. Pelletier
On 15-10-26 02:46 PM, Marc-André Pelletier wrote: > https://phabricator.wikimedia.org/T107038 This maintenance window starts now for the next 90 minutes. During that period, the actual suspension of NFS services should last less than 20 minutes at some point near the start of that win

Re: [Labs-l] [Labs-announce] [Maintenance] NFS Maintenance 2015-10-28 13:30 UTC

2015-10-28 Thread Marc A. Pelletier
On 15-10-26 02:46 PM, Marc-André Pelletier wrote: > We are > planning a brief maintenance window on Wednesday, October 28 starting at > 13:30 UTC to return NFS service to the primary server. This maintenance is now complete, and normal operation should be restored to all labs instances

[Labs-l] [Labs-announce] [Maintenance] NFS Maintenance 2015-10-28 13:30 UTC

2015-10-26 Thread Marc-André Pelletier
Hello Labs, As the very last step in recovering completely from our past NFS failure, we are scheduling a switch back to the primary file server (labstore1001) as part of the recovery process. This requires a maintenance window during which NFS service will be briefly unavailable since physical

Re: [Labs-l] Cron job hasn't worked for months

2015-10-22 Thread Marc A. Pelletier
On 15-10-22 12:55 PM, Maximilian Doerr wrote: > What’s with the I/O bandwidth? NFS I/O bandwidth is currently our most contended-for resource in Labs; much effort has been deployed lately to relieve it (with quite a bit of success) but it remains the number we keep our closest eye on. -- Marc

Re: [Labs-l] Cron job hasn't worked for months

2015-10-22 Thread Marc A. Pelletier
On 15-10-21 04:14 PM, David Richfield wrote: > Sorry for hogging disk space! When would I have been warned? If it became an issue, you would have been notified - thankfully, unlike I/O bandwidth, disk /space/ isn't a desperately overused resource. Nevertheless, cleaning up after your code is

Re: [Labs-l] DRMAA on Labs

2015-09-10 Thread Marc A. Pelletier
On 15-09-10 02:13 PM, Chad Horohoe wrote: > It doesn't help that they use the acronym DRM all > over their website :D OMG TDM TLA, FFS! -- Marc ___ Labs-l mailing list Labs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/labs-l

Re: [Labs-l] [Labs-announce] Partial labs downtime Wednesday, 2015-08-12, 15:00 UTC: Reboot of labvirt1001

2015-08-11 Thread Marc A. Pelletier
On 15-08-10 05:41 PM, Maximilian Doerr wrote: How will this affect Cyberbot's continuous scripts? They will be, in practice, stuck in queue waiting for the node to become available again because those jobs will not be selected to run elsewhere. Provided they are generally restartable, the net

[Labs-l] [Labs-announce] Labs NFS outage - 2015-06-17 report

2015-06-24 Thread Marc A. Pelletier
After a long and painful recovery process, here is the incident report for the June 17 outage of Labs NFS: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150617-LabsNFSOutage -- Marc ___ Labs-announce mailing list

Re: [Labs-l] [UPDATE] Recovering lost files

2015-06-21 Thread Marc A. Pelletier
Hello Labs, The fsck of the old filesystem is ongoing, with much of the work done, but it's not immediately clear how long remains. As soon as this is completed, we'll be in a position to selectively restore some of the files you cannot otherwise recover where possible. In order to track all of

Re: [Labs-l] [Labs-announce] [UPDATED] Labs (almost fully!) back up

2015-06-21 Thread Marc-André Pelletier
Hello, Labs, On 15-06-19 06:33 PM, Yuvi Panda wrote: All projects (except maps and mwoffliner) At this time, only the maps project remains to be restored. We ran into difficulties as the limited amount of storage we currently have available while we are still in recovery mode is insufficient

Re: [Labs-l] [NFS outage] Tools is back

2015-06-19 Thread Marc A. Pelletier
On 15-06-19 03:42 PM, Maciej Jaros wrote: Two problems: Both should be fixed. There is a slightly different behaviour with the more recent NFS server kernel that makes it slow down a lot when you add a new filesystem to export, and we had just brought a new filesystem for another recovered

[Labs-l] [NFS outage] Tools is back

2015-06-19 Thread Marc A. Pelletier
Hey all, The tools project is/should be back up now, with three important caveats: * All the files on NFS (in /home and /data/project) have been reverted to their version as of Jun 8 2015 around 14h UTC[1]. * Any crontabs your tools may have have been commented out to avoid the risk of

[Labs-l] [Labs-announce] [NFS outage] Tools is back

2015-06-19 Thread Marc-André Pelletier
Hey all, The tools project is/should be back up now, with three important caveats: * All the files on NFS (in /home and /data/project) have been reverted to their version as of Jun 8 2015 around 14h UTC[1]. * Any crontabs your tools may have have been commented out to avoid the risk of

Re: [Labs-l] NFS outage in progress [UPDATE]

2015-06-17 Thread Marc A. Pelletier
On 15-06-17 09:29 PM, Andrew Bogott wrote: Coren is rebooting and fscking the system -- with luck it'll be up again within the hour. Fortune does not smile upon us; the fsck is taking a LONG time to progress (over a 40T filesystem), but the repair goes apace. There is already indication of

[Labs-l] [Labs-announce] Gridengine master outage of 2015-06-02

2015-06-04 Thread Marc-André Pelletier
Hello Labs, It has been pointed out to me that I never wrote an email pointing to the incident report for the partial Tool Labs outage mentioned in Subject: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150602-gridengine-dns-failure tl;dr: Two distinct name resolution issues

Re: [Labs-l] [Labs-announce] NFS server switch 2014-04-30 19h UTC

2015-04-28 Thread Marc-André Pelletier
On 15-04-28 01:26 PM, Marc-André Pelletier wrote: To be safe, since that procedure has not yet been tested live, I am setting a 30 minute maintenance window during which NFS service may be stalled. Just to be extra paranoid given our past horrible luck with the implicated servers, that window

[Labs-l] [Labs-announce] NFS server switch 2014-04-30 19h UTC

2015-04-28 Thread Marc-André Pelletier
Hello Labs, On 2014-04-30 at 19h UTC we will switch NFS service from one server to the other (labstore1001 - labstore1002). This is to allow upgrading labstore1001 to Debian Jessie - bringing it up to date with our current infrastructure and having a kernel that is more performant for our setup.

[Labs-l] [Labs-announce] A reminder about Terms of Use

2015-04-27 Thread Marc A. Pelletier
Hello Labs, I would like to remind all of you that - in addition to the Labs' Terms of Use[1] to which you have all agreed - it is important that all Labs users also abide by the Terms of Use of any external resource they may be accessing as well as any applicable data reuse licenses. This holds

[Labs-l] Last chance to reboot precise installs in Labs

2015-04-22 Thread Marc A. Pelletier
Hello Labs project maintainers, Today is your last chance to reboot any instances you may be running that have Ubuntu Precise and have not been rebooted since my notification last week! All instances now have a /usr/local/sbin/reboot-if-idmap script installed which, if run by root or with sudo,

Re: [Labs-l] Last chance to reboot precise installs in Labs

2015-04-22 Thread Marc A. Pelletier
On 15-04-22 01:41 PM, Kevin Payravi wrote: I run traffic-grapher, and do see the specified script - do I need to run it, and if so, how can I? You have nothing to do. To make things clear, that requirement to reboot instances only applies to labs project /instances/ (that is, whole virtual

Re: [Labs-l] Node.js updates

2015-04-16 Thread Marc A. Pelletier
On 15-04-16 10:52 AM, Ricordisamoa wrote: but it'd be nice to experiment with the latest technologies... I'd suggest spinning up a new instance for experiments of the sort; this ensures that nobody is depending on a specific version of a package while you play with a more recent one. :-) --

[Labs-l] Rebooting Precise instances

2015-04-15 Thread Marc A. Pelletier
Hello maintainers, If your project currently has instances still running Ubuntu Precise, they will need to be rebooted at some point early next week to finish applying https://phabricator.wikimedia.org/T9 While we will plan a general reboot of all instances still using idmap in the short

[Labs-l] [Project maintainers] Ownership of files on Labs's NFS server

2015-04-10 Thread Marc A. Pelletier
Hello project maintainers, Some of the files that are currently stored on NFS by several projects are currently owned by users whose numerical IDs are not stable between instances, and may cause issues now or in the future term as a consequence of work on:

Re: [Labs-l] [Project maintainers] Ownership of files on Labs's NFS server

2015-04-10 Thread Marc A. Pelletier
On 15-04-10 03:56 PM, Marc A. Pelletier wrote: Please examine the list in the ticket above to see if your project is one of the affected ones and for a list of directories containing possibly problematic files. To clarify, the list is actually on a subtask: https://phabricator.wikimedia.org

[Labs-l] Apr 02 Incident report

2015-04-02 Thread Marc A. Pelletier
Hello, Here is the incident report for the Apr 2, 2015 Labs outage: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150401-LabsNFS-Overload tl;dr: hardware issue required cold boot to recover. -- Marc

Re: [Labs-l] Things appear to be down again.

2015-04-01 Thread Marc A. Pelletier
On 15-04-02 12:54 AM, Bryan White wrote: I'm unable to ssh in. Tried various tool's websites and they are all down. I got a bunch of emails from bigbrother about an hour ago saying it was restarting the webservice. There was a hardware issue currently being recovered from. More news later,

Re: [Labs-l] Things appear to be down again.

2015-04-01 Thread Marc A. Pelletier
On 15-04-02 12:56 AM, Marc A. Pelletier wrote: There was a hardware issue currently being recovered from. More news later, once things have indeed recovered. Things are recovering / have recovered now after a power cycle of the affected hardware. More details will come in the incident report

Re: [Labs-l] Service restored

2015-03-31 Thread Marc A. Pelletier
On 15-03-31 11:12 AM, Nuria Ruiz wrote: Yes, with yesterday's update of puppet this was working fine (same code) before the switchover. I know _joe_ has been working on ganglia in puppet yesterday; when I get a moment I'll try to see if there is a changeset that was merged that could cause this

Re: [Labs-l] Service restored

2015-03-31 Thread Marc A. Pelletier
On 15-03-31 03:28 PM, Nuria Ruiz wrote: Thank You! We will keep an eye on the list. I see an addition by him of the following test: if (hiera('ganglia_class') == 'new') in a changeset[1] which matches the error message you are getting exactly (and adds the requirement to add a ganglia_class

Re: [Labs-l] [Maintenance] RESCHEDULED - Labs NFS storage

2015-03-30 Thread Marc A. Pelletier
On 15-03-24 03:29 PM, Marc A. Pelletier wrote: TL;DR: NFS will be slow for a few days then briefly unavailable on March 26, 2015 at 22:00 UTC (less than five minutes). The copy is finally complete, and switchover between the filesystems will take place today, March 30, at 22:00 UTC

Re: [Labs-l] Simultaneous job limits?

2015-03-27 Thread Marc A. Pelletier
On 15-03-27 05:26 PM, Anthony Di Franco wrote: Are we hitting some policy limit that can be increased, or is it load-dependent? You are. It's possible to increase the limit but that doesn't scale well (as it requires adding exceptions in the rules, which is not well exposed and difficult to

Re: [Labs-l] [Maintenance] DELAYED - Labs NFS storage

2015-03-26 Thread Marc A. Pelletier
On 15-03-24 04:06 PM, Marc-André Pelletier wrote: TL;DR: NFS will be slow for a few days then briefly unavailable on March 26, 2015 at 22:00 UTC (less than five minutes). Due to the copy taking longer than expected, this is delayed for at least 24h (the copy is taking longer than expected mostly

[Labs-l] [Maintenance] Labs NFS storage

2015-03-24 Thread Marc A. Pelletier
TL;DR: NFS will be slow for a few days then briefly unavailable on March 26, 2015 at 22:00 UTC (less than five minutes). Tracked at: https://phabricator.wikimedia.org/T93792 == The good news == backups are coming (back) to the Labs storage, with snapshots into the past. In addition, we will

Re: [Labs-l] Questions regarding the Labs Terms of use

2015-03-13 Thread Marc A. Pelletier
On 15-03-13 02:30 PM, Rainer Rillke wrote: What to do if file permissions are set in a way one can't see the code... or, as suggested it is compiled and/or obfuscated. Would that be against Labs' spirit of promoting open source? It is, and isn't. Part of the issue is that we do not mandate

Re: [Labs-l] DB errors: tools.shuaib-bot

2015-02-27 Thread Marc A. Pelletier
On 15-02-27 05:45 AM, Ricordisamoa wrote: What does the username does not exist mean? Database users are not shell account usernames. The tool is mistakenly using its own username rather than the one provided for it in ~/replica.my.cnf -- Marc
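
To make the point about ~/replica.my.cnf concrete, here is a minimal sketch of reading the database credentials from the text of such a file instead of reusing the shell account name. It assumes the file follows the usual MySQL option-file layout with a [client] section holding user and password keys; the sample user name and password are made up.

```python
import configparser


def read_replica_credentials(text):
    """Parse MySQL option-file text and return (user, password).

    Assumes a [client] section with 'user' and 'password' keys, which is
    the common option-file layout; the real file may differ.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    client = parser["client"]
    # Option-file values are sometimes wrapped in quotes; strip them.
    user = client["user"].strip("'\"")
    password = client["password"].strip("'\"")
    return user, password


# Made-up file contents standing in for ~/replica.my.cnf.
sample = "[client]\nuser = s12345\npassword = 'not-a-real-password'\n"
db_user, db_password = read_replica_credentials(sample)
print(db_user)
```

In a real tool the text would come from reading the file in the tool's home directory, and the resulting pair would be handed to whatever database client library the tool uses.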

Re: [Labs-l] Partial outage in progress -- update

2015-02-17 Thread Marc A. Pelletier
On 15-02-17 01:49 PM, Andrew Bogott wrote: All affected instances will start back up one by one over the next hour or so. In particular, the tool labs web proxy is back up, and web services that were running on unaffected nodes should be back up. That said, until the evacuation of instances

Re: [Labs-l] File server problems?

2015-02-04 Thread Marc A. Pelletier
On 15-01-26 04:11 AM, Daniel Naber wrote: Has this been tested with Tomcat? There are currently no tomcat nodes running Trusty; what you see with qstat is the gridengine patiently waiting for one to become available. :-) -- Marc

Re: [Labs-l] ToolLabs back up

2015-02-04 Thread Marc A. Pelletier
On 15-02-04 03:13 PM, Gerard Meijssen wrote: part of the problem is that staffing does not consider 24*7*7 support I'd love to offer round-the-clock support, Gerard, but we simply do not have the staff to do this. Between the three of us, we cover about 12-14 hours 5 days a week (given

[Labs-l] Toolserver redirects

2014-12-31 Thread Marc A. Pelletier
Hey labs, Currently, the largest number of requests hitting toolserver.org (and also the single biggest source of 404s) is at URIs starting with /tiles/ for requests in the form of: /tiles/hikebike/8/205/135.png Those were not supplied to me in the list of user redirects (I expect because

Re: [Labs-l] Filesystem downtime to schedule

2014-12-31 Thread Marc A. Pelletier
On 14-12-31 03:45 PM, Chris McMahon wrote: , and Mon/Tue/Wed are the busiest times for beta labs in the weekly deploy schedule Yep. That's exactly the kind of feedback I need to find the point of least disruption where to schedule that maintenance window. :-) -- Marc

Re: [Labs-l] Memory limits for tools

2014-12-29 Thread Marc A. Pelletier
On 14-12-28 06:18 AM, Merlijn van Deen wrote: I'd suggest to just increase the memory limit for your grid jobs; [...] The Grid will take care of making sure the servers don't get overwhelmed. That's true in general, but I should request that you take some care setting the limit as low as you

Re: [Labs-l] [Toolserver-l] Redirects for toolserver.org moved to WMF

2014-12-22 Thread Marc A. Pelletier
On 14-12-21 06:44 AM, Maarten Dammers wrote: These redirects worked a couple of weeks ago. Can you please have a look why these redirects stopped working? Also, where in git is the redirect configuration? It is not in git for privacy reasons (the configuration also has email redirection,

[Labs-l] Accidental reboot of tools-login

2014-12-19 Thread Marc A. Pelletier
Hello all, Please accept my apologies for the interruption of work you may have been doing on tools-login; I accidentally issued a reboot command /there/ instead of on the jessie test instance I was working on. It came back up nearly immediately, of course, but any work in progress will

Re: [Labs-l] HHVM on ToolLabs?

2014-12-17 Thread Marc A. Pelletier
On 14-12-17 08:09 AM, Yuvi Panda wrote: I'm wondering if any bot/tool authors would be interested in helping me experiment with HHVM on toollabs. The admin tool (the one that serves the landing page and status page) might be a decent candidate; it's fairly simple and well-contained, but has

Re: [Labs-l] Redirects for toolserver.org moved to WMF

2014-12-11 Thread Marc A. Pelletier
On 14-12-11 04:39 AM, Silke Meyer wrote: Coren, am I right in assuming that people contact you directly if there are any changes needed? Yes, that is correct. -- Marc

Re: [Labs-l] Question: how do we upgrade our labs nodes to trusty?

2014-12-04 Thread Marc A. Pelletier
On 14-12-04 11:13 AM, Nuria Ruiz wrote: Hello, How do we upgrade our labs nodes to trusty? Can we upgrade the node we currently have or do we need to spawn a new node? It's unlikely that an upgrade would succeed; I recommend you simply spin up a new node which is likely to be both faster and

Re: [Labs-l] Using data from a tool in a Wikimedia site

2014-11-28 Thread Marc A. Pelletier
On 11/28/2014 06:30 AM, Darkdadaah wrote: but it would be much better I think for users to have it directly integrated in a page in the fr.wiktionary site (e.g. [[Wiktionnaire:Recherche avancée]]). This would only require the use of a dedicated Gadget on the site. Part of the reason why the

Re: [Labs-l] Run your web tools on Trusty

2014-11-27 Thread Marc A. Pelletier
On 11/27/2014 06:04 PM, Magnus Manske wrote: Yes, except the webservice I started there is unkillable (for me), You can use the -f option to qdel to forcibly remove a job you know is dead (the instance being dead is a pretty good way to be sure). At any rate, I just did so for you now. -- Marc

Re: [Labs-l] Tool labs replicas are missing the indexes?

2014-11-11 Thread Marc A. Pelletier
On 11/11/2014 02:49 PM, Giovanni Luca Ciampaglia wrote: It's VERY unfortunate that explain does not work -- how am I supposed to debug my queries then? No explain privilege = more unoptimized queries = more queries will be killed = more users will be unhappy = less people will use the LabsDB.

Re: [Labs-l] Database Indexing Doesn't seem to be working

2014-11-09 Thread Marc A. Pelletier
On 11/09/2014 02:00 PM, John wrote: revision user index is indexed on user name not user ID It should be on both. That may be an actual issue; could you open a bz to make sure our DBA sees it? -- Marc

Re: [Labs-l] Outage of labs in progress (resolved)

2014-11-06 Thread Marc A. Pelletier
On 11/06/2014 02:38 PM, Andrew Bogott wrote: Coren will follow up shortly with a full description of the problem and advice about what (if anything) you may need to do to resurrect your jobs. Hello again. NFS is back online. The short story is that around 16:25, the NFS server went down

Re: [Labs-l] Outage of labs in progress (resolved)

2014-11-06 Thread Marc A. Pelletier
On 11/06/2014 05:13 PM, Pine W wrote: It will be interesting to see the post-action report and recommendations for prevention, if possible. There is, in the end, very little that can be done to prevent freak failures of the sort; they are thankfully rare but basically impossible to predict.

Re: [Labs-l] Unable to delete jobs

2014-11-03 Thread Marc A. Pelletier
On 11/03/2014 05:07 AM, Daniel Naber wrote: Is this supposed to work for webgrid jobs like Tomcat, too? Supposed to is the keyword here; I had checked that the webgrid queue was switched but forgot the tomcat queue. -- Marc

Re: [Labs-l] speedydeletion.wikia.com

2014-11-03 Thread Marc A. Pelletier
On 11/03/2014 07:15 AM, Hasteur Wikipedia wrote: Second, since this is some sort of bot process, has it ever passed through BRFA as it appears to be using a non trivial amount of resources. There are two points here: (a) bots that do not /edit/ enwiki have never required BRFA approval, so

Re: [Labs-l] update on labsdb replica sync issues

2014-11-03 Thread Marc A. Pelletier
On 11/03/2014 05:27 PM, Sean Pringle wrote: 1. We need some memory and time limits for user queries. Memory usage is easy to track server-side on a per-client basis, but users may find it difficult to predict or understand why specific queries trip some arbitrary memory limit. So, just time

Re: [Labs-l] Unable to delete jobs

2014-11-02 Thread Marc A. Pelletier
On 11/01/2014 05:28 PM, Rohit Dua wrote: For the past few days, I have been unable to delete my tool jobs on the grid using jstop or qdel -f via ssh. The jobs tend to go into dr mode (and keep running). This has stopped my tool from working (as I need to stop/restart it with control). There was an

Re: [Labs-l] [Ops] Slowing down puppet runs?

2014-11-01 Thread Marc A. Pelletier
On 10/31/2014 01:04 PM, Andrew Bogott wrote: delays in the client run shouldn't really make a difference to anyone who isn't actively debugging puppet and running things over and over by hand No, but a quick look at every graph of every instance shows significant spikes in resource usage every

Re: [Labs-l] Google bot

2014-10-27 Thread Marc A. Pelletier
On 10/25/2014 07:37 PM, Nuria wrote: Much agree with these recommendations. Personally, I have no beef with it either - but filtering at the proxy level means this necessarily happens to every tool with no opportunity to do it differently per-tool so we probably don't want to be overly sensitive

[Labs-l] Instances with Salt errors

2014-10-27 Thread Marc A. Pelletier
[ posted on behalf of Ariel ] Hello folks, As I was updating salt across labs instances I came across a number of instances that had various errors. Here's a little list, hopefully folks can decide what to do about them. The ones that say 'ERROR' have nova/openstack errors as opposed to a

Re: [Labs-l] Enwiki database corruption

2014-10-23 Thread Marc A. Pelletier
On 10/22/2014 09:25 PM, Nuria Ruiz wrote: I imagine that people are pretty swamped but ...is anyone working on the issue with lack of data in labs? Yes. Sean, our DBA, is currently hard at work investigating the cause and possible solutions. I expect he'll emerge with news in short order. --

Re: [Labs-l] job won't stop; state dr

2014-10-22 Thread Marc A. Pelletier
On 10/22/2014 06:12 AM, Amir Ladsgroup wrote: For me it doesn't matter whether the job is once or continuous; it won't stop with qdel, and it becomes annoying since sometimes I'm killing a malfunctioning task and it continues and I can do nothing. It would seem that the new gentler way of killing jobs to be

Re: [Labs-l] Many DB connections - ideas?

2014-10-20 Thread Marc A. Pelletier
On 10/20/2014 11:40 AM, Magnus Manske wrote: Anyone have ideas about how to make this faster/more scalable? As others have pointed out in the thread, opening a connection to any of the slices in fact now gets you all databases (and that will remain true for the foreseeable future). If you want
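
One way to take advantage of a single connection reaching every database is to qualify table names with the database instead of reconnecting per wiki. The sketch below only builds the query strings; the helper names are hypothetical, and the '_p' suffix for replica database names is an assumption based on the naming mentioned elsewhere in this archive.

```python
def qualify(database, table):
    """Return a backtick-quoted `database`.`table` reference."""
    def quote(name):
        # Double any embedded backtick, as MySQL identifier quoting requires.
        return "`" + name.replace("`", "``") + "`"
    return quote(database) + "." + quote(table)


def count_query(wiki, table="page"):
    """Build a row-count query for one wiki's replica database.

    The '<wiki>_p' database name is an assumption for illustration.
    """
    return f"SELECT COUNT(*) FROM {qualify(wiki + '_p', table)}"


# All of these statements could be sent over the same open connection.
for wiki in ("enwiki", "fawiki"):
    print(count_query(wiki))
```

With qualified names, a tool can keep one connection open and iterate over wikis, rather than paying the connection setup cost for each one.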

Re: [Labs-l] Google bot

2014-10-19 Thread Marc A. Pelletier
On 10/19/2014 03:50 PM, Magnus Manske wrote: I vaguely remember that indexing bots (like the Google one) were filtered out by Labs already? They were, for some time, but then I got some fairly vehement protestations that tools being unindexed by Google was a problem. -- Marc

Re: [Labs-l] Queue down

2014-10-11 Thread Marc A. Pelletier
On 10/11/2014 12:13 AM, Bryan White wrote: The queue seems to be dead. No cron jobs have started for at least 4 hours. Anything I try, I receive: There were two corrupt entries in the job database, one of which outright /killed/ the gridengine master. I was able to purge both entries, and

[Labs-l] Slight change on how jobs are ended

2014-10-11 Thread Marc A. Pelletier
Hello all, In order to fix a few problems with the way jobs are ended, I have changed the gridengine settings on how jobs are terminated: tl;dr: If you don't know what a signal handler is or never use them, you probably can ignore this email entirely and nothing will visibly change for you.
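
For readers unsure what a signal handler looks like in practice, here is a minimal sketch of a job that notes a termination request and stops at the next safe point. Which signals gridengine actually sends is not stated in the excerpt above, so handling SIGTERM and SIGINT here is an assumption; the class and function names are hypothetical.

```python
import signal


class ShutdownFlag:
    """Records that a termination signal arrived, without exiting at once."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Only set a flag; the main loop decides when it is safe to stop.
        self.requested = True

    def install(self, signums=(signal.SIGTERM, signal.SIGINT)):
        # Assumed signal set; adjust to whatever the scheduler really sends.
        for signum in signums:
            signal.signal(signum, self.handle)


def run(work_items, flag):
    """Process items until done or until a shutdown has been requested."""
    done = []
    for item in work_items:
        if flag.requested:
            break
        done.append(item)
    return done


if __name__ == "__main__":
    shutdown = ShutdownFlag()
    shutdown.install()
    print(run(["a", "b", "c"], shutdown))
```

Setting a flag rather than exiting inside the handler lets the job finish the current unit of work and clean up before it ends.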

Re: [Labs-l] role::mail::(sender|mx) conflict

2014-10-07 Thread Marc A. Pelletier
On 10/07/2014 01:07 PM, Jeff Green wrote: To me the most logical approach would be to remove role::mail::sender from role::labs::instance The problem with that is that any labs instance that is not a role::mail::mx *must* be a role::mail::sender to avoid a number of issues with cronspam with

Re: [Labs-l] Resolved: Labs and toollabs outage in progress

2014-10-07 Thread Marc A. Pelletier
On 10/07/2014 07:53 PM, Andrew Bogott wrote: As for which jobs died -- that's a question for someone with better grid skills than me :) Since tools-master was not affected, continuous jobs will have been requeued and will reschedule once the actual nodes return to full health (which should be

[Labs-l] A tale of three databases

2014-09-23 Thread Marc A. Pelletier
[Or; an outage report in three acts] So, what happened over the last couple of days that have caused so many small issues with the replica databases? In order to make that clear, I'll explain a bit how the replicas are structured. At the dawn of time, the production replicas were set up as a

Re: [Labs-l] [COMPLETED] Database maintenance 2014-09-19 13:30 UTC

2014-09-22 Thread Marc A. Pelletier
On 09/22/2014 09:58 AM, Yuvi Panda wrote: +1, Quarry has the exact same issue, even with new iptables conf. Fixing this is my priority this morning. Moar news coming in soon. -- Marc

Re: [Labs-l] [COMPLETED] Database maintenance 2014-09-19 13:30 UTC

2014-09-22 Thread Marc A. Pelletier
On 09/22/2014 09:59 AM, Marc A. Pelletier wrote: Fixing this is my priority this morning. Moar news coming in soon. This should now be fixed (thank you Sean!) and all databases are now merged into the single c3 database so that pointing to what was s3, s6, and s7 should lead back to the same

Re: [Labs-l] [COMPLETED] Database maintenance 2014-09-19 13:30 UTC

2014-09-22 Thread Marc A. Pelletier
On 09/22/2014 10:39 PM, Brad Jorsch (Anomie) wrote: tools.anomiebot@tools-login:~$ sql fawiki ERROR 1045 (28000): Access denied for user 's51055'@'localhost' (using password: YES) Make sure to ask for a db in format of wiki_p You caught the database between copy done and grants done. It

Re: [Labs-l] [COMPLETED] Database maintenance 2014-09-19 13:30 UTC

2014-09-20 Thread Marc A. Pelletier
On 09/20/2014 09:03 AM, Yuvi Panda wrote: This seems to have caused problems for non-tools projects using the database slaves. Oh. Sorry, I should have noted that other projects need an updated iptables.conf and hosts file! -- Marc

Re: [Labs-l] [COMPLETED] Database maintenance 2014-09-19 13:30 UTC

2014-09-19 Thread Marc A. Pelletier
On 09/18/2014 11:41 AM, Marc A. Pelletier wrote: We have to move one of the replica databases physically between two racks tomorrow, an intervention that should take around an hour. This maintenance is now completed, with the database snugly housed in its new rack and back in operation

[Labs-l] Database maintenance 2014-09-19 13:30 UTC

2014-09-18 Thread Marc A. Pelletier
Hello all, We have to move one of the replica databases physically between two racks tomorrow, an intervention that should take around an hour. During that period, one of the replica databases (that which held the s1 replica (enwiki)) will be unavailable. The other two replicas are unaffected,

Re: [Labs-l] A proposal for better tool discoverability

2014-09-15 Thread Marc-André Pelletier
On 08/14/2014 10:06 AM, Marc-André Pelletier wrote: On 08/14/2014 09:23 AM, Hay (Husky) wrote: And obviously, it would be awesome if the list would be so good that we could replace tools.wmflabs.org. I don't know about /replacing/ it, but if we used that new metadata to improve

Re: [Labs-l] IMPORTANT: mothballed instances, marked for death!

2014-09-12 Thread Marc-André Pelletier
On 09/12/2014 07:06 PM, Maciej Jaros wrote: How do you define touched? Modified? Maybe the project is simply working fine and doesn't need modifications. Touched, in this context, means even just /starting/ the instance. The mothballed instances have been entirely off since the migration and

Re: [Labs-l] Tool Labs slow + MySQL connection errors

2014-09-02 Thread Marc-André Pelletier
On 09/02/2014 02:09 PM, Bryan White wrote: A few weeks ago, the database problems were only related to tools-webgrid-04. I just stopped and started the webserver to have it run on a different webgrid and things started running smoothly again. Not sure which webgrid is which, but when doing

Re: [Labs-l] Current lab problems

2014-08-23 Thread Marc-André Pelletier
On 08/23/2014 03:42 AM, Bryan White wrote: All told, dumps has been unusable for me for 6 of the past 12 months. Can some data be deleted or moved from /public/dumps so it becomes usable again? I've seen no updates. What is the status of when things will be fixed? The short answer: it took

Re: [Labs-l] Look-and-listen-map on Tool-labs

2014-08-20 Thread Marc-André Pelletier
On 08/20/2014 05:03 AM, Peter Wendorff wrote: - the Lalm is not a tool for Wikipedia itself, it's an experimental portal for blind and visually impaired people to navigate the OpenStreetMap data. Labs is made available for work for the Wikimedia projects, but also for allied and related projects.

Re: [Labs-l] A proposal for better tool discoverability

2014-08-15 Thread Marc-André Pelletier
On 08/15/2014 06:55 AM, Hay (Husky) wrote: Marc-André wrote: I would actually prefer that the json file live in the tool's home, and will shortly provide a means by which this can be fetched by HTTP. Not all tools provide web interfaces, and metadata about those would be just as useful.

Re: [Labs-l] A proposal for better tool discoverability

2014-08-14 Thread Marc-André Pelletier
On 08/14/2014 09:23 AM, Hay (Husky) wrote: And obviously, it would be awesome if the list would be so good that we could replace tools.wmflabs.org. I don't know about /replacing/ it, but if we used that new metadata to improve it that would be +good. Right now, it uses a number of ugly

Re: [Labs-l] A proposal for better tool discoverability

2014-08-14 Thread Marc-André Pelletier
On 08/13/2014 11:15 AM, Hay (Husky) wrote: If you have a web-hosted tool, simply stick it in the root of your tools directory so that it's reachable by the crawler. Whenever your tool data changes, just update the file and the directory will automatically update the directory site. I would

Re: [Labs-l] Lighttpd insight I. - proposal for change of default config

2014-07-22 Thread Marc-André Pelletier
On 07/22/2014 05:05 PM, Hedonil wrote: Also a proposal to change the default lighttpd settings. Most excellent work Hedonil; your stellar work to fine-tune performance is going to be appreciated by everyone. :-) -- Marc

Re: [Labs-l] Bigbrother is watching...

2014-07-16 Thread Marc-André Pelletier
On 07/16/2014 04:09 PM, Brad Jorsch (Anomie) wrote: I guess I'll finally have to learn the difference between 'jstart' and 'qsub' syntax. tl;dr: there are relatively few differences except that j{sub|start} provides a handful of sane defaults to what is ultimately just a qsub invocation, and

[Labs-l] On disk use

2014-07-11 Thread Marc A. Pelletier
Hey all. So, a quick reminder to every labs user: project space (/data/project) is on a networked drive. While it provides a lot of space and is conveniently accessible to all instances of a project, using it /does/ incur a performance cost. Whenever a service you are running on a labs instance

Re: [Labs-l] Webservice

2014-07-11 Thread Marc-André Pelletier
On 07/11/2014 02:30 PM, Petr Bena wrote: nope, just once a minute, crontab doesn't handle seconds, it wouldn't fire up anything but the check if it's running I'm currently looking at some system by which tool maintainers may specify automatic restart scripts that are sufficiently robust for

[Labs-l] 2014-15 Wikimedia goals for Labs

2014-06-10 Thread Marc A. Pelletier
Hello everyone, WMF Engineering has just published its draft annual goals for engineering on mediawiki.org[1]; and it includes a nifty section for planned objectives for Labs in the coming year. Please leave any comments/questions on the talk page; are there glaring omissions or mistakes?

[Labs-l] Puppet master upgrade

2014-06-05 Thread Marc A. Pelletier
Hello all, We have (just now) upgraded the Labs puppet masters to version 3; and things seem to be going well. In the past several months, members of the operations team and volunteers[1] have done a _lot_ of work to ensure puppet manifests would be compatible with puppet 3 and compare the

Re: [Labs-l] Puppet master upgrade

2014-06-05 Thread Marc-André Pelletier
Lost footnote: On 06/05/2014 03:39 PM, Marc A. Pelletier wrote: In the past several months, members of the operations team and volunteers[1] [1] With especial thanks to Matanya, who has been doing an incredible amount of gnomish work to fix 2.7 dependencies in the past months. -- Marc

Re: [Labs-l] NFS server network capacity upgrade

2014-05-30 Thread Marc-André Pelletier
On 05/23/2014 12:28 PM, Marc A. Pelletier wrote: While this is not set in stone, I am aiming for Friday, May 30 at 18:00 UTC for the downtime. As a reminder, this is still scheduled for today, approximately one hour from now. Filesystem access will be disrupted for a period of 10-20 minutes

Re: [Labs-l] Unable to get

2014-05-30 Thread Marc-André Pelletier
On 05/30/2014 02:28 PM, James Alexander wrote: I have no idea how it sent... [I randomly heard a mail sent noise from my phone putting it into my pocket...] The interesting question isn't so much how did random presses on the phone manage to send /an/ email, but how it managed to do so with a

Re: [Labs-l] Full Text Reference Tool: Approved exposing of ip addresses to an external API

2014-05-29 Thread Marc-André Pelletier
On 05/29/2014 09:59 AM, Jake Orlowitz wrote: Are there any approved exceptions where user IP addresses could be exposed on Tool Labs (say, if WMF said it was ok)? Would this be technically possible? Ostensibly, yes -- that is, there is no *prohibition* against doing so, but the current setup makes
