Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-12-10 Thread Andrew Otto
FYI, this isn't for Cloud Services, but we've got something sorta similar for internal analytics replicas. https://github.com/wikimedia/analytics-refinery/blob/master/bin/analytics-mysql https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/util.py#L135-L254 On Tue, Dec 8,
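
For illustration, a minimal sketch of the same idea (not the linked refinery code): resolve a wiki dbname to its section by fetching the sN.dblist files from noc.wikimedia.org. The URL pattern and the list of sections are assumptions based on this thread.

import urllib.request

DBLIST_URL = "https://noc.wikimedia.org/conf/dblists/{section}.dblist"
SECTIONS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]

def load_section_map():
    """Map dbname -> section, e.g. 'enwiki' -> 's1'."""
    mapping = {}
    for section in SECTIONS:
        with urllib.request.urlopen(DBLIST_URL.format(section=section)) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    mapping[line] = section
    return mapping

if __name__ == "__main__":
    sections = load_section_map()
    print(sections.get("enwiki"))        # expected 's1'
    print(sections.get("wikidatawiki"))  # expected 's8'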

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-12-08 Thread MusikAnimal
Hello again! Thinking about this more, I'm wondering if it makes sense to have a tool to assist with parsing the dblists at noc.wikimedia.org. I know the official recommendation is not to connect to slices, but the issue is how to work locally. I alone maintain many tools that are capable of

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-23 Thread Nicholas Skaggs
Amir, in case you hadn't seen it, your memory is correct. This was considered in the past. See https://phabricator.wikimedia.org/T215858#6631859. On Tue, Nov 17, 2020 at 2:47 PM Amir Sarabadani wrote: > Hello, > Actually Jaime's email gave me an idea. Why not have a separate actual > data

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-19 Thread Joaquin Oltra Hernandez
Hey MA, I personally think given your knowledge and experience and what GUC and XTools Global Contribs do, your approach of using those implementation details to get better performance makes sense. The outline you present is very clear and seems reasonable to me. You also mention programmatically

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-19 Thread David Caro
On a bit of a side note: for forwarding many IPs/ports through ssh, a tool like sshuttle [1] might be interesting; I've used it in the past with success. It's a bit complex: it uses an ssh tunnel + iptables/pf/... rules to move the traffic through that tunnel, and a process running on the other
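
As a rough sketch of the kind of setup being suggested, sshuttle can be pointed at a remote ssh host and a subnet to route through it. The bastion host and subnet below are placeholders, not real Cloud Services values.

import subprocess

BASTION = "login.toolforge.org"      # placeholder bastion host
REPLICA_SUBNET = "172.16.0.0/21"     # placeholder subnet for the replica hosts

# Route all traffic destined for REPLICA_SUBNET through an ssh connection to BASTION.
subprocess.run(["sshuttle", "-r", BASTION, REPLICA_SUBNET], check=True)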

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-18 Thread MusikAnimal
Unrelated but important question (sorry to fragment this thread): What about the max number of connections imposed on a db user? That applies to each open connection, right? For example, we still occasionally hit it in XTools, meaning there are 30 open connections and no new connections can be

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-18 Thread MusikAnimal
Hello Joaquin! Hey MA, I've checked, and while not explicitly disallowed, the fact that > this could work is more of an implementation detail that shouldn't really > be relied on. > > The sections and where the instances are on them are organized to maintain > the service, and are not supposed to

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Brooke Storm
ACN: Thanks! We’ve created a ticket for that one to help collaborate and surface the process here: https://phabricator.wikimedia.org/T267992 Anybody working on that, please add info there. Brooke Storm Staff SRE Wikimedia Cloud Services

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread AntiCompositeNumber
I took a look at converting the query used for GreenC Bot's Job 10, which tracks enwiki files that "shadow" a different file on Commons. It is currently run daily, and the query executes in about 60-90 seconds. I tried three methods to recreate that query without a SQL cross-database join. The
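
One way such a query could be recreated in the application layer is sketched below: fetch the local file titles from enwiki, then probe commonswiki in batches for files with the same name. This is only an illustration of the general idea, not one of the methods actually tried or what GreenC Bot runs; the hostnames and credential file are assumptions.

import os
import pymysql

def connect(host, db):
    # ~/replica.my.cnf is where Toolforge keeps replica credentials
    return pymysql.connect(host=host, database=db,
                           read_default_file=os.path.expanduser("~/replica.my.cnf"),
                           charset="utf8mb4")

def shadowed_files(batch_size=1000):
    enwiki = connect("enwiki.analytics.db.svc.eqiad.wmflabs", "enwiki_p")
    commons = connect("commonswiki.analytics.db.svc.eqiad.wmflabs", "commonswiki_p")

    # Query 1: every locally uploaded file name on enwiki.
    with enwiki.cursor() as cur:
        cur.execute("SELECT img_name FROM image")
        local_names = [row[0] for row in cur.fetchall()]

    # Query 2 (batched): which of those names also exist on Commons.
    shadowed = []
    with commons.cursor() as cur:
        for i in range(0, len(local_names), batch_size):
            batch = local_names[i:i + batch_size]
            placeholders = ",".join(["%s"] * len(batch))
            cur.execute(
                "SELECT img_name FROM image WHERE img_name IN (%s)" % placeholders,
                batch)
            shadowed.extend(row[0] for row in cur.fetchall())
    return shadowed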

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Jaime Crespo
So I think there is something here. Different people have different needs; so far the number one need for wikireplicas comes from those who need underlying access to an "almost real time" copy of the internal database structure, as is. This is based on the fact that latency is the most common

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Maarten Dammers
Hi Joaquin, On 16-11-2020 21:42, Joaquin Oltra Hernandez wrote: Hi Maarten, I believe this work started many years ago, and it was paused, and recently restarted because of the stability and performance problems in the last years. You do realize the current setup was announced as new 3 years

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Kimmo Virtanen
> ... As far as I could see in the docs, connection reusing and cross-DB joins are not documented or advertised ... Not sure what you are talking about. The cross-DB joins were key features of ToolServer[1], which Wikimedia Labs replaced in 2012-2014, and those were just not included as initial features of

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-16 Thread Nicholas Skaggs
Kimmo, while I can't directly answer your question on bottlenecks, I will try and provide a little background information on existing issues for those who are new (like myself!). Here's a recent example of replication issues with the current setup:

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-16 Thread Huji Lee
Yes, I have a Toolforge account and there are a bunch of cronjobs that run weekly (and a few that run daily). The code can be found at https://github.com/PersianWikipedia/fawikibot/tree/master/HujiBot where stats.py is the program that actually connects to the DB, but weekly.py and weekly-slow.py

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-16 Thread Joaquin Oltra Hernandez
Hey MA, I've checked, and while not explicitly disallowed, the fact that this could work is more of an implementation detail that shouldn't really be relied on. The sections and where the instances are on them are organized to maintain the service, and are not supposed to be depended on since

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-16 Thread Joaquin Oltra Hernandez
Hi Maarten, I believe this work started many years ago, and it was paused, and recently restarted because of the stability and performance problems in the last years. Breaking changes are always painful, in this case of the replicas I think the changes follow the recommendations laid out years

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-16 Thread Joaquin Oltra Hernandez
Moving the joins to the application layer definitely makes things quite complex compared to an SQL query. Having a data lake or other solutions like you mention makes it more feasible to do these kinds of joins with big data, but it also usually requires careful schema and index design when

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-14 Thread Huji Lee
I like the idea of dumps as an alternative too. But I think this should be a service that is offered via the WM Clouds. Some might remember me asking related questions on this very mailing list several months ago. Having a DB called "latest_dump" which actually has the latest dump of all wikis

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-14 Thread Amir Sarabadani
Hello, I actually welcome the change and am quite happy about it. It might break several tools (including some of mine) but as a database nerd, I can see the benefits outweighing the problems (and I wish benefits would have been communicated in the announcement). The short version is that this

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-13 Thread Kimmo Virtanen
As a follow up comment. If I understand correctly the main problems are a) databases are growing too big to be stored in single instances and b) query complexity is growing. a) the growth of the data is not going away as the major drivers for the growth are automated edits from Wikidata and

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-12 Thread Kimmo Virtanen
> Maarten: Having 6 servers with each one having a slice + s4 (Commons) + s8 (Wikidata) might be a good compromise. > Martin: Another idea is to have the database structured as-planned, but add a server with *all* databases that would be slower/less stable, but will provide a solution for those

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-12 Thread Maarten Dammers
I recall some point in time (Toolserver maybe?) when all the slices (overview at https://tools-info.toolforge.org/?listmetap ) were at different servers, but the Commons slice (s4) was on every server. At some point new fancy database servers were introduced with all the slices on all servers.

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-11 Thread John
I’ll throw my hat in this too. Moving it to the application layer will make a number of queries just not feasible any longer. It might make sense from the administration side, but from the user perspective it breaks one of the biggest features that toolforge has. On Wed, Nov 11, 2020 at 6:40 PM

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-11 Thread Martin Urbanec
MusikAnimal is right, however, Wikidata and Commons either have a sui generis slice, or they share it with a few very large wikis. Tools that do any kind of crosswiki analysis would instantly break, as most of them utilise joining by Wikidata items at the very least. I second Maarten here. This

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-11 Thread Huji Lee
On some level, the real issue here is that different wikis live on different slices (s1, s2, s3). One possible solution is to replicate "shared" wikis (Wikidata and Commons) and possibly a few other "mother" wikis (at least En WP) into *every* slice. The use cases that need to join

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-11 Thread MusikAnimal
Technically, cross-wiki joins aren't completely disallowed, you just have to make sure each of the db names are on the same slice/section, right? ~ MA On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers wrote: > Hi Joaquin, > On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote: > > TLDR: Wiki

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-11 Thread Maarten Dammers
Hi Joaquin, On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote: TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1]

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-10 Thread AntiCompositeNumber
Most cross-db JOINs can be recreated using two queries and an external tool to filter the results. However, there are some queries that would be simply impractical due to the large amount of data involved, and the query for overlapping local and Commons images is one of them. There are basically
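
In pure-Python terms, the two-query pattern looks roughly like this: run one query per database, then do the "join" client-side, e.g. as a set intersection on the shared key. The rows below are illustrative only, not real replica data.

def client_side_join(rows_a, rows_b, key_a=0, key_b=0):
    """Return rows from rows_a whose key also appears in rows_b."""
    keys_b = {row[key_b] for row in rows_b}
    return [row for row in rows_a if row[key_a] in keys_b]

# e.g. file titles fetched separately from enwiki and commonswiki
enwiki_rows = [("Foo.jpg",), ("Bar.png",)]
commons_rows = [("Bar.png",), ("Baz.svg",)]
print(client_side_join(enwiki_rows, commons_rows))  # [('Bar.png',)]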

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-10 Thread MusikAnimal
Got it. The https://noc.wikimedia.org/conf/dblists/ lists are plenty fast and easy enough to parse. I'll just cache that. It would be neat if we could rely on the slice specified in meta_p in the future, as in my case we have to query meta_p.wiki regardless, but not a big deal :) Thank you! I
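
A possible shape for that cache is sketched below; the file location and the one-day TTL are arbitrary choices for illustration.

import json
import time
import urllib.request
from pathlib import Path

CACHE = Path("dblists_cache.json")
TTL = 24 * 3600  # refetch at most once a day

def cached_dblist(section):
    """Return the list of wikis in a section, using a local file cache."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    entry = cache.get(section)
    if entry and time.time() - entry["fetched"] < TTL:
        return entry["wikis"]
    url = "https://noc.wikimedia.org/conf/dblists/%s.dblist" % section
    with urllib.request.urlopen(url) as resp:
        wikis = [line.strip() for line in resp.read().decode("utf-8").splitlines()
                 if line.strip() and not line.startswith("#")]
    cache[section] = {"fetched": time.time(), "wikis": wikis}
    CACHE.write_text(json.dumps(cache))
    return wikis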

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-10 Thread Gergo Tisza
On Tue, Nov 10, 2020 at 1:15 PM MusikAnimal wrote: > Hi! Most tools query just a single db at a time, so I don't think this > will be a massive problem. However some such as Global > Contribs[0] and GUC[1] can theoretically query all of them from a single > request. Creating new connections

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-10 Thread Brooke Storm
Hi MA, You could still accomplish the local environment you are describing by using 8 ssh tunnels. All the database-name DNS aliases eventually reference the section names (s1, s2, s3, s4, in the form of s1.analytics.db.svc.eqiad.wmflabs, etc.). An app could be written to connect to the
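
A sketch of what such an app could look like once the tunnels are up. The local port mapping is an arbitrary choice, and the tunnels themselves would be opened separately, e.g. one "ssh -L 3307:s1.analytics.db.svc.eqiad.wmflabs:3306 <bastion>" per section; the bastion host is left as a placeholder.

import os
import pymysql

# Local port per section, assuming tunnels like
#   ssh -L 3307:s1.analytics.db.svc.eqiad.wmflabs:3306 <bastion>
# have already been opened (port numbers are an arbitrary choice).
LOCAL_PORTS = {"s%d" % i: 3306 + i for i in range(1, 9)}  # s1 -> 3307 ... s8 -> 3314

def connect_local(dbname, section):
    """Connect to dbname's replica through the tunnel for its section."""
    return pymysql.connect(
        host="127.0.0.1",
        port=LOCAL_PORTS[section],
        database=dbname + "_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )

# enwiki lives on s1, so it goes through the s1 tunnel on port 3307
conn = connect_local("enwiki", section="s1")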

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-10 Thread MusikAnimal
Hi! Most tools query just a single db at a time, so I don't think this will be a massive problem. However some such as Global Contribs[0] and GUC[1] can theoretically query all of them from a single request. Creating new connections on-the-fly seems doable in production; the issue is how to work

[Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-10 Thread Joaquin Oltra Hernandez
TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1] for more details. Hi!