Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Kimmo Virtanen
> ... As far as I could see in the docs, connection reusing and cross-DB joins are not documented or advertised ...
Not sure what you are talking about. The cross-DB joins were key features of ToolServer[1] which Wikimedia Labs replaced in 2012-2014 and those were just not as initial features of W

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Maarten Dammers
Hi Joaquin,
On 16-11-2020 21:42, Joaquin Oltra Hernandez wrote:
> Hi Maarten, I believe this work started many years ago, and it was paused, and recently restarted because of the stability and performance problems in the last years.
You do realize the current setup was announced as new 3 years

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Jaime Crespo
So I think there is something here. Different people have different needs. So far, the number one need for wikireplicas has been that of those who needed underlying access to an "almost real time" copy of the internal database structure, as is. This is based on the fact that latency is the most common compla

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread AntiCompositeNumber
I took a look at converting the query used for GreenC Bot's Job 10, which tracks enwiki files that "shadow" a different file on Commons. It is currently run daily, and the query executes in about 60-90 seconds. I tried three methods to recreate that query without a SQL cross-database join. The naiv
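For readers unfamiliar with the problem, one general way to replace a cross-database join is to query each database over its own connection and combine the results in application code. The sketch below is only an illustration of that idea, not necessarily any of the three methods tried in the message above. It assumes a simplified `image` table with a single `img_name` column, uses two in-memory SQLite databases as stand-ins for the enwiki and Commons replicas, and treats a "shadow" as nothing more than a matching file name. The helper names `make_db` and `shadowing_files` and the `batch_size` parameter are hypothetical.

```python
import sqlite3


def make_db(names):
    """Build a throwaway in-memory database with a toy `image` table.

    Assumption: a stand-in for one wiki replica; the real schema is larger.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO image (img_name) VALUES (?)", [(n,) for n in names]
    )
    return conn


def shadowing_files(local_conn, commons_conn, batch_size=2):
    """Return local file names that also exist in the Commons stand-in.

    No cross-database join is used: names are read from one connection
    and looked up on the other in batches via an IN (...) clause.
    """
    local_names = [
        row[0]
        for row in local_conn.execute(
            "SELECT img_name FROM image ORDER BY img_name"
        )
    ]
    shadows = []
    for start in range(0, len(local_names), batch_size):
        batch = local_names[start:start + batch_size]
        # One "?" placeholder per name in this batch.
        placeholders = ",".join("?" * len(batch))
        query = f"SELECT img_name FROM image WHERE img_name IN ({placeholders})"
        shadows.extend(row[0] for row in commons_conn.execute(query, batch))
    return sorted(shadows)


# Toy data: two files exist under the same name in both databases.
enwiki = make_db(["Apple.jpg", "Banana.png", "Cherry.svg"])
commons = make_db(["Banana.png", "Cherry.svg", "Date.gif"])
result = shadowing_files(enwiki, commons)
print(result)
```

Batching keeps each lookup query bounded in size, at the cost of one round trip per batch; the right batch size for a real replica would have to be measured rather than assumed.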

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Brooke Storm
ACN: Thanks! We’ve created a ticket for that one to help collaborate and surface the process here: https://phabricator.wikimedia.org/T267992 Anybody working on that, please add info there. Brooke Storm Staff SRE Wikimedia Cloud Services bst...@wikimedi

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

2020-11-17 Thread Amir Sarabadani
Hello, Actually Jaime's email gave me an idea. Why not have a separate actual data lake? Like a Hadoop cluster, it can even take the data from the analytics cluster (after being sanitized of course). I remember there were some discussions about having a Hadoop or Presto cluster in WM Cloud. Has this