Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

Joaquin Oltra Hernandez Mon, 16 Nov 2020 12:21:24 -0800

I have incorporated some of the suggestions and info from the threads to
the wiki page.


I would like to document specific code examples, specially of SQL cross
joins, and how migrating away from them would look like. If you have looked
at this and done changes to your queries and code it would be super helpful
if you can point me to the repo/code to show real examples in the
documentation.

If you can share real use cases of your use cases like Huji Lee and
MusikAnimal, it is also very useful to discuss and get help, thanks Brooke
and ACN.

Huji, how do you run the queries? Do you have a Toolforge project with code
where you query the DB? Do you use Quarry or maybe PAWS? The migration path
is different depending on your workflow and skills so that background
information helps provide suggestions. Like ACN mentioned, the answer
varies a lot depending on context and use case, we can try and help you do
the changes bit by bit. If you want to make a new thread with the specifics
of your code we should be able to help come to solutions.

On Wed, Nov 11, 2020 at 5:25 AM AntiCompositeNumber <
anticompositenum...@gmail.com> wrote:

> Most cross-db JOINs can be recreated using two queries and an external
> tool to filter the results. However, there are some queries that would
> be simply impractical due to the large amount of data involved, and
> the query for overlapping local and Commons images is one of them.
> There are basically two ways to recreate the query: re-implement the
> inner join or re-implement a semi-join subquery.
>
> Recreating a JOIN is conceptually very simple: get two lists and
> compare them. However, there are 67,034 files on fawiki, 891,286 files
> on enwiki, and 65,559,375 files on Commons. Simply joining by name
> would be impossible -- MariaDB would time out a few hundred times
> before returning all that data, and even if it did, storing those
> lists even as efficiently as possible would be quite the memory hog.
> So the query would have to be paginated. The only common identifier we
> have is the file name, and since the letters in the names aren't
> evenly distributed, paginating wouldn't exactly be fun.
> The other option is implementing the Commons lookup like a semi-join
> subquery. Iterate over the local data, paginating any way you want.
> Then, for every item, query the Commons database for that title. Of
> course, we're now making a million requests to the database, which
> isn't going to be very fast simply due to network delays. We could be
> a little nicer and group a bunch of titles together in the query,
> which will probably get us down from a million queries to fifty
> thousand or so. Of course, this all gets more complicated if you want
> a query more complex than SELECT enwiki_p.img_title FROM
> enwiki_p.image JOIN commonswiki_p.image ON enwiki_p.img_title =
> commonswiki_p.img_title;
>
> I understand the system engineering reasons for this change, but I
> think it's worth underscoring exactly how disruptive it will be for
> the queries that depended on this functionality. I'm certainly no
> expert, but I'm willing to help wrap queries in Python until they
> start working again.
>
> ACN
>
> On Tue, Nov 10, 2020 at 8:48 PM Huji Lee <huji.h...@gmail.com> wrote:
> >
> > Cross-wiki JOINS are used by some of the queries we run regularly for
> fawiki. One of those queries looks for articles that don't have an image in
> their infobox in fawiki, but do have one on enwiki, so that we can
> use/import that image. Another one JOINs fawiki data with commons data to
> look for redundant images. Yet another one, looks for articles that all use
> an image that doesn't exist (for cleanup purposes) but needs to join with
> commons db because the referenced file might exist there. Lastly, we have a
> report that looks for fair use images on fawiki that had the same name as
> an image on enwiki where the enwiki copy was deleted; this usually
> indicates in improper application of fair use, and enwiki -- due to its
> larger community -- finds and deletes these faster than we could on fawiki.
> >
> > There may be other cases I am unaware of. The point is, losing the
> cross-wiki JOIN capability can make some of the above tasks really
> difficult or completely impossible.
> >
> > On Tue, Nov 10, 2020 at 3:27 PM Joaquin Oltra Hernandez <
> jhernan...@wikimedia.org> wrote:
> >>
> >> TLDR: Wiki Replicas' architecture is being redesigned for stability and
> performance. Cross database JOINs will not be available and a host
> connection will only allow querying its associated DB. See [1] for more
> details.
> >>
> >> Hi!
> >>
> >> In the interest of making and keeping Wiki Replicas a stable and
> performant service, a new backend architecture is needed. There is some
> impact in the features and usage patterns.
> >>
> >> What should I do? To avoid breaking changes, you can start making the
> following changes *now*:
> >> - Update existing tools to ensure queries are executed against the
> proper database connection
> >>   - Eg: If you want to query the `eswiki_p` DB, you must connect to the
> `eswiki.analytics.db.svc.eqiad.wmflabs` host and `eswiki_p` DB, and not to
> enwiki or other hosts
> >> - Check your existing tools and services queries for cross database
> JOINs, rewrite the joins in application code
> >>   - Eg: If you are doing a join across databases, for example joining
> `enwiki_p` and `eswiki_p`, you will need to query them separately, and
> filter the results of the separate queries in the code
> >>
> >> Timeline:
> >> - November - December: Early adopter testing
> >> - January 2021: Existing and new systems online, transition period
> starts
> >> - February 2021: Old hardware is decommissioned
> >>
> >> We need your help
> >> - If you would like to beta test the new architecture, please let us
> know and we will reach out to you soon
> >> - Sharing examples / descriptions of how a tool or service was updated,
> writing a common solution or some example code others can utilize and
> reference, helping others on IRC and the mailing lists
> >>
> >> If you have questions or need help adapting your code or queries,
> please contact us [2], or write on the talk page [3].
> >>
> >> We will be sending reminders, and more specific examples of the changes
> via email and on the wiki page. For more information see [1].
> >>
> >> [1]:
> https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
> >> [2]:
> https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
> >> [3]:
> https://wikitech.wikimedia.org/wiki/Talk:News/Wiki_Replicas_2020_Redesign
> >>
> >> --
> >> Joaquin Oltra Hernandez
> >> Developer Advocate - Wikimedia Foundation
> >> _______________________________________________
> >> Wikimedia Cloud Services announce mailing list
> >> cloud-annou...@lists.wikimedia.org (formerly
> labs-annou...@lists.wikimedia.org)
> >> https://lists.wikimedia.org/mailman/listinfo/cloud-announce
> >> _______________________________________________
> >> Wikimedia Cloud Services mailing list
> >> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> >> https://lists.wikimedia.org/mailman/listinfo/cloud
> >
> > _______________________________________________
> > Wikimedia Cloud Services mailing list
> > Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> > https://lists.wikimedia.org/mailman/listinfo/cloud
>
> _______________________________________________
> Wikimedia Cloud Services mailing list
> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> https://lists.wikimedia.org/mailman/listinfo/cloud
>


-- 
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation

_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud

Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

Reply via email to