Hi Fastily, we are aware of the use case for matching Commons pages/images/sha1s between Commons/big wikis and other wikis, as it has come up many times. I'm cataloging all the comments and examples that have come up in the last 5 months in order to provide categorized input to the parent task <https://phabricator.wikimedia.org/T215858> so that the engineering teams can think of solutions. I'll share it publicly once it is in a presentable state.
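
For anyone unfamiliar with the use case, here is a minimal sketch of that kind of match done in client code rather than with a cross-database join. It is only an illustration and is not taken from any existing tool: the function name `merge_matches` and the sample titles are made up, and it assumes each wiki can hand you its keys (titles or sha1s) already sorted and unique, so that only one key per side is held in memory at a time.

```python
from typing import Iterable, Iterator


def merge_matches(local_keys: Iterable[str], commons_keys: Iterable[str]) -> Iterator[str]:
    """Yield keys present in both sorted inputs (a streaming merge join).

    Hypothetical helper: both inputs must be sorted ascending with the
    same ordering as Python's string comparison, and keys are assumed
    to be unique within each input.
    """
    local_iter, commons_iter = iter(local_keys), iter(commons_keys)
    local = next(local_iter, None)
    commons = next(commons_iter, None)
    while local is not None and commons is not None:
        if local == commons:
            # Same key on both sides: report it and advance both.
            yield local
            local = next(local_iter, None)
            commons = next(commons_iter, None)
        elif local < commons:
            # The local key cannot appear later on the Commons side.
            local = next(local_iter, None)
        else:
            commons = next(commons_iter, None)


# Made-up sample data standing in for sorted file titles from each wiki.
enwiki_titles = ["Apple.jpg", "Bridge.png", "Cat.jpg", "Zebra.svg"]
commons_titles = ["Bridge.png", "Cat.jpg", "Dog.jpg"]

clashes = list(merge_matches(enwiki_titles, commons_titles))
print(clashes)
```

In a real report the two inputs would be generators paging through each wiki's data, and fetching those pages is where most of the time would go, not the comparison itself.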
We did some exploration a while ago (from Huji's examples). You can see some notebooks with Python approaches here <https://phabricator.wikimedia.org/T267992#6637250>, but there is too much data, so doing the same thing in code takes a very long time and can be impractical. If you want to give it a try, have a look at the notebooks. I don't think the code is too memory intensive, especially in bd808's notebook using the API, and a Raspberry Pi could maybe handle it. It is more complex and error-prone, for sure, so disabling those reports and waiting is sadly the option right now, until a suitable solution is found.

So, to answer your question: Is there going to be a replacement for this functionality? I can't promise anything yet, but I can assure you that the teams involved in these systems are aware of the need for this functionality and will be looking into how to provide it to make these reports/bots/queries viable.

We will send updates or new info to the cloud lists, and you can subscribe to these tasks if you want to follow more closely:

- Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema <https://phabricator.wikimedia.org/T215858>
- Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's <https://phabricator.wikimedia.org/T267992>
- Provide a mechanism for detecting duplicate files in commons and a local wiki <https://phabricator.wikimedia.org/T268240>
- Provide a mechanism for detecting duplicate files in enwiki and another wikipedia <https://phabricator.wikimedia.org/T268242>
- Provide a mechanism for accessing the names of image files on Commons when querying another wiki <https://phabricator.wikimedia.org/T268244>

On Wed, Mar 31, 2021 at 10:57 AM Fastily <fastil...@gmail.com> wrote:

> A little late to the party, I just learned about this change today.
>
> I maintain a number of bot tasks <https://en.wikipedia.org/wiki/User:FastilyBot> and database <https://fastilybot-reports.toolforge.org> reports <https://en.wikipedia.org/wiki/Wikipedia:Database_reports> on enwp that rely on cross-wiki joins (mostly page title joins between enwp and Commons) to function properly. I didn't find the migration instructions <https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today?> very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it would be grossly impractical for me to perform a "join" in the bot's code.
>
> Is there going to be a replacement for this functionality?
>
> Fastily
>
> On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu <dandree...@wikimedia.org> wrote:
>
>>> [4] was made to figure out common use cases and possibilities to enable them again.
>> ...
>>> [4] https://phabricator.wikimedia.org/T215858
>>
>> I just want to highlight this ^ thing Joaquin said and mention that our team (Data Engineering) is also participating in brainstorming ways to bring back not just cross-wiki joins but better datasets to run these queries. We have some good ideas, so please do participate in the task and give us more input so we can pick the best solution quickly.

--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation
_______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud