Is it feasible to do a log analysis of the database servers to find out which tools are (or were) using cross-wiki joins? At least that would make it possible to contact all of the affected tool owners directly and make sure they know this is happening.
> On Mar 31, 2021, at 3:46 PM, Joaquin Oltra Hernandez <jhernan...@wikimedia.org> wrote:
>
> Hi Fastily, we are aware of the use case for matching commons pages/images/sha1s between commons/big wikis and other wikis, as it has come up many times. I'm cataloging all the comments and examples that have come up in the last 5 months in order to provide categorized input to the parent task <https://phabricator.wikimedia.org/T215858> so that the engineering teams can think of solutions. I'll share it publicly once it is in a presentable state.
>
> We did some exploration a while ago (from Huji's examples); you can see some notebooks with Python approaches here <https://phabricator.wikimedia.org/T267992#6637250>, but there is too much data, so doing the same thing takes a very long time and can be impractical. If you want to give it a try, have a look at the notebooks. I don't think the code is too memory intensive, especially in bd808's notebook using the API, and a Raspberry Pi could maybe handle it.
>
> It is more complex and error-prone, for sure, so disabling those reports and waiting is sadly the only option right now, until a suitable solution for this is found.
>
> So, to answer your question:
>
> > Is there going to be a replacement for this functionality?
>
> I can't promise anything yet, but I can assure you that the teams involved in these systems are aware of the need for this functionality and will be looking into how to provide it to make these reports/bots/queries viable.
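[For readers landing on this thread from the archives: the API-based SHA-1 matching mentioned above can be sketched roughly as follows. This is a minimal illustration in the spirit of the linked notebooks, not their actual code; the function names, batching, and the in-memory join step are my own, while the `list=allimages` / `aisha1` API parameters are standard MediaWiki API query options.]

```python
"""Sketch: find files on a local wiki that duplicate Commons files by SHA-1.

A hedged illustration of the API-based approach discussed in this thread.
Function names and the join strategy are assumptions, not the notebooks' code.
"""
import json
from urllib.parse import urlencode
from urllib.request import urlopen

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def api_get(endpoint, **params):
    """Perform a single MediaWiki API query and return the parsed JSON."""
    params.update(action="query", format="json")
    with urlopen(endpoint + "?" + urlencode(params)) as resp:
        return json.load(resp)


def commons_titles_for_sha1(sha1):
    """Return Commons file titles whose content matches a SHA-1 hex digest.

    Uses list=allimages with its SHA-1 filter (aisha1). Requires network.
    """
    data = api_get(COMMONS_API, list="allimages", aisha1=sha1, ailimit="max")
    return [img["title"] for img in data["query"]["allimages"]]


def match_by_sha1(local_files, commons_index):
    """Pure join step: pair local files with Commons titles sharing a SHA-1.

    local_files:   {local title -> sha1}
    commons_index: {sha1 -> [Commons titles]}
    """
    return {
        title: commons_index[sha1]
        for title, sha1 in local_files.items()
        if sha1 in commons_index
    }


# Small in-memory demo of the join step (no network needed, sample data):
local = {"File:Example.png": "0123abcd", "File:Other.png": "ffff0000"}
index = {"0123abcd": ["File:Example.png"]}
print(match_by_sha1(local, index))
```

The point of separating the pure `match_by_sha1` step from the API fetch is memory control: only the hash index for one side needs to be held at once, which matters on small hardware.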
> We will send updates or new info to the cloud lists, and you can subscribe to these tasks if you want to follow more closely:
>
> - Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema <https://phabricator.wikimedia.org/T215858>
> - Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's <https://phabricator.wikimedia.org/T267992>
> - Provide a mechanism for detecting duplicate files in commons and a local wiki <https://phabricator.wikimedia.org/T268240>
> - Provide a mechanism for detecting duplicate files in enwiki and another wikipedia <https://phabricator.wikimedia.org/T268242>
> - Provide a mechanism for accessing the names of image files on Commons when querying another wiki <https://phabricator.wikimedia.org/T268244>
>
> On Wed, Mar 31, 2021 at 10:57 AM Fastily <fastil...@gmail.com> wrote:
>
> > A little late to the party: I just learned about this change today.
> >
> > I maintain a number of bot tasks <https://en.wikipedia.org/wiki/User:FastilyBot> and database reports <https://fastilybot-reports.toolforge.org/> <https://en.wikipedia.org/wiki/Wikipedia:Database_reports> on enwp that rely on cross-wiki joins (mostly page title joins between enwp and Commons) to function properly. I didn't find the migration instructions <https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today?> very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it would be grossly impractical for me to perform a "join" in the bot's code.
> >
> > Is there going to be a replacement for this functionality?
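[Archive note: for anyone weighing how impractical the "join in the bot's code" actually is, here is a rough sketch of a client-side page-title join between two wikis. The function names and the one-sided hash-join strategy are my own assumptions; `list=allpages` with `apnamespace=6` (the File namespace) and API continuation are standard MediaWiki API features. Memory stays modest because only the smaller wiki's titles are materialised as a set while the other side is streamed.]

```python
"""Sketch: a client-side 'join' on File: page titles between two wikis.

A hedged illustration, not an endorsed replacement for replica SQL joins.
Real bot code would add batching limits, retries, and rate limiting.
"""
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def iter_file_titles(api_url):
    """Stream every File: page title from a wiki via list=allpages.

    Follows the standard API 'continue' mechanism page by page, so memory
    use is bounded by one response at a time. Requires network.
    """
    params = {"action": "query", "format": "json",
              "list": "allpages", "apnamespace": 6, "aplimit": "max"}
    while True:
        with urlopen(api_url + "?" + urlencode(params)) as resp:
            data = json.load(resp)
        for page in data["query"]["allpages"]:
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            break
        params.update(cont)  # resume from where the last batch ended


def title_clashes(smaller_side, larger_stream):
    """Pure join step: titles present on both wikis.

    Materialise the smaller side as a set; stream the larger side past it.
    """
    seen = set(smaller_side)
    return sorted(t for t in larger_stream if t in seen)


# Usage against live wikis would look roughly like (network required):
#   local = list(iter_file_titles("https://en.wikipedia.org/w/api.php"))
#   clashes = title_clashes(local,
#                           iter_file_titles("https://commons.wikimedia.org/w/api.php"))
```

Even with the streaming trick, Commons has tens of millions of files, so the wall-clock cost of paging through the API is the real bottleneck, which is consistent with the "takes a very long time and can be impractical" assessment earlier in the thread.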
> > Fastily
> >
> > On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu <dandree...@wikimedia.org> wrote:
> >
> > > [4] was made to figure out common use cases and possibilities to enable them again.
> > > ...
> > > [4] https://phabricator.wikimedia.org/T215858
> >
> > I just want to highlight this ^ thing Joaquin said and mention that our team (Data Engineering) is also participating in brainstorming ways to bring back not just cross-wiki joins but better datasets to run these queries. We have some good ideas, so please do participate in the task and give us more input so we can pick the best solution quickly.
>
> --
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud