Is it feasible to do a log analysis of the database servers to find out which 
tools are (or were) using cross-wiki joins?  That would at least ensure that 
all the tool owners can be contacted directly so they know this change is 
happening.
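For what it's worth, a rough sketch of what such a scan might look like (this 
is purely illustrative Python; it assumes the logs contain the raw SQL text, 
and the `<wiki>_p` replica-database naming pattern is the only heuristic used):

```python
import re

# Flag a query as cross-wiki if it references more than one distinct
# "<db>_p" replica database (e.g. enwiki_p and commonswiki_p).
DB_RE = re.compile(r"\b(\w+_p)\.")

def is_cross_wiki(sql):
    dbs = {m.group(1).lower() for m in DB_RE.finditer(sql)}
    return len(dbs) > 1

queries = [
    "SELECT page_title FROM enwiki_p.page",
    "SELECT a.page_title FROM enwiki_p.page a "
    "JOIN commonswiki_p.page b ON a.page_title = b.page_title",
]
print([is_cross_wiki(q) for q in queries])  # [False, True]
```

Whether the replica servers actually log full query text (and for how long) is 
a question for the admins, of course.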

> On Mar 31, 2021, at 3:46 PM, Joaquin Oltra Hernandez 
> <jhernan...@wikimedia.org> wrote:
> 
> Hi Fastily, we are aware of the use case for matching Commons 
> pages/images/SHA-1s between Commons/big wikis and other wikis, as it has come 
> up many times. I'm cataloging all the comments and examples from the last 5 
> months in order to provide categorized input to the parent task 
> <https://phabricator.wikimedia.org/T215858> so that the engineering teams can 
> think of solutions. I'll share it publicly once it is in a presentable state.
> 
> We did some exploration a while ago (based on Huji's examples); you can see 
> some notebooks with Python approaches here 
> <https://phabricator.wikimedia.org/T267992#6637250>. However, there is too 
> much data, so doing the same thing takes a very long time and can be 
> impractical. If you want to give it a try, have a look at the notebooks. I 
> don't think the code is too memory-intensive, especially in bd808's notebook 
> using the API, so a Raspberry Pi could maybe handle it.
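To sketch what "doing the join in code" amounts to (my own illustration, not 
the notebooks' actual code; the URL-building helper is hypothetical and omits 
batching and continuation): fetch the title lists from each wiki, then 
intersect them locally.

```python
from urllib.parse import urlencode

def existence_check_url(host, titles):
    # Hypothetical helper: build an action=query URL asking which of
    # `titles` exist on `host` (batch limits and continuation omitted).
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"

def name_clashes(local_titles, commons_titles):
    # Once both title lists are in memory, the "join" itself is just
    # a set intersection.
    return sorted(set(local_titles) & set(commons_titles))

# Toy data standing in for API results:
enwiki = ["File:Example.jpg", "File:Only-local.png"]
commons = ["File:Example.jpg", "File:Only-commons.svg"]
print(name_clashes(enwiki, commons))  # ['File:Example.jpg']
```

The intersection is cheap; the expensive part is fetching and paging through 
millions of titles, which is why this is impractical at Commons scale.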
> 
> It is more complex and error-prone, for sure, so disabling those reports and 
> waiting is, sadly, the option for now, until a suitable solution for this is 
> found.
> 
> So, to answer your question:
> 
> Is there going to be a replacement for this functionality?
> I can't promise anything yet, but I can assure you that the teams involved in 
> these systems are aware of the need for this functionality and will be 
> looking into how to provide it so these reports/bots/queries remain viable.
> 
> We will send updates or new info to the cloud lists, and you can subscribe to 
> these tasks if you want to follow more closely:
> 
> * Plan a replacement for wiki replicas that is better suited to typical OLAP 
>   use cases than the MediaWiki OLTP schema 
>   <https://phabricator.wikimedia.org/T215858>
> * Provide mechanism to detect name clashed media between Commons and a Local 
>   project, without needing to join tables across wiki-db's 
>   <https://phabricator.wikimedia.org/T267992>
> * Provide a mechanism for detecting duplicate files in commons and a local 
>   wiki <https://phabricator.wikimedia.org/T268240>
> * Provide a mechanism for detecting duplicate files in enwiki and another 
>   wikipedia <https://phabricator.wikimedia.org/T268242>
> * Provide a mechanism for accessing the names of image files on Commons when 
>   querying another wiki <https://phabricator.wikimedia.org/T268244>
> 
> 
> On Wed, Mar 31, 2021 at 10:57 AM Fastily <fastil...@gmail.com 
> <mailto:fastil...@gmail.com>> wrote:
> A little late to the party, I just learned about this change today.
> 
> I maintain a number of bot tasks 
> <https://en.wikipedia.org/wiki/User:FastilyBot> and database reports 
> <https://fastilybot-reports.toolforge.org/> 
> <https://en.wikipedia.org/wiki/Wikipedia:Database_reports> on enwp that rely 
> on cross-wiki joins (mostly page-title joins between enwp and Commons) to 
> function properly.  I didn't find the migration instructions 
> <https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today?>
>  very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it 
> would be grossly impractical for me to perform a "join" in the bot's code.
> 
> Is there going to be a replacement for this functionality?
> 
> Fastily
> 
> On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu <dandree...@wikimedia.org 
> <mailto:dandree...@wikimedia.org>> wrote:
> [4] was made to figure out common use cases and possibilities to enable them 
> again.
> ...
> [4] https://phabricator.wikimedia.org/T215858 
> <https://phabricator.wikimedia.org/T215858>
> 
> I just want to highlight this ^ point Joaquin made and mention that our team 
> (Data Engineering) is also participating in brainstorming ways to not just 
> bring back cross-wiki joins but also provide better datasets for running 
> these queries.  We have some good ideas, so please do participate in the task 
> and give us more input so we can pick the best solution quickly.
> _______________________________________________
> Wikimedia Cloud Services mailing list
> Cloud@lists.wikimedia.org <mailto:Cloud@lists.wikimedia.org> (formerly 
> lab...@lists.wikimedia.org <mailto:lab...@lists.wikimedia.org>)
> https://lists.wikimedia.org/mailman/listinfo/cloud 
> <https://lists.wikimedia.org/mailman/listinfo/cloud>
> 
> 
> -- 
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation

