Hi Fastily, we are aware of the use case for matching Commons
pages/images/sha1s between Commons/big wikis and other wikis, as it has
come up many times. I'm cataloging all the comments and examples that have
come up in the last 5 months in order to provide categorized input to
the parent task <https://phabricator.wikimedia.org/T215858> so that the
engineering teams can think of solutions. I'll share it publicly once it
is in a presentable state.

We did some exploration a while ago (based on Huji's examples). You can see
some notebooks with Python approaches here
<https://phabricator.wikimedia.org/T267992#6637250>, but there is a lot of
data, so doing the same thing in code takes a very long time and can be
impractical. If you want to give it a try, have a look at the notebooks. I
don't think the code is too memory intensive, especially in bd808's
notebook using the API, so a Raspberry Pi could maybe handle it.
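
To give a rough idea of what an API-based approach can look like, here is a
minimal sketch. It is not the code from the notebooks: the function names,
the User-Agent string and the batch size of 50 titles per request are
assumptions made for illustration. It takes file titles from a local wiki
and asks the Commons API, one small batch at a time, which of them also
exist there, so only one batch is held in memory at once.

```python
import json
import urllib.parse
import urllib.request
from itertools import islice

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def batched(iterable, size):
    """Yield lists of at most `size` items, so the full list never has to be in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fetch_existing_titles(titles, api_url=COMMONS_API):
    """Return the subset of `titles` that exist on the wiki at `api_url` (one HTTP request).

    Titles come back in the API's normalized form (spaces instead of underscores).
    """
    params = urllib.parse.urlencode({
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": "|".join(titles),
    })
    request = urllib.request.Request(
        f"{api_url}?{params}",
        # Hypothetical User-Agent; replace it with one that identifies your tool.
        headers={"User-Agent": "name-clash-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    pages = data.get("query", {}).get("pages", [])
    return {
        page["title"]
        for page in pages
        if "title" in page and not page.get("missing") and not page.get("invalid")
    }


def find_name_clashes(local_titles, fetch=fetch_existing_titles, batch_size=50):
    """Collect local file titles that also exist on the other wiki.

    `fetch` is injectable so the logic can be tried without network access.
    """
    clashes = set()
    for chunk in batched(local_titles, batch_size):
        clashes |= fetch(chunk)
    return clashes
```

Calling find_name_clashes(["File:Example.jpg", ...]) would make one request
per batch. For a large wiki that is still a lot of requests, which is the
"takes a very long time" part.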

Doing the join in code is more complex and error-prone, for sure, so
disabling those reports and waiting is, sadly, the alternative right now,
until a suitable solution is found.

So, to answer your question:

> Is there going to be a replacement for this functionality?

I can't promise anything yet but I can assure you the teams involved in
these systems are aware of the need for this functionality and will be
looking into how to provide it to make these reports/bots/queries viable.

We will send updates or new info to the cloud lists, and you can subscribe
to these tasks if you want to follow more closely:

   - Plan a replacement for wiki replicas that is better suited to typical
   OLAP use cases than the MediaWiki OLTP schema
   <https://phabricator.wikimedia.org/T215858>
   - Provide mechanism to detect name clashed media between Commons and a
   Local project, without needing to join tables across wiki-db's
   <https://phabricator.wikimedia.org/T267992>
   - Provide a mechanism for detecting duplicate files in commons and a
   local wiki <https://phabricator.wikimedia.org/T268240>
   - Provide a mechanism for detecting duplicate files in enwiki and
   another wikipedia <https://phabricator.wikimedia.org/T268242>
   - Provide a mechanism for accessing the names of image files on Commons
   when querying another wiki <https://phabricator.wikimedia.org/T268244>



On Wed, Mar 31, 2021 at 10:57 AM Fastily <fastil...@gmail.com> wrote:

> A little late to the party, I just learned about this change today.
>
> I maintain a number of bot tasks
> <https://en.wikipedia.org/wiki/User:FastilyBot> and database
> <https://fastilybot-reports.toolforge.org> reports
> <https://en.wikipedia.org/wiki/Wikipedia:Database_reports> on enwp that
> rely on cross-wiki joins (mostly page title joins between enwp and Commons)
> to function properly.  I didn't find the migration instructions
> <https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today?>
> very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it
> would be grossly impractical for me to perform a "join" in the bot's code.
>
> Is there going to be a replacement for this functionality?
>
> Fastily
>
> On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> [4] was made to figure out common use cases and possibilities to enable
>>> them again.
>>>
>> ...
>>
>>> [4] https://phabricator.wikimedia.org/T215858
>>>
>>
>> I just want to highlight this ^ thing Joaquin said and mention that our
>> team (Data Engineering) is also participating in brainstorming ways to
>> bring back not just cross-wiki joins but better datasets to run these
>> queries.  We have some good ideas, so please do participate in the task and
>> give us more input so we can pick the best solution quickly.
>> _______________________________________________
>> Wikimedia Cloud Services mailing list
>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>
>


-- 
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation