[Wikitech-l] Re: MediaWiki core schema drifts with production databases

2022-08-22 Thread William Avery
Well done. This will save a ton of trouble down the road.

The number associated with FlaggedRevs is 'interesting', and is not
inconsistent with my experiences as an editor.

On Mon, 22 Aug 2022 at 08:56, Amir Sarabadani wrote:

> Hello,
>
> After many years of work, I'm happy to announce a milestone in addressing
> one of our major areas of tech debt in database infrastructure: we have
> eliminated all schema drifts between MediaWiki core and production.
>
> It all started six years ago, when users on English Wikipedia reported that
> checking the history of some pages was quite slow, *at random*. More
> in-depth analysis showed that the revision table of English Wikipedia was
> missing an important index on some of the replicas. An audit of the schema
> of the revision table revealed much bigger drifts in that table on that
> wiki. You can read more in its ticket: T132416
> <https://phabricator.wikimedia.org/T132416>
>
> Lack of schema parity between expectation and reality is quite dangerous.
> Trying to force an index in code, assuming it exists in production under
> the same name, causes a fatal error every time it’s attempted. Trying to
> write to a field that doesn’t exist is similar. Such changes easily pass
> tests and work well in our test setups (such as the beta cluster), only to
> cause an outage in production.
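>
> To make the failure mode concrete, here is a minimal sketch (with a
> hypothetical index name, placeholder connection details, and PyMySQL
> assumed as the client library) of how a missing index becomes a hard
> error rather than just a slow query:
>
>   import pymysql  # assumed DB-API client; any driver behaves similarly
>
>   conn = pymysql.connect(host="replica.example", database="enwiki",
>                          user="user", password="secret")
>   with conn.cursor() as cur:
>       try:
>           # FORCE INDEX is rejected outright when the named index does
>           # not exist on this replica; a plain query would merely be slow.
>           cur.execute(
>               "SELECT rev_id FROM revision FORCE INDEX (rev_page_timestamp) "
>               "WHERE rev_page = %s ORDER BY rev_timestamp DESC LIMIT 50",
>               (12345,),
>           )
>       except pymysql.MySQLError as exc:
>           # MariaDB raises error 1176: "Key 'rev_page_timestamp' doesn't
>           # exist in table 'revision'" -- a fatal error in production.
>           print("schema drift:", exc)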
>
> If one table on one wiki had this many drifts, looking at all wikis and
> all tables became vitally important. We have ~1,000 wikis and ~200
> database hosts (each one hosting on average ~100 wikis), and each wiki has
> ~130 tables (half of them tables from MediaWiki core); each table can have
> multiple drifts.
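>
> As a back-of-envelope calculation with the numbers above, the audit
> surface comes to millions of table copies:
>
>   hosts = 200            # database hosts
>   wikis_per_host = 100   # wikis per host, on average
>   tables_per_wiki = 130  # tables per wiki, half from MediaWiki core
>   copies = hosts * wikis_per_host * tables_per_wiki
>   print(f"~{copies:,} table copies to audit")  # ~2,600,000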
>
> We slowly started looking for and addressing schema drifts five years ago,
> and later automated the discovery by utilizing the abstract schema (before
> that, the tool had to parse SQL), uncovering an overwhelming number of
> drifts. You can look at the history of the work in T104459
> <https://phabricator.wikimedia.org/T104459>.
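>
> In essence, such a check reduces to diffing the expected definitions
> against information_schema on each host. A minimal sketch, assuming
> MediaWiki's JSON abstract schema (maintenance/tables.json) and placeholder
> connection details; this is not the actual tooling:
>
>   import json
>
>   import pymysql  # assumed client library
>
>   # Expected index names per table, from the abstract schema.
>   with open("maintenance/tables.json") as f:
>       expected = {
>           t["name"]: {idx["name"] for idx in t.get("indexes", [])}
>           for t in json.load(f)
>       }
>
>   conn = pymysql.connect(host="replica.example", database="enwiki",
>                          user="user", password="secret")
>   with conn.cursor() as cur:
>       for table, want in expected.items():
>           cur.execute(
>               "SELECT DISTINCT index_name FROM information_schema.statistics "
>               "WHERE table_schema = DATABASE() AND table_name = %s",
>               (table,),
>           )
>           have = {row[0] for row in cur.fetchall()} - {"PRIMARY"}
>           for missing in sorted(want - have):
>               print(f"drift: {table} is missing index {missing}")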
>
> Around fifty tickets addressing the drifts have been completed; they are
> collected in T312538 <https://phabricator.wikimedia.org/T312538>. I
> suggest checking some of them to see the scale of the work done. Each one
> of these tickets took days to months of work to finish. A large number of
> the drifts also existed in primary databases, requiring a primary
> switchover and read-only time for one or more wikis. Each drift was
> different; in some cases the code needed to change rather than production,
> so each one required a thorough investigation.
>
> Why do such drifts happen? The most common reason was a schema change that
> happened in code but was never requested to be applied in production. For
> example, a schema change in code in 2007 left any wiki created before that
> date with a different schema than wikis created after it. In 2015 we
> introduced processes and tooling to make sure this doesn’t happen anymore,
> but we still needed to address the pre-existing drifts. The second most
> common reason was a host missing a schema change for various reasons (for
> example, being out of rotation while the schema change was being applied,
> a shortcoming of the manual process). By automating most of the schema
> change operational work, we reduced the chance of such drifts happening
> as well.
>
> After finishing core, we now need to look at WMF-deployed extensions,
> starting with FlaggedRevs, which, while deployed to only 50 wikis and
> having only 8 tables, has ~7,000 drifts. Thankfully, most other extensions
> are in a healthier state.
>
> I would like to personally thank Manuel Arostegui and Jaime Crespo for
> their monumental dedication to fixing these issues over the past years.
> Also a big thank you to several of our amazing developers, Umherirrender,
> James Forrester and Sam Reed, who helped with reporting, with going
> through the history of MediaWiki to figure out why these drifts happened,
> and with building the reporting tools.
>
> Best
> --
> *Amir Sarabadani (he/him)*
> Staff Database Architect
> Wikimedia Foundation 
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

[Wikitech-l] Re: [Wikimedia-l] Accessing wikipedia metadata

2021-09-16 Thread William Avery
A memorable piece of research in this area sampled articles using the API.
https://arxiv.org/abs/1904.08139
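
For anyone wanting to try something similar, here is a minimal sketch (not
the paper's actual code) of uniform article sampling with the MediaWiki
API's random generator, using the requests library:

  import requests

  API = "https://en.wikipedia.org/w/api.php"
  resp = requests.get(API, params={
      "action": "query",
      "generator": "random",
      "grnnamespace": 0,   # main (article) namespace only
      "grnlimit": 10,      # pages per request
      "prop": "info",      # basic metadata: length, last revision id, etc.
      "format": "json",
  }, headers={"User-Agent": "sampling-sketch/0.1 (replace with contact info)"})
  resp.raise_for_status()
  for page in resp.json()["query"]["pages"].values():
      print(page["pageid"], page["title"], page["length"])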

Regards,

Will Avery

On Thu, 16 Sep 2021, 19:21 Risker,  wrote:

> Mike's suggestion is good. You would likely get better responses by
> putting this question to the Wikimedia developers, so I am forwarding it
> to that list.
>
> Risker
>
> On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <
> wikimedi...@lists.wikimedia.org> wrote:
>
>> Hello everyone,
>>
>>
>>
>> It is my first time interacting on this mailing list, so I will be happy
>> to receive feedback on how to better interact with the community :)
>>
>>
>>
>> I am trying to access Wikipedia metadata in a streaming and
>> time/resource-sustainable manner. By metadata I mean many of the items
>> that can be found in the statistics of a wiki article, such as edits,
>> the list of editors, page views, etc.
>>
>> I would like to do this for an online-classifier type of structure:
>> retrieve the data from a large number of wiki pages at regular intervals
>> and use it as input for predictions.
>>
>>
>>
>> I tried to use the wiki API; however, it is time- and resource-expensive,
>> both for me and for Wikipedia.
>>
>>
>>
>> My preferred choice now would be to query the specific tables in the
>> Wikipedia database, in the same way this is done through the Quarry tool.
>> The problem is that I would like to build a standalone script, without
>> having to depend on a user interface like Quarry. Do you think that this
>> is possible? I am still fairly new to all of this, and I don’t know
>> exactly which is the best direction.
>>
>> I saw [1] that I could access the wiki replicas through both Toolforge
>> and PAWS, but I didn’t understand which one would serve me better. Could
>> I ask you for some feedback?
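>>
>> For illustration, what I have in mind is something like this sketch
>> (assuming it runs inside Toolforge, where the `toolforge` helper library
>> and the replica credentials file are available):
>>
>>   import toolforge  # Toolforge helper; reads ~/replica.my.cnf credentials
>>
>>   conn = toolforge.connect("enwiki")  # the same replica Quarry queries
>>   with conn.cursor() as cur:
>>       cur.execute(
>>           "SELECT rc_timestamp, rc_title FROM recentchanges "
>>           "WHERE rc_namespace = 0 ORDER BY rc_timestamp DESC LIMIT 5"
>>       )
>>       for ts, title in cur.fetchall():
>>           print(ts, title)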
>>
>>
>>
>> Also, as far as I understood [2], directly accessing the DB through Hive
>> is too technical for what I need, right? Especially because it seems that
>> I would need an account with production shell access, and I honestly
>> don’t think that I would be granted access to it. Also, I am not
>> interested in accessing sensitive or private data.
>>
>>
>>
>> The last resort is parsing the analytics dumps, however this seems a less
>> organic way of retrieving and polishing the data. It would also be
>> strongly decentralised and dependent on my physical machine, unless I
>> upload the polished data online every time.
>>
>>
>>
>> Sorry for this long message, but I thought it was better to give you a
>> clearer picture (hoping this is clear enough). Any hints you could give
>> me would be highly appreciated.
>>
>>
>>
>> Best,
>>
>> Cristina
>>
>>
>>
>> [1] https://meta.wikimedia.org/wiki/Research:Data
>>
>> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

Re: [Wikitech-l] Stuck/Missing Grid Job for tools.william-avery-bot

2021-03-26 Thread William Avery
Thanks Bryan,

It's now resumed its not particularly critical task:
https://www.wikidata.org/wiki/Special:Contributions/William_Avery_Bot

Will

On Fri, 26 Mar 2021 at 21:45, Bryan Davis  wrote:

> On Fri, Mar 26, 2021 at 3:27 PM William Avery wrote:
> >
> > Hi,
> >
> > I got the email below telling me that my cron job running as
> william-avery-bot had thrown an error, and I noticed that the Grid job that
> it kicks off hasn't run since.
> >
> > I tried deleting the job using the instructions at
> https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%E2%80%98qdel%E2%80%99_and_%E2%80%98jstop%E2%80%99
> but it appeared "stuck".
>
> I have "force deleted" your job using my Toolforge admin rights.
>
>   $ sudo qdel -f 749
>   root forced the deletion of job 749
>
> The Toolforge grid engine had numerous problems yesterday which led to
> the scheduler losing track of the state of many jobs. Brooke did
> several rounds of looking for these and cleaning the queue state, but
> obviously yours was not cleaned up in that process. Thank you for your
> report, and I hope you can get your tool back into its proper working
> state.
>
> Bryan
> --
> Bryan Davis  Technical Engagement  Wikimedia Foundation
> Principal Software Engineer   Boise, ID USA
> [[m:User:BDavis_(WMF)]]  irc: bd808
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Stuck/Missing Grid Job for tools.william-avery-bot

2021-03-26 Thread William Avery
Hi,

I got the email below telling me that my cron job running as
william-avery-bot had thrown an error, and I noticed that the Grid job that
it kicks off hasn't run since.

I tried deleting the job using the instructions at
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%E2%80%98qdel%E2%80%99_and_%E2%80%98jstop%E2%80%99
but it appeared "stuck".

"qstat -xml" outputs the following:

<?xml version='1.0'?>
<job_info xmlns:xsd="http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qstat/qstat.xsd">
  <queue_info>
    <job_list state="running">
      <JB_job_number>749</JB_job_number>
      <JAT_prio>0.25319</JAT_prio>
      <JB_name>cron-TaxonbarSyncerBot</JB_name>
      <JB_owner>tools.william-avery-bot</JB_owner>
      <state>dr</state>
      <JAT_start_time>2021-03-25T17:49:16</JAT_start_time>
      <queue_name>task@tools-sgeexec-0916.tools.eqiad.wmflabs</queue_name>
      <slots>1</slots>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>

But when I ssh to tools-sgeexec-0916.tools.eqiad.wmflabs I see no sign of
any processes under tools.william-avery-bot, except the ones associated
with my interactive session.

Can anyone help resolve this, or suggest a venue where I should raise it?

Thanks in advance,

Will

-- Forwarded message -
From: Cron Daemon 
Date: Thu, 25 Mar 2021 at 16:49
Subject: Cron  /usr/bin/jsub -N
cron-TaxonbarSyncerBot -once -quiet ~/TaxonbarSyncerBot.sh
To: 


error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host
"tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud": got send error
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l