Hi Subbu!

I have only just started using WPCleaner to fix some of the errors reported by Linter, and I know I still have work to do on WPCleaner to make this easier for users. In the meantime, I have a few questions / suggestions regarding Linter:
- Is it possible to also retrieve the localized names of the Linter categories and priorities? For example, on frwiki, the Linter page [1] shows that the high priority is translated as "Priorité haute" and that self-closed-tag has the user-friendly name "Balises auto-fermantes". I don't see the localized names in the information returned by the API for siteinfo.

- Where is it possible to change the description displayed on each page dedicated to a category? For example, the page for self-closed-tag [2] is very short. It would be nice to be able to add a description of what the error is, what problems it can cause, and how to fix it (or to be able to link to a page explaining all that).

- On the page dedicated to a category, there's a column telling whether the problem is caused by a single template (and which one) or by several templates, but I don't get this information from the REST API for Linter. Is it possible to have it in the API result, or should I deduce it myself by checking whether the offset given by the API matches a call to a template?

[1] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors
[2] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors/self-closed-tag

On Thu, Jul 6, 2017 at 2:02 PM, Subramanya Sastry <ssas...@wikimedia.org> wrote:

> How to read this post?
> ----------------------
> * For those without time to read lengthy technical emails,
>   read the TL;DR section.
> * For those who don't care about all the details but want to
>   help with this project, you can read sections 1 and 2 about Tidy,
>   and then skip to section 7.
> * For those who like all their details, read the post in its entirety,
>   and follow the links.
>
> Please ask follow up questions on wiki *on the FAQ’s talk page* [0]. If
> you find a bug, please report it *on Phabricator or on the page mentioned
> above*.
>
> TL;DR
> -----
> The Parsing team wants to replace Tidy with a RemexHTML-based solution on
> the Wikimedia cluster by June 2018.
> This will require editors to fix pages and templates to address wikitext
> patterns that behave differently with RemexHTML. Please see 'What editors
> will need to do' section on the Tidy replacement FAQ [1].
>
> 1. What is Tidy?
> ----------------
> Tidy [2] is a library currently used by MediaWiki to fix some HTML errors
> found in wiki pages.
>
> Badly formed markup is common on wiki pages when editors use HTML tags in
> templates and on the page itself. (Ex: unclosed HTML tags, such as a
> <small> without a </small>, are common). In some cases, MediaWiki can
> generate erroneous HTML by itself. If we didn't fix these before sending
> it to browsers, some would display things in a broken way to readers.
>
> But Tidy also does other "cleanup" on its own that is not required for
> correctness. Ex: it removes empty elements and adds whitespace between
> HTML tags, which can sometimes change rendering.
>
> 2. Why replace it?
> ------------------
> Since Tidy is based on HTML4 semantics and the Web has moved to HTML5, it
> also makes some incorrect changes to HTML to 'fix' things that used to not
> work; for example, Tidy will unexpectedly move a bullet list out of a
> table caption even though that's allowed. HTML4 Tidy is no longer
> maintained or packaged. There have also been a number of bug reports filed
> against Tidy [3]. Since Parsoid is based on HTML5 semantics, there are
> differences in rendering between Parsoid's rendering of a page and current
> read view that is based on Tidy.
>
> 3. Project status
> -----------------
> Given all these considerations, the Parsing team started work to replace
> Tidy [4] around mid-2015. Tim Starling started this work and after a
> survey of existing options, decided to write a wrapper over a Java-based
> HTML5 parser. At the time we started the project, we thought we could
> probably have Tidy replaced by mid-2016. Alas!
>
> 4. What is replacing Tidy?
> --------------------------
> Tidy will be replaced by a RemexHTML-based solution that uses the
> RemexHTML [5] library along with some Tidy-compatibility shims to ensure
> better parity with the current rendering. RemexHTML is a PHP library that
> Tim wrote with C.Scott’s input that implements the HTML5 parsing spec.
>
> 5. Testing and followup
> -----------------------
> We knew that some pages will be affected and need fixing due to this
> change. In order to more precisely identify what that would be, we wanted
> to do some thorough testing. So, we built some new tools [6][7] and
> overhauled and upgraded other test infrastructure [8][9] to let us
> evaluate the impacts of replacing Tidy (among other such things in the
> future) which can be a subject of a post all on its own.
>
> You can find the details of our testing on the wiki [1][10], but we found
> that a large number of pages had rendering differences. We analyzed the
> results and categorized the source of differences. Based on that, to ease
> the process of replacement, we added a bunch of compatibility shims to
> mimic what Tidy does. I am skipping the details in this post. Even after
> that, newer testing showed that this nevertheless still leaves us with a
> few patterns that need fixing that we cannot / don't want to work around
> automatically.
>
> 6. Tools to assist editors: Linter & ParserMigration
> ----------------------------------------------------
> In October 2016, at the parsing team offsite, Kunal ([[User:Legoktm
> (WMF)]]) dusted off the stalled wikitext linting project [11] and (with
> the help from a bunch of people on the Parsoid, db/security/code review
> areas) built the Linter extension that surfaces wikitext errors that
> Parsoid knows about to let editors fix them.
>
> Earlier this year, we decided to use Linter in service of Tidy replacement.
> Based on our earlier testing results, we have added a set of high-priority
> linter categories that identifies specific wikitext markup patterns on
> wiki pages that need to be fixed [12].
>
> Separately, Tim built the ParserMigration extension to let editors
> evaluate their fixes to pages [13]. You can enable this in your editing
> preferences or replace '&action=edit' in your url bar with
> '&action=parsermigration-edit'.
>
> 7. What editors have to do
> --------------------------
> The part that you have all been waiting for!
>
> Please see 'What editors will need to do' section on the Tidy replacement
> FAQ [1]. We have added simplified instructions, so that even community
> members who do not consider themselves "techies" can still learn about
> ways to fix pages. We'll keep that section up to date based on feedback
> and questions. But since it is a wiki, please also edit and tweak as
> required to make the text useful for yourselves! This is a first call for
> fixes and it is about the problems defined as "high priority". We'll issue
> other calls in the future for any other necessary Tidy fixups.
>
> Caveats:
>
> * As noted on that page, the linter categories don't cover all the
>   possible sources of rendering differences. For example, there is still
>   T157418 [14] left to address. For those who have an opinion about this,
>   please chime in on that task. We are still evaluating the best solution
>   for this without adding more cruft to wikitext behavior or kicking the
>   cleanup can down the road.
>
> * As the issues in the identified linter categories are fixed, we might be
>   better able to isolate other issues that need addressing.
>
> 8. So, when will Tidy actually be replaced?
> -------------------------------------------
> We really would like to get Tidy removed from the cluster latest by June
> 2018 (or sooner if possible), and your assistance and prompt attention to
> these markup issues would be very helpful.
> We will do this in a phased manner on different wikis rather than all at
> once on all wikis.
>
> We really want to do this as smoothly as possible without disrupting the
> work of editors or affecting the rendering of the large corpus of pages on
> the various wikis. As you might have gathered from the text above, we have
> built and leveraged a wide variety of tools to assist with this.
>
> 9. Monitoring progress
> ----------------------
> In order to monitor progress, we plan to do a weekly (or some such
> periodic frequency) test run that compares the rendering of pages with
> Tidy and with RemexHTML on a large sample of pages (in the 50K range) from
> a large subset of Wikimedia wikis (~50 or so). This will give us a pulse
> of how fixups are going, and when we might be able to flip the switch on
> different wikis.
>
> Subramanya (Subbu) Sastry
> Parsing Team.
>
> References
> ----------
> 0. https://www.mediawiki.org/wiki/Talk:Parsing/Replacing_Tidy/FAQ
> 1. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#What_will_editors_need_to_do.3F
> 2. https://en.wikipedia.org/wiki/HTML_Tidy
> 3. https://phabricator.wikimedia.org/tag/tidy/
> 4. https://phabricator.wikimedia.org/T89331
> 5. https://github.com/wikimedia/mediawiki-libs-RemexHtml
> 6. https://phabricator.wikimedia.org/T120345
> 7. https://github.com/wikimedia/integration-uprightdiff
> 8. https://github.com/wikimedia/integration-visualdiff
> 9. https://github.com/wikimedia/mediawiki-services-parsoid-testreduce
> 10. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy
> 11. https://phabricator.wikimedia.org/T48705
> 12. https://www.mediawiki.org/wiki/Help:Extension:Linter#Goal:_Replacing_Tidy
> 13. https://www.mediawiki.org/wiki/Help:Extension:Linter#Verifying_fixes_for_these_lint_categories
> 14. https://phabricator.wikimedia.org/T157418
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
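
P.S. To make the third question more concrete, here is a rough sketch (in Python, for brevity) of what "deducing it myself" could look like on the client side. It is only an illustration under assumptions: the function names are hypothetical, the offsets are assumed to be character indices into the page's wikitext with an exclusive end, and the brace matching is naive (it ignores <nowiki>, comments, and {{{parameter}}} syntax). It does not use or describe the Linter API itself.

```python
def find_template_calls(wikitext):
    """Return (start, end, name) for each top-level {{...}} call.

    `end` is exclusive. Hypothetical helper: naive brace matching only.
    """
    calls = []
    stack = []  # start indices of the "{{" openers that are still open
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            stack.append(i)
            i += 2
        elif pair == "}}" and stack:
            start = stack.pop()
            if not stack:  # we just closed a top-level call
                inner = wikitext[start + 2:i]
                name = inner.split("|", 1)[0].strip()
                calls.append((start, i + 2, name))
            i += 2
        else:
            i += 1
    return calls


def template_covering(wikitext, start, end):
    """Return the name of the top-level template call that fully contains
    the range [start, end), or None if the range is outside any call."""
    for t_start, t_end, name in find_template_calls(wikitext):
        if t_start <= start and end <= t_end:
            return name
    return None


# Usage: locate two reported ranges in a small piece of wikitext.
text = "Intro {{Infobox|x=<br/>}} and <span/> here"
inside = text.index("<br/>")
outside = text.index("<span/>")
print(template_covering(text, inside, inside + len("<br/>")))
print(template_covering(text, outside, outside + len("<span/>")))
```

Having the template information directly in the API result would of course be more reliable than this kind of guesswork, since only the parser knows which template actually produced the faulty output.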