On 07/23/2013 06:55 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry
<ssas...@wikimedia.org> wrote:
http://parsoid.wmflabs.org:8001/stats

This is the url for our round trip testing on 160K pages (20K each from 8
wikipedias).
Very minor point .. there are ~400 missing pages on the list; is that
intentional ? ;-)

One is 'Mos:time' which is in NS 0, and does actually exist as a
redirect to the WP: manual of style:
https://en.wikipedia.org/wiki/Mos:time

1. Some pages get deleted and then go 404. (http://parsoid.wmflabs.org:8001/failedFetches)
2. There are some (known) bugs in our rt testing infrastructure around recording results -- should be fixed once our testing infrastructure is updated and moved to mysql (from sqlite)

...
But, 99.6% means that 0.4% of pages still had corruptions, and that 15% of
pages had syntactic dirty diffs.
So 15% is 24000 pages which can bust, but may not if the edit doesn't
touch the bustable part.

No, 15% of pages aren't bust. 15% of pages introduce meaning-preserving (hence purely syntactic) dirty diffs, depending on which piece of the page is edited. For example, whitespace diffs and the addition of quotes (") around attribute values are the most common ones.

For an example, see this: http://parsoid.wmflabs.org:8001/result/d5fe6c9052c23bcc0b63a4d0d1b3e5b68fd2ef37/en/Ketill_Flatnose

0.4% (~640) of pages are classified as semantic diffs. We assign a numerical score in base 1000 (digit 3 = # errors, digit 2 = # semantic errors, digit 1 = # syntactic errors). When results are sorted in descending order of score, the most egregious pages come first (crashers first, semantic errors next, purely dirty diffs last).
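As a rough illustration of the base-1000 score described above, here is a minimal Python sketch. It is not Parsoid's actual code: the function names, the clamping of each count to 0..999, the reading of "digit 1" as the least significant digit, and the page names are all assumptions made for the example.

```python
BASE = 1000  # each "digit" of the score holds one count, 0..999


def rt_score(errors: int, semantic: int, syntactic: int) -> int:
    """Pack three counts into one base-1000 number.

    Assumption: digit 3 (most significant) = errors/crashers,
    digit 2 = semantic diffs, digit 1 (least significant) = syntactic diffs.
    """
    def digit(n: int) -> int:
        # Clamp so a large count cannot spill into the next digit (assumption).
        return max(0, min(n, BASE - 1))

    return digit(errors) * BASE**2 + digit(semantic) * BASE + digit(syntactic)


def rt_counts(score: int) -> tuple[int, int, int]:
    """Unpack a score back into (errors, semantic, syntactic)."""
    errors, rest = divmod(score, BASE**2)
    semantic, syntactic = divmod(rest, BASE)
    return errors, semantic, syntactic


# Hypothetical results for three pages.
results = {
    "Page_A": rt_score(0, 0, 12),  # only syntactic dirty diffs
    "Page_B": rt_score(0, 3, 1),   # some semantic diffs
    "Page_C": rt_score(1, 0, 0),   # one hard error
}

# Descending order of score: crashers, then semantic, then syntactic.
ranked = sorted(results, key=results.get, reverse=True)
```

With this packing, a single error outranks any number of semantic or syntactic diffs, and a single semantic diff outranks any number of syntactic ones, which is the ordering the sort relies on.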

So, going to http://parsoid.wmflabs.org:8001/topfails and paging through that will give you what you are looking for. That is 16 pages with 40 entries each. We hang out on #mediawiki-parsoid and can help editors make sense of the diffs if anyone wants to look for broken wikitext and fix it.
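For reference, the figures quoted in this thread can be checked with a few lines of Python. This is only arithmetic on the numbers given above (160K pages, 0.4%, 15%, 16 pages of 40 entries); the variable names are made up for the example, and it says nothing about what /topfails actually lists.

```python
# 20K pages from each of 8 wikipedias.
TOTAL_PAGES = 8 * 20_000

# Integer arithmetic avoids floating-point rounding on the percentages.
semantic_pages = TOTAL_PAGES * 4 // 1000   # 0.4% of pages
syntactic_pages = TOTAL_PAGES * 15 // 100  # 15% of pages

# 16 listing pages with 40 entries each.
topfails_entries = 16 * 40

print(TOTAL_PAGES, semantic_pages, syntactic_pages, topfails_entries)
```

The 16 x 40 entries come to the same number as the ~640 pages with semantic diffs, not the 24000 pages with syntactic dirty diffs.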

Subbu.

Does /topfails cycle through all 24000, 40 pages at a time?

Could you provide a dump of the list of 24000 bustable pages?  Split
by project?  Each community could then investigate those pages for
broken tables, and more critically .. templates which emit broken
wikisyntax that is causing your team grief.

Do you have stats on each of those eight wikipedias? i.e. are there
noticeable differences in the percentages on different wikipedias? If
so, can you report those percentages for each project?  I'm guessing
Chinese is an example where there are higher percentages..?

--
John Vandenberg

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l