Subramanya Sastry wrote:
>* Unclosed HTML tags (very common)
>* Misnested tags
>* Misnesting of tags (ex: links in links .. [http://foo.bar this is a
>[[foobar]] company])
>* Fostered content in tables
>(<table>this-content-will-show-up-outside-the-table<tr><td>....
></td></tr></table>)
>... this has been one of the biggest source of complexity inside Parsoid
>... in combination with templates, this is nasty.
>* Other ways in which HTML5 content model might be violated. (ex:
><small>\n*a\n*b\n</small>)
>* Look at the parser tests file and see all the tests we've added with
>annotations that say "php parser relies on tidy"
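
For concreteness, the first two categories in the quoted list (unclosed and misnested tags) can be detected with a simple tag stack. The following is only a minimal sketch using Python's standard html.parser module; the names TagBalanceChecker and check are made up for illustration, and this is not how Tidy or Parsoid actually work. It does not attempt to cover fostered table content or other HTML5 content model violations.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records tags that are closed out of order."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open, non-void tags
        self.problems = []   # (kind, tag) tuples in order of discovery

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            # A close tag with no matching open tag.
            self.problems.append(("stray-close", tag))
            return
        # Tags opened after `tag` and still open are closed out of order.
        while self.stack[-1] != tag:
            self.problems.append(("misnested", self.stack.pop()))
        self.stack.pop()


def check(markup):
    """Return a list of (kind, tag) problems found in `markup`."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    problems = list(checker.problems)
    # Whatever is still open at end of input was never closed.
    problems.extend(("unclosed", tag) for tag in checker.stack)
    return problems


if __name__ == "__main__":
    print(check("<b><i>x</b></i>"))
    print(check("<div><p>text"))
```

A report like this could be the basis for flagging pages whose markup needs manual cleanup, as opposed to silently repairing it on every render.
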

I don't see why we would want to incur the maintenance cost of continuing to support any of these bad inputs. I think we should look to deprecate, not replace, Tidy. This is a case of the cure being worse than the disease.

>So, you cannot just rip out Tidy and not replace it with something in
>its place. Even replacing it with a HTML5 parser (as per the current
>plan) is not entirely straightforward simply because of all the other
>unrelated-to-html5-semantics behavior. Part of the task of replacing
>Tidy is to figure out all the ways those pages might break and the best
>way to handle that breakage.

We shouldn't rip out Tidy immediately. Instead, we should implement a means of disabling Tidy on a per-page or per-user basis and allow the wiki process to correct bad markup over time. Cunningham's Law applies here.

MZMcBride

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l