On Wednesday, 28 June 2017 at 18:08:12 UTC, aberba wrote:
> I wanted strip_tags() for sanitization in vibe.d and I set out
> for algorithms on how to do it and came across this JavaScript
> library at
> string stripTags(string input, in string[] allowedTags = [])
> {
>     import std.regex: Captures, replaceAll, ctRegex;
>     auto regex = ctRegex!(`</?(\w*)>`);
Ouch, parsing HTML or XML with regular expressions is problematic.
What people generally don't realize is that > does not have to be
encoded as an entity when it appears in the data. This means that
<thing attr="Hello >"> and <data>></data> are perfectly legal.
Regular expressions may break when they encounter them.
http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
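
To make the problem concrete, here is a minimal sketch of how such input trips up regex-based stripping. The helper names stripSimple and stripGreedy are made up for illustration. stripSimple uses the pattern from the quoted snippet (compiled at run time with regex instead of ctRegex, for brevity), and stripGreedy uses the common "everything up to the next >" pattern, which is not from the original post.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;

// Hypothetical helper: removes tags using the pattern from the quoted snippet.
string stripSimple(string input)
{
    auto tagPattern = regex(`</?(\w*)>`);
    return replaceAll(input, tagPattern, "");
}

// Hypothetical helper: removes "everything from < up to the next >".
string stripGreedy(string input)
{
    auto tagPattern = regex(`<[^>]*>`);
    return replaceAll(input, tagPattern, "");
}

void main()
{
    // Plain tags without attributes are removed.
    writeln(stripSimple("<b>bold</b>"));                      // bold

    // The opening tag has an attribute, so the pattern never matches it.
    writeln(stripSimple(`<thing attr="Hello >">x</thing>`));  // <thing attr="Hello >">x

    // The match ends at the > inside the attribute value.
    writeln(stripGreedy(`<thing attr="Hello >">x</thing>`));  // ">x
}
```

Neither pattern knows that the > sits inside a quoted attribute value, so one leaves the whole opening tag in place and the other cuts it off in the middle. A real HTML parser tracks that state, which is why it is the safer choice for sanitization.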