Re: [PHP-DEV] domdocument loadhtml and encoding
On Fri, Jun 1, 2012 at 5:57 PM, Tjerk Meesters datib...@hotmail.com wrote:

> Gentlemen,
>
> Regarding this bug report: https://bugs.php.net/bug.php?id=49705
>
> As more developers move away from using regular expressions to parse HTML and start using DOMDocument, I've noticed that quite a few stumble over encoding issues. They're not bugs, because it's documented (I think) that if a document is loaded using ::loadHTMLFile() or if it contains a content-type meta tag which specifies the character encoding, it will work as expected.
>
> So far I've suggested a hack that involves adding the meta tag in front of the string that contains the HTML. As horrible as it seems, that does the job!
>
> That said, I'm hoping to get enough internals support to add a parameter to ::loadHTML() that sets or overrides the default character set when processing the document; when given, any meta tags pertaining to character set encoding should be ignored (AFAIK that's also the browser's behavior).
>
> Btw, there's another patch that also introduces a new parameter to ::loadHTML(), which has gone into the 5.4 branch (https://bugs.php.net/bug.php?id=54037), so it looks like this would be the second (optional) parameter then.
>
> Thoughts?

Would be nice. Bump.

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
[PHP-DEV] domdocument loadhtml and encoding
Gentlemen,

Regarding this bug report: https://bugs.php.net/bug.php?id=49705

As more developers move away from using regular expressions to parse HTML and start using DOMDocument, I've noticed that quite a few stumble over encoding issues. They're not bugs, because it's documented (I think) that if a document is loaded using ::loadHTMLFile() or if it contains a content-type meta tag which specifies the character encoding, it will work as expected.

So far I've suggested a hack that involves adding the meta tag in front of the string that contains the HTML. As horrible as it seems, that does the job!

That said, I'm hoping to get enough internals support to add a parameter to ::loadHTML() that sets or overrides the default character set when processing the document; when given, any meta tags pertaining to character set encoding should be ignored (AFAIK that's also the browser's behavior).

Btw, there's another patch that also introduces a new parameter to ::loadHTML(), which has gone into the 5.4 branch (https://bugs.php.net/bug.php?id=54037), so it looks like this would be the second (optional) parameter then.

Thoughts?

--
Tjerk
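
For readers who haven't seen the meta-tag hack mentioned above, here is a rough sketch of it. The helper name loadHtmlWithCharset() and the sample string are made up for illustration and are not part of ext/dom; the sketch assumes the input really is encoded in the charset you pass, and that libxml2 honors a content-type meta tag placed in front of the markup.

```php
<?php
// Hypothetical helper (NOT part of ext/dom): load an HTML string whose
// encoding is already known by prepending a content-type meta tag.
function loadHtmlWithCharset($html, $charset = 'UTF-8')
{
    // The meta tag tells libxml2's HTML parser how to decode the bytes
    // that follow it.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset='
          . $charset . '">';

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings
    // for sloppy markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample input: the UTF-8 bytes for "café", with no charset declared.
$html = "<p>caf\xC3\xA9</p>";

$doc  = loadHtmlWithCharset($html);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

Note that the injected meta element stays in the resulting tree, so code that serializes the document back out may want to remove it again. A dedicated parameter on ::loadHTML(), as proposed, would make this unnecessary.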