Re: [PHP-DOC] HTML entities to UTF-8.

Ferenc Kovacs Fri, 05 Aug 2011 07:37:56 -0700

On Fri, Aug 5, 2011 at 2:10 PM, Richard Quadling <rquadl...@gmail.com> wrote:
> On 5 August 2011 13:04, Ferenc Kovacs <tyr...@gmail.com> wrote:
>> On Fri, Aug 5, 2011 at 1:43 PM, Richard Quadling <rquadl...@gmail.com> wrote:
>>> Hello all.
>>>
>>> During the last week, I've been converting the HTML Entities in phpdoc
>>> to their Unicode counterparts, in connection to
>>> http://news.php.net/php.doc.cvs/8536
>>>
>>> "Remove html entities (the english translation no longer uses any.. if
>>> this breaks translations then they should folow the english one, or if
>>> to much work, we can revert this commit)"
>>>
>>>
>>> In examining the translations, there are a significant number of files
>>> NOT encoded using UTF-8.
>>>
>>> As such, embedding a UTF-8 character in these files will produce garbage.
>>>
>>> As an English only speaker, I am not confident that my conversion from
>>> ISO encoding to UTF-8 encoding is accurate, and I have no
>>> realistic way to check.
>>>
>>> So, here is a list of all the files requiring someone with the
>>> language skills to look at them and manually convert them.
>>>
>>>
>>> If someone has a routine that can convert ISO encoded XML to UTF-8
>>> accurately, then I can apply that and then process the entities.
>>>
>>>
>>> cs/bookinfo.xml
>>> cs/faq/generanl.xml
>>> cs/reference/strings/functions/get-html-translation-table.xml
>>>
>>> hk/variables.xml
>>>
>>> hu/bookinfo.xml
>>> hu/language/control-structures.xml
>>> hu/reference/image/functions/imagearc.xml
>>> hu/reference/mbstring/functions/mb-strtoupper.xml
>>> hu/reference/recode/functions/recode-string.xml
>>
>> hi Richard
>>
>> I will fix it for the Hungarian files.
>>
>>
>> --
>> Ferenc Kovács
>> @Tyr43l - http://tyrael.hu
>>
>
> If you could detail what you do in terms of re-encoding, then I'm
> quite happy to rely on that process for the other files.
>
> At some stage, converting all the encoded files to UTF-8 would be a
> nice step, but that is a significant step. If/when that was
> undertaken, I'd suggest adding a pre-commit hook to reject non UTF-8
> encoded XML files from phpdoc.
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>


Hi Richard,

First of all, you have to figure out which encoding the files are
using, then re-encode the non-UTF-8 files from that encoding to UTF-8.

To get a list of the non-UTF-8 files, I used something like this:
find ./hu/trunk/ -type f | grep -v '\.svn' | xargs file -i | grep -v 'charset=utf-8'
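
A stricter check than the charset guess from file -i is to ask iconv
whether each file actually decodes as UTF-8. The following is only a
sketch under assumptions, not the command used above: it needs iconv
on the PATH, and the helper names is_utf8 and list_non_utf8 are made
up for illustration.

```shell
# Sketch: list files below a directory that are not valid UTF-8.
# Assumes iconv is installed. is_utf8 and list_non_utf8 are
# hypothetical helper names, not part of any existing tool.

# Succeeds (exit status 0) only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Print every regular file below $1 that fails the check, skipping
# Subversion metadata. File names containing newlines are not handled.
list_non_utf8() {
    find "$1" -type f ! -path '*/.svn/*' | while IFS= read -r f; do
        is_utf8 "$f" || printf '%s\n' "$f"
    done
}
```

It could be called as list_non_utf8 ./hu/trunk/. Pure us-ascii files
pass this check, since ASCII is a subset of UTF-8, so they are not
listed here, whereas the file -i pipeline above does list them.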

As I guessed, most of the Hungarian files are encoded in iso-8859-2.
Some of them were us-ascii, but I removed those files, as they had just
been copied from en without any modification:
find ./hu/trunk/ -type f | grep -v '\.svn' | xargs file -i |
  grep 'us-ascii' | cut -f1 -d ':' | xargs svn rm

With the us-ascii files removed, I was left with iso-8859 files only
(iso-8859-2, to be precise; the file tool can't tell you that and
reports them as iso-8859-1, but you can work it out from the language
and the XML encoding attribute).

So I converted all of those files to UTF-8:
find ./hu/trunk/ -type f | grep -v '\.svn' | xargs file -i |
  grep 'charset=iso-8859-1' | cut -f1 -d ':' | xargs recode iso-8859-2..utf-8
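
Where recode is not installed, iconv can do the same job. This is a
swapped-in alternative, not the command used above, and only a sketch:
the function name convert_to_utf8 is hypothetical, and the source
encoding has to be supplied by whoever knows the language, because it
cannot be detected reliably.

```shell
# Sketch: re-encode one file to UTF-8 in place using iconv.
# convert_to_utf8 is a hypothetical helper name.
# Usage: convert_to_utf8 FILE [SOURCE_ENCODING]
# SOURCE_ENCODING defaults to ISO-8859-2 (an assumption that fits the
# Hungarian files only).
convert_to_utf8() {
    _src=$1
    _from=${2:-ISO-8859-2}
    _out=$(mktemp) || return 1
    # Convert into a temporary file first, so a failed conversion
    # leaves the original file untouched.
    if iconv -f "$_from" -t UTF-8 "$_src" > "$_out"; then
        # Overwrite the content rather than moving the temp file,
        # which keeps the original permissions.
        cat "$_out" > "$_src"
        rm -f "$_out"
    else
        rm -f "$_out"
        return 1
    fi
}
```

The source encoding matters: the byte 0xF5 is ő in iso-8859-2 but õ in
iso-8859-1, so converting with the wrong one silently produces the
wrong letters. Running the conversion twice on the same file also
double-encodes it, so it should only be applied to files that are
known not to be UTF-8 yet.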

And replaced the iso-8859-2 occurrences in the files:
find ./hu/trunk/ -type f | grep -v '\.svn' | cut -f1 -d ':' |
  xargs sed -i -e "s/\(encoding=[\'\"]\)iso-8859-2/\1utf-8/gI"

It should be noted that in some of the documentation this expression
would match and replace unintended text, such as in
reference/xsl/examples.xml, so a better approach would be to parse the
files as XML documents and change the encoding attribute.
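
Short of a full XML parse, one way to limit the damage is to touch
only the XML declaration. The sketch below is an assumption-laden
middle ground, not a real parser: it relies on the declaration sitting
entirely on the first line of the file (XML requires it at the very
start of the document), and the function name fix_xml_declaration is
hypothetical.

```shell
# Sketch: change encoding="iso-8859-2" to utf-8 in the XML declaration
# only, leaving any later occurrences (for example inside examples)
# alone. fix_xml_declaration is a hypothetical helper name.
# Assumes the declaration is written on line 1 of the file.
fix_xml_declaration() {
    _decl_out=$(mktemp) || return 1
    # The leading "1" restricts the substitution to the first line;
    # [Ii][Ss][Oo] makes the match case-insensitive without relying
    # on GNU sed extensions.
    if sed -e "1s/^\(<?xml[^>]*encoding=[\"']\)[Ii][Ss][Oo]-8859-2\([\"']\)/\1utf-8\2/" \
        "$1" > "$_decl_out"; then
        cat "$_decl_out" > "$1"
        rm -f "$_decl_out"
    else
        rm -f "$_decl_out"
        return 1
    fi
}
```

A file whose declaration is split across several lines would be left
unchanged by this and would still need a proper XML-aware fix.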

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu
