On Fri, Aug 5, 2011 at 2:10 PM, Richard Quadling <rquadl...@gmail.com> wrote:
> On 5 August 2011 13:04, Ferenc Kovacs <tyr...@gmail.com> wrote:
>> On Fri, Aug 5, 2011 at 1:43 PM, Richard Quadling <rquadl...@gmail.com> wrote:
>>> Hello all.
>>>
>>> During the last week, I've been converting the HTML Entities in phpdoc
>>> to their Unicode counterparts, in connection to
>>> http://news.php.net/php.doc.cvs/8536
>>>
>>> "Remove html entities (the english translation no longer uses any.. if
>>> this breaks translations then they should folow the english one, or if
>>> to much work, we can revert this commit)"
>>>
>>> In examining the translations, there are a significant number of files
>>> NOT encoded using UTF-8.
>>>
>>> As such, embedding a UTF-8 character in these files will produce garbage.
>>>
>>> As an English only speaker, I am not confident that my conversion from
>>> ISO encoding to UTF-8 encoding is accurate, and I have no
>>> realistic way to check.
>>>
>>> So, here is a list of all the files requiring someone with the
>>> language skills to look at them and manually convert them.
>>>
>>> If someone has a routine that can convert ISO encoded XML to UTF-8
>>> accurately, then I can apply that and then process the entities.
>>>
>>> cs/bookinfo.xml
>>> cs/faq/generanl.xml
>>> cs/reference/strings/functions/get-html-translation-table.xml
>>>
>>> hk/variables.xml
>>>
>>> hu/bookinfo.xml
>>> hu/language/control-structures.xml
>>> hu/reference/image/functions/imagearc.xml
>>> hu/reference/mbstring/functions/mb-strtoupper.xml
>>> hu/reference/recode/functions/recode-string.xml
>>
>> hi Richard
>>
>> I will fix it for the hungarian files.
>>
>> --
>> Ferenc Kovács
>> @Tyr43l - http://tyrael.hu
>>
>
> If you could detail what you do in terms of re-encoding, then I'm
> quite happy to rely on that process for the other files.
>
> At some stage, converting all the encoded files to UTF-8 would be a
> nice step, but that is a significant step. If/when that was
> undertaken, I'd suggest adding a pre-commit hook to reject non UTF-8
> encoded XML files from phpdoc.
>
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
>
Hi Richard,

First of all, you have to figure out what encoding the files are using, then re-encode the non-UTF-8 files from that encoding to UTF-8.

To get a list of the non-UTF-8 files, I used something like this:

    find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep -v 'charset=utf-8'

As I guessed, most of the Hungarian files are encoded with iso-8859-2. Some of them were us-ascii, but I removed those files, as they were just copied from en without any modification:

    find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep 'us-ascii'|cut -f1 -d ':'|xargs svn rm

With the us-ascii files removed, I was left with iso-8859 files only (iso-8859-2 to be precise; the file tool can't tell you that and labels them iso-8859-1, but you can work it out from the language and the XML encoding attribute). So I converted all of those files to UTF-8:

    find ./hu/trunk/ -type f|grep -v '\.svn'|xargs file -i|grep 'charset=iso-8859-1'|cut -f1 -d ':'|xargs recode iso-8859-2..utf-8

and replaced the iso-8859-2 occurrences in the files:

    find ./hu/trunk/ -type f|grep -v '\.svn'|cut -f1 -d ':'|xargs sed -i -e "s/\(encoding=[\'\"]\)iso-8859-2/\1utf-8/gI"

It should be noted that in some documentation files this expression would match and replace unintended text, for example in reference/xsl/examples.xml, so a better approach would be to parse the files as XML documents and change the encoding attribute.

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
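
PS: a rough sketch of a narrower variant of the last two steps, in case it helps. It is not a real XML parse, only a cheaper approximation: it re-encodes one file and then rewrites the encoding name on the first line only, so encoding="..." strings inside example code further down are left alone. The function name reencode_xml is made up for illustration, iconv is swapped in for recode, and it assumes the file really is ISO-8859-2 and that the XML declaration sits entirely on line 1.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): convert one file from
# ISO-8859-2 to UTF-8 and update the encoding named in the XML declaration.
# Assumption: the declaration sits entirely on the first line of the file.
reencode_xml() {
    src=$1
    conv=$(mktemp) || return 1
    out=$(mktemp) || { rm -f "$conv"; return 1; }

    # Step 1: re-encode (iconv stands in for recode here).
    # Step 2: rewrite the encoding name on line 1 only, so that
    #         encoding="..." strings inside examples stay as written.
    if iconv -f ISO-8859-2 -t UTF-8 "$src" > "$conv" &&
        sed "1s/\(encoding=[\"']\)[Ii][Ss][Oo]-8859-2/\1utf-8/" "$conv" > "$out"
    then
        # Overwrite in place so the file keeps its permissions.
        cat "$out" > "$src"
        status=0
    else
        # Leave the original file untouched if either step failed.
        status=1
    fi
    rm -f "$conv" "$out"
    return "$status"
}

# Example call, one file at a time:
#   reencode_xml hu/bookinfo.xml
```

Only run it on files that are still ISO-8859-2 (for instance the list produced by the file -i step above); running it on a file that is already UTF-8 would double-encode the non-ASCII characters.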