Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Martin J. Dürst Fri, 05 Nov 2010 02:28:00 -0700

On 2010/11/05 8:30, Markus Scherer wrote:

If the conversion libraries you are using do not support this (I don't
know), then you could ask for such options. Or use conversion libraries that
do support such options (like ICU and Java).

The encoding conversion library in Ruby 1.9 also supports this. Here's an example:


>>>>
utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_clean = utf16_borken.encode('UTF-8',
                                 invalid: :replace, replace: '')
puts utf8_clean      # prints "abcd"
>>>>

In general, and in particular for Unicode Encoding Forms, it's a bad idea to just "replace with nothing", because of the security implications this might have. I guess that's the reason Perl doesn't allow this. But if you are sure there are no security implications, then there is no reason not to remove lone surrogates.
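
If dropping data silently is a concern, a more conservative variant is to check the string first and mark each lone surrogate with U+FFFD (REPLACEMENT CHARACTER) instead of deleting it. The following is only a sketch under that assumption; the variable names are made up for illustration, and the input is the same broken string as in the example above.

```ruby
# Same broken input as above: "ab", a lone high surrogate (D800), then "cd".
utf16_input = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')

# Report the problem before repairing anything.
unless utf16_input.valid_encoding?
  warn "input contains invalid UTF-16BE sequences"
end

# Mark each invalid sequence with U+FFFD instead of removing it,
# so the damage stays visible in the converted text.
utf8_marked = utf16_input.encode('UTF-8',
                                 invalid: :replace, replace: "\uFFFD")

puts utf8_marked.count("\uFFFD")   # number of replaced sequences
p utf8_marked                      # => "ab\uFFFDcd"
```

Keeping a visible marker means two strings that differ only by a broken surrogate do not suddenly compare equal after conversion, which is the kind of surprise that removal can cause.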


Regards,   Martin.

P.S.: Why would you use Ruby for conversion when programming in Perl? You could just as well program in Ruby, it's much more fun!



--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:[email protected]
