On 2010/11/05 8:30, Markus Scherer wrote:

If the conversion libraries you are using do not support this (I don't
know), then you could ask for such options. Or use conversion libraries that
do support such options (like ICU and Java).

The encoding conversion library in Ruby 1.9 also supports this. Here's an example:

>>>>
utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_clean = utf16_borken.encode('UTF-8',
                                 invalid: :replace, replace: '')
puts utf8_clean      # prints "abcd"
>>>>

In general, and in particular for Unicode Encoding Forms, it's a bad idea to just "replace with nothing", because of the security implications this might have: deleting an invalid sequence can join pieces of text that an earlier check had seen as separate. I guess that's the reason Perl doesn't allow this. But if you are sure there are no security implications, then there is no reason not to remove lone surrogates.
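
To make the security concern a bit more concrete, here is a minimal sketch along the same lines as the example above. The input string and the variable names are made up for illustration; the only library behavior it relies on is that String#encode with invalid: :replace and no :replace option falls back to U+FFFD for Unicode encodings.

```ruby
# Hypothetical input: "<scr", a lone high surrogate (U+D800), then "ipt>",
# packed as big-endian 16-bit code units and labeled as UTF-16BE.
units = "<scr".unpack('U*') + [0xD800] + "ipt>".unpack('U*')
utf16_broken = units.pack('n*').force_encoding('UTF-16BE')

# Replacing with nothing silently joins the two halves into a new token.
dropped = utf16_broken.encode('UTF-8', invalid: :replace, replace: '')

# Leaving out :replace uses the default replacement character, U+FFFD,
# so the damage stays visible and the halves stay apart.
marked = utf16_broken.encode('UTF-8', invalid: :replace)

puts dropped.include?('<script>')   # true
puts marked.include?('<script>')    # false
```

So if the text passes through any kind of filter before or after conversion, replacing with U+FFFD is probably the safer default, and removal is something to choose only when you know the data is not used that way.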

Regards,   Martin.


P.S.: Why would you use Ruby for conversion when programming in Perl? You could just as well program in Ruby, it's much more fun!


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:[email protected]
