I am trying to decode strings of uncertain UTF origin, and Encode::Guess seems to be the way to go.


So I am opening a file "normally" and just reading it line by line. I pass each line of text through Encode::Guess, which I have loaded like this:

use Encode::Guess qw(UTF-8 UTF-16BE); #I may add more in future
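Roughly, the loop looks like the following untested reduction (the file name is just a placeholder):

    use strict;
    use warnings;
    use Encode::Guess qw(UTF-8 UTF-16BE);   # I may add more in future

    open my $fh, '<', 'data.txt' or die "open: $!";   # placeholder name, opened "normally"
    while ( my $line = <$fh> ) {
        chomp $line;                        # only removes a trailing "\n" -- see my questions below
        my $enc = guess_encoding($line);    # guess_encoding() is exported by Encode::Guess
        if ( ref $enc ) {
            my $string = $enc->decode($line);
            # ... work with the decoded $string ...
        }
        else {
            warn "can't guess: $enc\n";     # on failure $enc holds the error message
        }
    }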

Now, what I read in is *usually* UTF-8 and all is good. But when a UTF-16BE string comes along, here is what happens:

Encode/Guess.pm: 92
  DB<2> x $octet
0  "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@:[EMAIL PROTECTED]@[EMAIL 
PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]"

Encode/Guess.pm: 94
  DB<3> x $line
0  "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@:[EMAIL PROTECTED]@[EMAIL 
PROTECTED]@[EMAIL PROTECTED]@"

*Now* when it tests the decode of a UTF-16BE string, the candidate _always_ comes up one byte short, so it never registers as a successful decode, even though UTF-16BE is what the data really is.

We should have:
0  "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@:[EMAIL PROTECTED]@[EMAIL 
PROTECTED]@[EMAIL PROTECTED]"

Changing the split inside Guess.pm to also include "\000+" fixes this problem. But it would break for UTF-16LE, right?
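For reference, here is a throwaway check (not part of my real code) of why I suspect that: the NUL bytes of the line-ending characters land on opposite sides of the \x0D/\x0A in the two byte orders.

    use Encode qw(encode);

    # "\r\n" in the two byte orders
    print unpack('H*', encode('UTF-16BE', "\r\n")), "\n";   # 000d000a
    print unpack('H*', encode('UTF-16LE', "\r\n")), "\n";   # 0d000a00

So a split that swallows the NULs on one side of the \r/\n seems to help BE but would leave a stray NUL stuck to the neighbouring line for LE.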

Points:
- what is the best way to open and read data that might be UTF-8, UTF-16, UTF-16BE, or UTF-16LE?
- is there a good way to chop the line endings reliably for those four encodings?
- maybe detecting the flavour of Unicode is better left to a different process? Encode::Guess::Unicode? (See the untested sketch after this list for the sort of thing I mean.)
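
By "a different process" I mean something like this untested sketch: guess once per file from the raw octets, decode the whole thing, and only then split into lines. The file name is a placeholder, and whether UTF-16LE can be added to the suspect list without making the guess ambiguous is part of what I am unsure about.

    use strict;
    use warnings;
    use Encode::Guess qw(UTF-8 UTF-16BE);   # same suspect list as above

    open my $fh, '<:raw', 'data.txt' or die "open: $!";   # placeholder name
    my $octets = do { local $/; <$fh> };                   # slurp the raw bytes
    close $fh;

    my $enc = guess_encoding($octets);
    ref $enc or die "can't guess encoding: $enc";

    my $text = $enc->decode($octets);
    for my $line ( split /\r\n|\n|\r/, $text ) {
        # line endings are already gone here, whatever the original encoding was
        # ... process $line ...
    }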


Please advise. Perhaps just a documentation expansion is necessary, and I can help with that based on this matter.
Jay



