I am trying to decode strings of uncertain UTF origin, and Encode::Guess seems to be the way to go....
So I am opening the file "normally" and just reading it line by line, passing each line through Encode::Guess, which I have used thus:

    use Encode::Guess qw(UTF-8 UTF-16BE);  # I may add more suspects in future
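For context, the read loop looks roughly like this (the file name and the per-line work are placeholders, not my real code):

```perl
use strict;
use warnings;
use Encode::Guess qw(UTF-8 UTF-16BE);  # suspect encodings; I may add more in future

my $path = $ARGV[0] // 'input.txt';    # placeholder file name
if (-e $path) {
    open my $fh, '<:raw', $path or die "open: $!";  # :raw so no I/O layer touches the octets
    while (my $octets = <$fh>) {
        my $decoder = Encode::Guess->guess($octets);
        ref $decoder or die "guess failed: $decoder";  # on failure, guess() returns an error string
        my $line = $decoder->decode($octets);
        # ... work with $line ...
    }
}
```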
Now what I read in is *usually* UTF-8 and all is good. But if a UTF-16BE string comes along, here is what happens:
    Encode/Guess.pm: 92
    DB<2> x $octet
    0  "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@:[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]"

    Encode/Guess.pm: 94
    DB<3> x $line
    0  "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@:[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@"
*NOW*, when it tests the decode of a UTF-16BE string, $line will _always_ come up one byte short, so the UTF-16BE decode never succeeds even though that is what the data really is.
We should have:

    0  "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@:[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]"
Changing the split to include "\000+" fixes this problem. But it would break for UTF-16LE, right?
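The underlying issue can be seen by just encoding a newline both ways; this is only an illustration, not a fix:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\n" is two bytes in UTF-16, and the NUL half sits on a different
# side of the 0x0A byte depending on byte order:
my $be = encode('UTF-16BE', "hi\n");   # "\0h\0i\0\x0A"
my $le = encode('UTF-16LE', "hi\n");   # "h\0i\0\x0A\0"

# So any split of the raw octets on 0x0A cuts a UTF-16 character in
# half: for BE the stray "\0" trails the piece, so its byte count is
# odd and the decode fails; for LE the stray "\0" leads the *next*
# piece instead. Eating "\000+" on one side of the separator only
# can therefore never be right for both byte orders.
```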
Points:
- What is the best way to open and read data that might be UTF-8, UTF-16, UTF-16BE, or UTF-16LE?
- Is there a good way to chop the line endings reliably for the above four encodings?
- Maybe detecting the flavour of Unicode is better left to a different process?
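One approach for the first two points, sketched under the assumption that the UTF-16 files carry a BOM (it ignores the 3-byte UTF-8 BOM, and BOM-less UTF-16 would still need guessing): sniff the first two bytes, then push an :encoding layer so readline and chomp work on characters for all four cases. sniff_encoding is a hypothetical helper here, not part of Encode.

```perl
use strict;
use warnings;

# Hypothetical helper: pick an encoding from the first two bytes,
# falling back to UTF-8 for BOM-less input.
sub sniff_encoding {
    my ($fh) = @_;
    read($fh, my $bom, 2);
    return 'UTF-16BE' if defined $bom && $bom eq "\xFE\xFF";
    return 'UTF-16LE' if defined $bom && $bom eq "\xFF\xFE";
    seek($fh, 0, 0);               # no 16-bit BOM: rewind, assume UTF-8
    return 'UTF-8';
}

my $file = $ARGV[0] // 'input.txt';   # placeholder file name
if (-e $file) {
    open my $fh, '<:raw', $file or die "open: $!";
    my $enc = sniff_encoding($fh);
    binmode $fh, ":encoding($enc)";   # decoding now happens in the I/O layer
    while (my $line = <$fh>) {
        chomp $line;                  # line endings are plain "\n" after decoding
        $line =~ s/\r\z//;            # and drop any CR left over from CRLF files
        # ... work with $line ...
    }
}
```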
Encode::Guess::Unicode?
Please advise - perhaps just a documentation expansion is necessary, and I can help with that based on this matter.
Jay