Re: dealing with UTF8 text

Joel Rees Thu, 31 Mar 2005 07:22:33 -0800


On 2005.3.31, at 10:18 AM, Avi Rappoport wrote:

Hi old friends (and new),
I'm quite enjoying getting back to scripting, and like Perl a lot, especially with Affrus. While I'm probably inefficient, it's nice to have a language actually designed for text processing (search engine logs, in my case). However, I've got some Unicode issues and that seems to be platform-specific, so thought I'd ask here.

Have you done "perldoc perlunicode" and used that as a lullaby for several afternoon naps in a row? Used the stuff referred to there for a few more afternoon naps? (perldoc always seems to put me to sleep, but if I don't open it up and stare at it in spite of the soporific effect, nothing seeps in at all.) Have you gone to unicode.org and scanned what they have to offer relevant to the character ranges (languages) you need to be parsing? Have you looked up the traditional encodings for your language/locale, particularly the Microsoft (bleaugh) code pages? (Google or your other favorite search engines can help.)

I've done enough research to know that I should avoid hardcoded counting with positions and use the perl functions which will automatically handle utf8 characters properly. That's cool. I'm pretty sure I'm reading in utf8 and comparisons seem to work.

Comparisons can seem to work when the encoding is all off, as long as the input is being munged the same way in all inputs. That doesn't mean it will work for all valid input, however.
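As a rough illustration of that point, here is a minimal sketch (the sample word and the helper name byte_and_char_lengths are my own, not taken from your logs). It compares the same word as undecoded UTF-8 bytes, as Latin-1 bytes, and as decoded character strings, using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return (length in bytes, length in characters)
# for a string that holds undecoded UTF-8 bytes.
sub byte_and_char_lengths {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return (length $bytes, length $chars);
}

# "loschen" with an o-umlaut, once as UTF-8 bytes and once as Latin-1 bytes.
my $utf8_bytes   = "l\xC3\xB6schen";
my $latin1_bytes = "l\xF6schen";

# Identical bytes compare equal even though nothing has been decoded.
print "raw vs. raw: ",
    ($utf8_bytes eq "l\xC3\xB6schen" ? "match" : "no match"), "\n";

# The byte count and the character count disagree, though.
my ($nbytes, $nchars) = byte_and_char_lengths($utf8_bytes);
print "bytes=$nbytes chars=$nchars\n";

# Input in a different encoding only matches once both sides are decoded.
print "raw mixed: ",
    ($utf8_bytes eq $latin1_bytes ? "match" : "no match"), "\n";
print "decoded mixed: ",
    (decode('UTF-8', $utf8_bytes) eq decode('ISO-8859-1', $latin1_bytes)
        ? "match" : "no match"), "\n";
```

So a comparison that passes on raw bytes says nothing about whether the data was read with the right encoding.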

What I can't do is generate readable cross-platform output to show my clients.

Nothing necessarily surprising there. It takes quite a bit of tuning your brain to get the code right. (I speak from experience with Japanese encodings. ;)

Even opening the output in BBEdit as UTF8 doesn't convert the codes into properly rendered extended characters, and by the time it gets into Excel on their Windows workstation, all hope is pretty much gone.

BBEdit, IIRC, handles some of the traditional encodings fairly well. (Does quite well with the Japanese encodings, at any rate.) So if you are opening UTF-8 and it isn't looking right, your output is probably not UTF-8. If you check the options in the file opening dialogs, you may find a way to convert from the actual encoding you're writing out. And/or you should be able to adjust your perl, but we can't help you with that unless we see some code and have some idea what encoding/language/locale you're trying to write out.
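Without seeing your code I can only guess, but one common cause is printing character strings to a filehandle that has no encoding layer. A minimal sketch of making the output encoding explicit follows; write_utf8 is a made-up helper name and the sample strings are assumptions. It writes to an in-memory buffer so the resulting bytes can be inspected, but a filename can go in the same place.

```perl
use strict;
use warnings;

# Hypothetical helper: write character strings to $target as UTF-8.
# $target may be a filename or a reference to a scalar (in-memory file).
sub write_utf8 {
    my ($target, @lines) = @_;
    open my $out, '>:encoding(UTF-8)', $target
        or die "Cannot open output: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close output: $!";
    return;
}

# Character strings written with \x{...} escapes so this source stays ASCII.
my $buffer = '';
write_utf8(\$buffer, "s\x{131}emens");

# The buffer now holds encoded bytes: the dotless i takes two of them.
print "bytes written: ", length($buffer), "\n";
```

For output that goes to STDOUT instead of a file, binmode(STDOUT, ':encoding(UTF-8)') applies the same layer.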

Incidentally, in many of the traditional encodings, the basic Latin will be in the same positions (same code points) as UTF-8 Unicode basic Latin.

The stuff that looks like HTML entities is fine when viewed in a browser:

&#1575;&#1604;&#1578;&#1593;&#1575;&#1585;&#1601;
s&#305;emens

And if necessary, I can deliver in HTML.
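If you want those numeric character references as real characters rather than leaving them to a browser, something like the following sketch might do. The helper name decode_numeric_entities is made up, and it only handles numeric references (decimal and hex), not named ones like &amp;.

```perl
use strict;
use warnings;

# Hypothetical helper: turn &#NNN; and &#xHHH; references into characters.
# Named entities are deliberately left untouched.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    return $text;
}

# Encode on the way out so the wide characters print cleanly.
binmode(STDOUT, ':encoding(UTF-8)');

my $word = decode_numeric_entities('s&#305;emens');
print "$word (", length($word), " characters)\n";
```

The result is a Perl character string, so it still has to be encoded explicitly when written out.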

But my logs have characters like this in them:

(from BBEdit as UTF8:)
ˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆâ ˆ‚ˆáˆ°ˆüˆì ˆ¶ˆèˆ¨ ˆáˆîˆ¶ˆùˆâ
atualiza§£o
carreo

(from BBEdit as Mac Roman)
É íáßÓ  Ô¯É
atualizaˆÉ¬ßˆÉ¬£o
torunn tÃ¸mmervold
lÃ¶schen

I can tell they mean something, but I can't figure out how to make them readable. Help?

TIA,

Avi
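A pattern like an A-tilde followed by one other odd character where a single accented letter should be is what UTF-8 bytes tend to look like when something reads them as Latin-1. Assuming that is what happened (a guess, since I haven't seen the raw bytes), the damage can sometimes be undone by mapping the characters back to bytes and decoding those bytes as UTF-8. A minimal sketch, with a made-up helper name:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: undo "UTF-8 bytes misread as Latin-1".
# Assumes every character in $garbled is in the Latin-1 range.
sub undo_latin1_misread {
    my ($garbled) = @_;
    my $bytes = encode('ISO-8859-1', $garbled);  # characters back to bytes
    return decode('UTF-8', $bytes);              # bytes read as UTF-8
}

# Encode on the way out so the repaired text prints as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# A-tilde (U+00C3) plus cedilla (U+00B8) stand in for one o-slash.
my $fixed = undo_latin1_misread("t\x{C3}\x{B8}mmervold");
print "$fixed (", length($fixed), " characters)\n";
```

This only helps for that one kind of damage. Text that passed through Mac Roman or a Windows code page would need that encoding's name in place of ISO-8859-1, and bytes that were dropped along the way cannot be recovered.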
