Oh this is a messy problem. Let me guess that you are using <@URL> (or CMD wget or some such) to retrieve a page via HTTP GET.

The character encoding is supposed to be specified by the server (in the HTTP Content-Type header), but sometimes it's only in a <meta> tag. And sometimes it's specified but wrong: the page looks correct to its author, but not necessarily when rendered or parsed by a remote client. So that's the first problem: what's the encoding (Latin-1, UTF-8, etc.), and how do you map it to your own character set? Sometimes there is no equivalent character.
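To make the "header first, then <meta>, then guess" order concrete, here is a rough sketch in Python (not Witango; I'm only using it to show the logic). The function names sniff_charset and decode_page are made up for this example, and the Latin-1 fallback is an assumption on my part, not a rule.

```python
import re

def sniff_charset(content_type, body, default="latin-1"):
    """Return the charset named in the header, else in a <meta> tag, else a default."""
    # 1. The HTTP Content-Type header, e.g. "text/html; charset=UTF-8".
    match = re.search(r"charset=[\"']?([\w.:-]+)", content_type or "", re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # 2. A <meta> declaration near the top of the page (body is raw bytes).
    match = re.search(rb"<meta[^>]*charset=[\"']?([\w.:-]+)", body[:2048], re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared: fall back to an assumed default.
    return default

def decode_page(content_type, body):
    """Decode raw page bytes to text, tolerating a missing or wrong declaration."""
    charset = sniff_charset(content_type, body)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset. Latin-1 maps every byte to a character,
        # so this never fails, but the result may be wrong for other encodings.
        return body.decode("latin-1")
```

This only handles the simple cases. A page that declares one encoding and is actually in another that happens to decode without error will still come out garbled.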

And then there's the whole problem of trying to parse this. Some parsers assume every byte is a character, but we now have multi-byte encodings as well. I'm thinking of some Perl scripts here, but I actually don't know how Witango's string handling will deal with multi-byte characters.
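Here is a small Python illustration of why "one byte = one character" breaks down. It says nothing about how Witango behaves; safe_prefix is a hypothetical helper, and the name "Müller" is just sample data.

```python
name = "M\u00fcller"            # "Müller", with u-umlaut written as an escape
raw = name.encode("utf-8")      # the bytes as they would arrive over HTTP

byte_count = len(raw)                    # counts bytes
char_count = len(raw.decode("utf-8"))    # counts characters

# Slicing the raw bytes can cut a multi-byte character in half.
try:
    raw[:2].decode("utf-8")
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False

def safe_prefix(data, n_chars, encoding="utf-8"):
    """Return the first n_chars characters by decoding first, then slicing."""
    return data.decode(encoding)[:n_chars]
```

The byte count and the character count differ here because the umlaut takes two bytes in UTF-8, which is exactly what a byte-oriented parser gets wrong.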

But assuming you can identify a byte sequence containing the desired data, maybe you can always convert it to something like UTF-8 so it can be stored as XML CDATA? At least you would have a consistent representation.
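As a sketch of that idea (again in Python, with function names I invented for the example): the two bytes 195 and 188 from the original question are the UTF-8 encoding of u-umlaut, so decoding them as UTF-8 gives the single character 252, while reading them as Latin-1 gives two characters. The strip_diacritics fallback assumes Unicode NFKD decomposition is acceptable; it will not help with letters that have no decomposition, such as the German sharp s.

```python
import unicodedata

def to_utf8(data, source_encoding):
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

def to_char_refs(text):
    """Return ASCII-only text with numeric character references (e.g. &#252;)."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def strip_diacritics(text):
    """Worst-case fallback: drop combining marks (u-umlaut becomes plain u)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

received = bytes([195, 188])             # what the remote server sent
as_utf8 = received.decode("utf-8")       # one character, code point 252
as_latin1 = received.decode("latin-1")   # two characters, code points 195 and 188
```

So to_char_refs(as_utf8) gives the &#252; you expected, and to_char_refs(as_latin1) reproduces the &#195;&#188; you are seeing, which suggests the remote data is UTF-8 being read as a single-byte encoding.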


bill


On Thursday, September 8, 2005, at 06:03  AM, Dale Graham wrote:

We're collecting data from a remote website. Author names from this website occasionally come in with umlauts, diacriticals and the like. We'd like, if at all possible, to preserve this data or, at worst, make a reasonable conversion (e.g. umlauted u to u), but I'm having trouble figuring out how to do this, since the character set I am receiving from the remote server doesn't match the character set on my Witango server (Mac OS X).

That is, an umlaut on my setup would be &#252; but is coming through from the remote server as &#195;&#188;

And would that data be different if the person receiving it was on a Windows or *nix browser instead of a Mac browser? (To add to the level of complexity!)

How do the experts out there handle this?

I tried to search the archives, but seemed to be lacking the magic keywords to find anything I could use.
________________________________________________________________________
TO UNSUBSCRIBE: Go to http://www.witango.com/developer/maillist.taf

