Hi,

in the course of producing shapefiles, I ran our tag values through the libxml2 built-in character set conversion from UTF-8 to Latin-1, and it complained about a lot of them (about 20k nodes/ways).
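
A complaint during a UTF-8 to Latin-1 conversion can have two different causes: the input is not well-formed UTF-8, or it is well-formed but contains characters that Latin-1 cannot represent. Here is a minimal Python sketch to tell the two apart, assuming the raw tag values are available as byte strings. The function name classify_tag_value and the sample values are made up for illustration, and it uses Python's own codecs rather than libxml2, so it may not flag exactly the same objects:

```python
def classify_tag_value(raw: bytes) -> str:
    """Classify one raw tag value as 'invalid-utf8', 'not-latin1', or 'ok'."""
    try:
        # Strict decoding rejects malformed or truncated UTF-8 sequences.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"
    try:
        # Latin-1 only covers code points U+0000 through U+00FF.
        text.encode("latin-1")
    except UnicodeEncodeError:
        return "not-latin1"
    return "ok"


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the OSM database.
    samples = [
        b"Stra\xc3\x9fe",        # valid UTF-8, representable in Latin-1
        b"Stra\xdfe",            # raw Latin-1 byte, not valid UTF-8
        "Łódź".encode("utf-8"),  # valid UTF-8, but outside Latin-1
    ]
    for raw in samples:
        print(repr(raw), classify_tag_value(raw))
```

Only the "invalid-utf8" cases would point to bad data in the database; the "not-latin1" cases are valid data that simply cannot be written to a Latin-1 shapefile.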
I cross-checked some of these by downloading the data directly from the API and running iconv on it. (The data I used for the conversion has been through Osmosis a number of times, and I wanted to make sure this isn't an Osmosis bug.) Some went through OK, but many do indeed seem to be wrong.

Can anybody tell me something about libxml2 character set conversion - is it considered buggy?

And about the UTF-8 bugs in the database: are they real? For example:

http://www.openstreetmap.org/api/0.5/way/8138279

doesn't seem to be proper UTF-8 to me, but maybe I'm wrong. If it really isn't proper UTF-8, then why don't more of our tools choke on it? Is everybody "sanitizing" in one form or another? Should we generate some statistics about UTF-8 problems and try to purge them?

Here's a list of objects that libxml2 complained about (not complete, as I didn't process a full planet):

http://www.remote.org/frederik/tmp/utf8.txt

Bye
Frederik

--
Frederik Ramm ## eMail [EMAIL PROTECTED] ## N49°00.09' E008°23.33'