On 2019-06-28, Chris Angelico <ros...@gmail.com> wrote:
> On Sat, Jun 29, 2019 at 6:31 AM Tobiah <t...@tobiah.org> wrote:
>> A guy comes in and enters his last name as RÖnngren.
>>
>> So what did the browser really give me; is it encoded
>> in some way, like latin-1? Does it depend on whether
>> the name was cut and pasted from a Word doc. etc?
>> Should I handle these internally as unicode? Right
>> now my database tables are latin-1 and things seem
>> to usually work, but not always.
>
> Definitely handle them as Unicode. You'll receive them in some
> encoding, probably UTF-8, and it depends on the browser.

You can basically assume it is the encoding that the page the form
was on was using - which is a good reason to always explicitly
specify UTF-8 encoding on HTML pages.

>> Also, what do people do when searching for a record.
>> Is there some way to get 'Ronngren' to match the other
>> possible foreign spellings?
>
> Ehh....... probably not. That's a human problem, not a programming
> one. Best of luck.

And yet there are many programs which attempt to solve it. The Python
module 'unidecode' will make a decent stab at it if the language is
vaguely European. Certainly, storing the UTF-8 string and also the
'unidecoded' ASCII string, and searching on both, is unlikely to hurt
and will often help. Additionally using Metaphone or a similar
phonetic algorithm will probably also help.
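
To make the "store both and search on both" idea concrete, here is a rough
sketch. It does not use unidecode itself; it swaps in a stdlib-only
approximation (NFKD decomposition, then dropping combining marks) so that it
runs anywhere. The function names ascii_fold, search_keys and matches are
made up for illustration.

```python
import unicodedata


def ascii_fold(text):
    """Stdlib stand-in for unidecode (an assumption, not unidecode's API).

    Decomposes accented letters, removes the combining marks, and then
    drops any character that still is not ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.encode("ascii", "ignore").decode("ascii")


def search_keys(name):
    """Keys to store or index: the original spelling and its ASCII fold."""
    return {name.casefold(), ascii_fold(name).casefold()}


def matches(query, stored_name):
    """True if the query and the stored name share at least one key."""
    return bool(search_keys(query) & search_keys(stored_name))


print(ascii_fold("RÖnngren"))           # ROnngren
print(matches("Ronngren", "RÖnngren"))  # True
```

This only covers letters that decompose into a base letter plus an accent.
Something like "ø" or "ß" has no such decomposition and is simply dropped
here, which is where a real transliteration table such as unidecode's should
do better. In a database you would presumably keep the folded form in a
second indexed column rather than computing it per query.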