On Mon, Sep 7, 2009 at 5:03 AM, Tracy Reed <tr...@ultraviolet.org> wrote:
> I have a Django app which processes emails. It is often handed emails
> with unicode characters in them. My understanding is that Python and
> Django handle unicode just fine and somewhat transparently. I was,
> however, told that I need to set my database tables to UTF-8
> encoding. I have done this. Yet I still frequently get errors such as
> this when my app encounters unicode:
>

In some places here you are using the term 'unicode' where non-ASCII would be more correct. The emails your code is handed, for example, contain non-ASCII characters. These emails are not packaged as Python unicode strings (they cannot be, if they are coming from outside Python); they are bytestrings. In order to successfully turn them into unicode objects, the correct encoding of the email bytestring must be known. The exception you include below shows Django attempting to convert an email bytestring into a unicode object, assuming the bytestring is utf-8 encoded. This is failing, so apparently the email bytestring is using some other encoding.

Python/Django handling unicode "transparently" is a bit of an optimistic hope. Python has unicode support, and what Django attempts to do is take bytestrings at boundary points and convert to and from unicode objects, so that your application code never has to deal with bytestrings but rather always has unicode strings. So Django will take bytestrings from the database and bytestrings from web clients and convert them to unicode before handing them to your application code. Similarly, it will accept unicode from your application and convert it to bytestrings for sending outside the boundary (back to the DB or out as a client response).

Django does not require, however, that your application only use unicode strings -- you are free to hand Django functions bytestrings. When given a bytestring, though, Django has to make some assumption about what encoding the bytestring is using.
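To make the point about assumed encodings concrete, here is a minimal sketch. It is written in Python 3 syntax, where a bytestring is `bytes` and a unicode object is `str`; on Python 2 the same `.decode()` calls apply to `str` and return `unicode`. The sample bytes and the guess of Windows-1252 (`cp1252`) are my assumptions for illustration only; nothing in your message tells me which encoding your emails really use.

```python
# A bytestring carries no record of its encoding; the caller must supply one.
raw = b"It\x92s a part time job"  # hypothetical sample containing byte 0x92

# Assuming utf-8 fails: 0x92 is a continuation byte with no lead byte
# in front of it, so it is not valid utf-8 at this position.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assuming Windows-1252 succeeds: there 0x92 is U+2019, a right single quote.
text = raw.decode("cp1252")
```

The same bytes are an error under one assumption and ordinary text under another, which is why the encoding has to be known before decoding.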
The problem with bytestrings is that they do not carry around with them any encoding information. What Django does when handed a bytestring is assume it is utf-8 encoded. If your application hands Django a bytestring that is not utf-8 encoded, you'll get errors like the one you include below.

Someone else responding on this thread mentioned BeautifulSoup fixing problems like this. My understanding (I don't have time to verify at the moment) is that BeautifulSoup either detects the encoding by examining the bytes and guessing what the proper encoding may be, or tries different encodings until one works. Django does not do this -- it simply assumes that if your code hands in a bytestring, it is utf-8 encoded.

Thus if you are dealing with non-utf8 encoded bytestrings (as you apparently are), you will need to convert them to unicode before handing them to Django. That, of course, just pushes the problem back onto you, and you will now have to figure out what encoding these things are using. Perhaps someone on this list can help with that, but you haven't provided enough information to really help here. All you have said is that your app is "handed emails". All I can tell you about those emails, based on the traceback below, is that they are bytestrings and they are not utf-8 encoded. If you show some of the code that is receiving the emails, perhaps someone can provide more guidance on how to transform the email bytestrings into unicode.
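Without seeing your code I can only guess, but one common approach is to let the standard library `email` package tell you which charset a message declares, and decode with that before anything reaches Django. The sketch below uses Python 3 syntax. The function name `message_text`, the choice to return only the first text/plain part, and the `cp1252` fallback are all hypothetical choices of mine, not something your app or Django prescribes.

```python
import email


def message_text(raw_bytes, fallback="cp1252"):
    """Return the first text/plain part of an email as a unicode string.

    raw_bytes: the undecoded message as received (hypothetical input).
    fallback:  encoding to try when the declared charset is missing,
               unknown, or wrong; cp1252 is only a guess.
    """
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64/quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Declared charset was wrong or unknown: use the fallback and
            # substitute U+FFFD for any byte it cannot map.
            return payload.decode(fallback, errors="replace")
    return ""
```

A view would then pass the returned unicode string to the template instead of the raw bytestring. Note that spam in particular often declares one charset and uses another, so the fallback path is likely to be exercised.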
> Traceback (most recent call last):
>
>   File "/usr/lib/python2.4/site-packages/django/core/handlers/base.py", line 92, in get_response
>     response = callback(request, *callback_args, **callback_kwargs)
>
>   File "/var/spool/filter/email_archive/store_emails/views.py", line 84, in mail_detail
>     return render_to_response('mail_detail.html', {'mail': ourmail,
>   File "/usr/lib/python2.4/site-packages/django/shortcuts/__init__.py", line 20, in render_to_response
>     return HttpResponse(loader.render_to_string(*args, **kwargs), **httpresponse_kwargs)
>
> [snip bunches of template context traceback]
>   File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 831, in render
>     return _render_value_in_context(output, context)
>
>   File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 811, in _render_value_in_context
>     value = force_unicode(value)
>
>   File "/usr/lib/python2.4/site-packages/django/utils/encoding.py", line 92, in force_unicode
>     raise DjangoUnicodeDecodeError(s, *e.args)
>
> DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in
> position 1468: unexpected code byte. You passed in "\nGood
> Day,\n\n\n\nWe offer a part time job on your computer.
>
> <text of spam containing unicode deleted>
>
> There is a 0x92 in position 1468 just as the error says.
>
> Do I need to be doing a .encode('utf-8') before putting anything into
> the db? I cannot seem to get a clear answer on this. Some say no, some
> say yes. Do I need to do any decoding or anything on data pulled out
> of the db? I have been told that MySQL should be handling all of this
> for me.
>

Note that the boundary your traceback is dealing with here is not the database; the traceback shows Django trying to render something in a template for a response. Whatever code path you are following here involves your template trying to render a non-utf8 bytestring.
It's running into trouble because Django is attempting to convert the bytestring to unicode assuming utf-8 encoding. You are not dealing with the database boundary here.

But to answer the database question: no, you do not have to encode/decode anything at the database boundary. Django handles that for you. The one exception is a binary collation on MySQL, in which case Django is not able to do the bytestring/unicode conversion. See:

http://docs.djangoproject.com/en/dev/ref/databases/#collation-settings

> I have been banging my head on this particular error off and on for a
> couple of weeks and cannot seem to find the solution.
>
> Any pointers appreciated.
>

For further help you will need to give some more information about how your code is getting these emails.

Karen