On Mon, Sep 7, 2009 at 5:03 AM, Tracy Reed <tr...@ultraviolet.org> wrote:

> I have a Django app which processes emails. It is often handed emails
> with unicode characters in them. My understanding is that Python and
> Django handle unicode just fine and somewhat transparently. I was,
> however, told that I need to set my database tables to UTF-8
> encoding. I have done this. Yet I still frequently get errors such as
> this when my app encounters unicode:
>
>
In some places here you are using the term 'unicode' where non-ASCII would
be more correct.  The emails your code is handed, for example, contain
non-ASCII characters.  These emails are not packaged as Python unicode
strings (they cannot be, if they are coming from outside Python), they are
bytestrings.  In order to successfully turn them into unicode objects the
correct encoding of the email bytestring must be known.  The exception you
include below shows Django attempting to convert an email bytestring into a
unicode object, assuming the bytestring is utf-8 encoded.  This is failing,
so apparently the email bytestring is using some other encoding.
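
To make that failure concrete, here is a minimal sketch. The sample bytes are a made-up stand-in for a fragment of the email body, and cp1252 (Windows-1252) is only my guess at the sender's encoding; 0x92 happens to be the right single quotation mark there, but nothing in the traceback proves that is what the email uses. The b prefix needs Python 2.6 or later; on the 2.4 in your traceback, a plain string literal is already a bytestring.

```python
# Hypothetical sample: a fragment containing the byte named in the traceback.
raw = b"It\x92s a part time job"

# 0x92 cannot start a character in utf-8, so this decode fails, much as
# force_unicode does in the traceback.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the sender used Windows-1252, where 0x92 is U+2019.
text = raw.decode("cp1252")
```

If the guess is right, text is a unicode string that Django can render without complaint.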

Python/Django handling unicode "transparently" is a bit of an optimistic
hope.  Python has unicode support, and what Django attempts to do is take
bytestrings at boundary points and convert to and from unicode objects so
that your application code never has to deal with bytestrings but rather
always has unicode strings.  So Django will take bytestrings from the
database and from web clients and convert them to unicode before
handing them to your application code.  Similarly it will accept unicode
from your application and convert to bytestrings for sending outside the
boundary (back to the DB or out as a client response).
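
The general pattern, decode on the way in and encode on the way out, looks roughly like the following. This is a plain-Python illustration of the idea, not Django's actual code; the function name and the sample bytes are made up.

```python
def handle(raw_in, encoding="utf-8"):
    # Boundary in: bytes from outside become a unicode string.
    text = raw_in.decode(encoding)
    # Application code works on characters, not bytes.
    result = text.upper()
    # Boundary out: unicode becomes bytes again for the DB or the client.
    return result.encode(encoding)

incoming = b"caf\xc3\xa9"            # utf-8 bytes: 5 bytes, 4 characters
n_bytes = len(incoming)
n_chars = len(incoming.decode("utf-8"))
outgoing = handle(incoming)
```

The length difference is the reason application code is better off with unicode: slicing or counting on the bytestring would split the two-byte character.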

Django does not require, however, that your application only use unicode
strings -- you are free to hand Django functions bytestrings.  When given a
bytestring, though, Django has to make some assumption about what encoding
the bytestring is using.  The problem with bytestrings is that they do not
carry around with them any encoding information.  What Django does when
handed a bytestring is assume it is utf-8 encoded.  If your application hands
Django a bytestring that is not utf-8 encoded, you'll get errors like the
one you include below.

Someone else responding on this thread mentioned BeautifulSoup fixing
problems like this.  My understanding (I don't have time to verify at the
moment) is that BeautifulSoup either detects the encoding by examining the
bytes and guessing what the proper encoding may be, or tries different
encodings until one works.  Django does not do this -- if your code hands in
a bytestring, it simply assumes it is utf-8 encoded.  Thus if you have non-utf8 encoded
bytestrings you are dealing with (as you apparently do) you will need to
convert them to unicode before handing them to Django.
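
If you do not know the encoding and have to guess, a "try encodings until one works" helper could be sketched as below. This is not BeautifulSoup's algorithm (which I have not checked) and not anything Django provides; the function name and the candidate list are assumptions you would need to adjust for your mail. Be aware that a guess can succeed and still produce the wrong characters.

```python
def to_text(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate encoding that works."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte to a code point, so this last
    # resort never raises, though the result may not be what was meant.
    return raw.decode("latin-1")
```

Your view would then pass to_text(body) to the template rather than the raw bytestring.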

That, of course, just pushes the problem back onto you, and you will now
have to figure out what encoding these things are using.  Perhaps someone on
this list can help with that, but you haven't provided enough information to
really help here.  All you have said is that your app is "handed emails".
All I can tell you about those emails, based on the traceback below, is that
they are bytestrings and they are not utf-8 encoded.  If you show some of
your code that is receiving the emails perhaps someone can provide more
guidance on how to transform the email bytestrings into unicode.
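
One thing worth checking in the meantime: if what you are handed are complete MIME messages, the Content-Type header usually declares the charset, and Python's email package can read it. The sketch below assumes exactly that. The sample message, the function name and the utf-8 default are all made up for illustration, and it is written for Python 3 (email.message_from_bytes); on the Python 2.4 in your traceback the counterpart is email.message_from_string.

```python
import email

# Hypothetical sample message declaring its charset in Content-Type.
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="windows-1252"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"It\x92s a part time job.\r\n"
)

def body_as_text(raw_message, default_charset="utf-8"):
    """Return the first text/plain body of a message as a unicode string."""
    msg = email.message_from_bytes(raw_message)
    part = msg
    if msg.is_multipart():
        for candidate in msg.walk():
            if candidate.get_content_type() == "text/plain":
                part = candidate
                break
    # decode=True undoes base64/quoted-printable and returns bytes.
    payload = part.get_payload(decode=True)
    if payload is None:
        return u""
    charset = part.get_content_charset() or default_charset
    try:
        return payload.decode(charset, "replace")
    except LookupError:
        # The declared charset is not one Python knows about.
        return payload.decode(default_charset, "replace")
```

Senders, and spammers in particular, do not always declare the charset correctly, so treat the header as a strong hint rather than a guarantee.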



>  Traceback (most recent call last):
>
>   File "/usr/lib/python2.4/site-packages/django/core/handlers/base.py",
> line 92, in get_response
>     response = callback(request, *callback_args, **callback_kwargs)
>
>   File "/var/spool/filter/email_archive/store_emails/views.py", line 84, in
> mail_detail
>     return render_to_response('mail_detail.html', {'mail': ourmail,
>
>   File "/usr/lib/python2.4/site-packages/django/shortcuts/__init__.py", line
> 20, in render_to_response
>     return HttpResponse(loader.render_to_string(*args, **kwargs),
> **httpresponse_kwargs)
>
>   [snip bunches of template context traceback]
>   File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line
> 831, in render
>     return _render_value_in_context(output, context)
>
>   File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line
> 811, in _render_value_in_context
>     value = force_unicode(value)
>
>   File "/usr/lib/python2.4/site-packages/django/utils/encoding.py", line
> 92, in force_unicode
>     raise DjangoUnicodeDecodeError(s, *e.args)
>
>  DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in
>  position 1468: unexpected code byte. You passed in "\nGood
>  Day,\n\n\n\nWe offer a part time job on your computer.
>
> <text of spam containing unicode deleted>
>
> There is a 0x92 in position 1468 just as the error says.
>
> Do I need to be doing a .encode('utf-8') before putting anything into
> the db? I cannot seem to get a clear answer on this. Some say no, some
> say yes. Do I need to do any decoding or anything on data pulled out
> of the db? I have been told that MySQL should be handling all of this
> for me.
>
>
Note that the boundary your traceback is dealing with here is not the
database: the traceback shows Django trying to render something in a template
for a response.  Whatever code path you are following here involves your template
trying to render a non-utf8 bytestring.  It's running into trouble because
Django is attempting to convert the bytestring to unicode assuming utf-8
encoding.  You are not dealing with the database boundary here.

But to answer the database question: no, you do not have to encode/decode
anything at the database boundary.  Django handles that for you.  The only
exception is MySQL with a binary collation, in which case Django
is not able to do the bytestring/unicode conversion.  See:
http://docs.djangoproject.com/en/dev/ref/databases/#collation-settings.


> I have been banging my head on this particular error off and on for a
> couple of weeks and cannot seem to find the solution.
>
> Any pointers appreciated.
>
>
For further help you will need to give some more information about how your
code is getting these emails.

Karen
