Re: Ongoing UnicodeDecodeError's with web crawlers and file caching

2008-09-16 Thread Malcolm Tredinnick


On Fri, 2008-09-12 at 04:44 -0700, Julien Phalip wrote:
> Hi,
> 
> I'm running a fairly large website (10,000 news items). Initially it
> was made in ASP with MSSQL, then I took the project over and ported it
> to PHP and MYSQL. Finally, 6 months ago I ported it to Django and
> MySQL.
> 
> Now, ever since the site has been running on Django, I've received
> about a dozen error emails every couple of days (once I even received
> 400 overnight!). Those errors are systematically caused by web
> crawlers (yahoo slurp, googlebot, msn, yeti, etc.). It systematically
> chokes on the same line of code, which is loading some data from file
> caching. The traceback is pretty much always as follows:
> 
>  File "/MYPATH/apps/news/templatetags/news_tags.py", line 13, in
> show_sidebar
>    cached_sidebar = cache.get('the_sidebar')
> 
>  File "/MYPATH/django/core/cache/backends/filebased.py", line 50, in
> get
>    return pickle.load(f)
> 
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position
> 5999: unexpected code byte
> 
> The actual utf-8 character varies each time.

You mean the byte, not the UTF-8 character, since the whole point is
that it isn't a UTF-8 encoding of anything.
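
To make the byte-versus-character distinction concrete, here is a small
sketch, assuming a current Python 3 interpreter (not the Python 2 of the
original setup). The helper name first_undecodable_byte is made up for
illustration; it only reports where decoding stops and which raw byte is
responsible.

```python
def first_undecodable_byte(data):
    """Return (position, byte value) of the first byte that stops UTF-8
    decoding, or None if the whole byte string decodes cleanly.

    Hypothetical helper for illustration; not part of Django or Python.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.object is the original byte string, exc.start the offset
        # of the first byte the codec could not interpret.
        return exc.start, exc.object[exc.start]
    return None


if __name__ == "__main__":
    # 0x8a is a continuation byte, so it cannot begin a UTF-8 sequence.
    result = first_undecodable_byte(b"abc\x8adef")
    if result is not None:
        print("byte 0x%02x at position %d is not valid UTF-8" % (result[1], result[0]))
```

The offending value is a single byte in the stream, which is why it can
differ from one error email to the next without corresponding to any
character.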

> 
> I spent a lot of time cleaning up, reorganising and improving the
> code... in vain. Now I strongly suspect it might be because of the
> data being corrupted in some way.

T
> 
> Now, what puzzles me is that all the URLs which fail with web
> crawlers, actually work perfectly well when I simply open them in a
> browser.
> 
> The code has always followed a recent trunk of Django and now runs on
> 1.0. I have already raised that issue in this mailing list a couple of
> times in the past, but I didn't get much help. I haven't opened a
> ticket because I cannot reproduce the error myself (it only happens
> with web crawlers) and because I suspect it might be because of my
> setup (none of my other sites that use file caching has this
> problem).

So if I were you I'd put some extra debugging into Django itself to try
and gather more information. In particular, since it's loading a
particular file at the time of the problem, you could log the file name
and probably copy the contents somewhere aside for later investigation
(or log the file contents as well).
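
A rough sketch of that kind of debugging, assuming Python 3 and only the
standard library. The function name load_with_debug, the logger name
"cache_debug" and the keep_dir argument are all placeholders invented
for this example; this is not the code in Django's filebased.py, only
the shape of what could be wrapped around the pickle.load(f) call there.

```python
import logging
import os
import pickle
import shutil

logger = logging.getLogger("cache_debug")  # hypothetical logger name


def load_with_debug(path, keep_dir):
    """Unpickle the file at `path`.

    If loading fails for any reason, log the file name with the
    traceback, keep a copy of the file in `keep_dir` for later
    investigation, and re-raise the original exception.
    """
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except Exception:
        os.makedirs(keep_dir, exist_ok=True)
        saved = os.path.join(keep_dir, os.path.basename(path) + ".bad")
        # Copy the file as it is right now, before the cache can
        # overwrite or delete it.
        shutil.copy2(path, saved)
        logger.exception("Could not unpickle %s (copy kept at %s)", path, saved)
        raise
```

Re-raising keeps the behaviour of the site unchanged, so the error
emails still arrive, but each one now has a matching file on disk that
can be inspected afterwards.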

I can't think of any reason why Django's caching code is going to cause
this problem, since it's (allegedly) pickling valid data and then
unpickling the same data and I trust Python's pickling process to work.
However, it's probably also relevant to know *what* you are pickling
here. What type of object is it? Where did the data for that object come
from?

Regards,
Malcolm




--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Ongoing UnicodeDecodeError's with web crawlers and file caching

2008-09-12 Thread Julien Phalip

Hi,

I'm running a fairly large website (10,000 news items). Initially it
was made in ASP with MSSQL, then I took the project over and ported it
to PHP and MYSQL. Finally, 6 months ago I ported it to Django and
MySQL.

Now, ever since the site has been running on Django, I've received
about a dozen error emails every couple of days (once I even received
400 overnight!). Those errors are systematically caused by web
crawlers (yahoo slurp, googlebot, msn, yeti, etc.). It systematically
chokes on the same line of code, which is loading some data from file
caching. The traceback is pretty much always as follows:

 File "/MYPATH/apps/news/templatetags/news_tags.py", line 13, in
show_sidebar
   cached_sidebar = cache.get('the_sidebar')

 File "/MYPATH/django/core/cache/backends/filebased.py", line 50, in
get
   return pickle.load(f)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position
5999: unexpected code byte

The actual utf-8 character varies each time.

I spent a lot of time cleaning up, reorganising and improving the
code... in vain. Now I strongly suspect it might be because of the
data being corrupted in some way.

Now, what puzzles me is that all the URLs which fail with web
crawlers, actually work perfectly well when I simply open them in a
browser.

The code has always followed a recent trunk of Django and now runs on
1.0. I have already raised that issue in this mailing list a couple of
times in the past, but I didn't get much help. I haven't opened a
ticket because I cannot reproduce the error myself (it only happens
with web crawlers) and because I suspect it might be because of my
setup (none of my other sites that use file caching has this
problem).

I should also mention that the site was originally running with
mod_python, then with mod_wsgi, and has even moved servers. After all
these changes the problem is still there, so it seems to be independent
of server configuration.

So, any hint to debug this would be very much appreciated. If you
think this is worth filing a ticket, I'd also appreciate any hint on
how to phrase this problem correctly.

Thanks a lot for your help.

Julien