Re: urllib2 spinning CPU on read

2006-11-28 Thread kdotsky

 I didn't try looking at your example, but I think it's likely a bug
 both in that site's HTTP server and in httplib.  If it's the same one
 I saw, it's already reported, but nobody has fixed it yet.

 http://python.org/sf/1411097


 John

Thanks.  I tried the example in the link you gave, and it appears to be
the same behavior.

Do you have any suggestions on how I could avoid this in the meantime?
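In the meantime the only workarounds I can think of are blunt ones. If individual read(n) calls do return, capping the total number of bytes read avoids an unbounded read(); if the spin is inside a single read call (as it is in that bug), the fetch really has to run in a child process you can kill. A sketch of the capping helper (the function name and limits are mine, not anything from urllib2):

```python
def read_capped(response, max_bytes=5 * 1024 * 1024, chunk_size=8192):
    """Read at most max_bytes from a file-like HTTP response.

    This only helps when each read(n) call returns; if the hang is
    inside a single read (as in the chunked-encoding bug above),
    run the whole fetch in a child process with a hard timeout.
    """
    parts = []
    remaining = max_bytes
    while remaining > 0:
        block = response.read(min(chunk_size, remaining))
        if not block:
            break
        parts.append(block)
        remaining -= len(block)
    return ''.join(parts)
```

Usage would be `data = read_capped(opener.open(request))` in place of the bare `result.read()`.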

-- 
http://mail.python.org/mailman/listinfo/python-list


urllib2 spinning CPU on read

2006-11-26 Thread kdotsky
Hello All,
I've run into this problem on several sites where urllib2 will hang
(using all the CPU) trying to read a page.  I was able to reproduce it
for one particular site.  I'm using Python 2.4.

import urllib2
url = 'http://www.wautomas.info'
request = urllib2.Request(url)
opener = urllib2.build_opener()
result = opener.open(request)
data = result.read()

It never returns from this read call.

I did some profiling to try and see what was going on and make sure it
wasn't my code.  There was a huge number of calls to (and amount of
time spent in) socket.py:315(readline) and to recv.  A large amount of
time was also spent in httplib.py:482(_read_chunked).  Here's the
significant part of the statistics:

 32564841 function calls (32563582 primitive calls) in 545.250 CPU seconds

   Ordered by: internal time
   List reduced from 416 to 50 due to restriction 50

   ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
 10844775  233.920    0.000  447.440    0.000  socket.py:315(readline)
 10846078  152.430    0.000  152.430    0.000  :0(recv)
        3   97.330   32.443  544.730  181.577  httplib.py:482(_read_chunked)
 10844812   61.090    0.000   61.090    0.000  :0(join)


Also, where should I go to see if something like this has already been
reported as a bug?

Thanks for any help you can give me.
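One stopgap that's cheap to try (a sketch, not a fix): set a default socket timeout, so any socket urllib2/httplib opens gives up on a blocked recv. Note this only bounds blocking reads; it will not interrupt a pure CPU spin inside _read_chunked, so it may not help with this particular bug.

```python
import socket

# Applies to every socket created afterwards, including the ones
# urllib2/httplib open internally.  setdefaulttimeout was added in
# Python 2.3, so it is available on 2.4.
socket.setdefaulttimeout(60)
```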



don't need dictionary's keys - hash table?

2006-07-12 Thread kdotsky
Hello,
I am using some very large dictionaries with keys that are long strings
(urls).  For a large dictionary these keys start to take up a
significant amount of memory.  I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory.  Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table.  If I
do it this way am I going to get the memory savings I am after?  Will
the hash function always generate unique keys?  Also, would the same
technique work for a set?

Any other thoughts or considerations are appreciated.

Thank You.
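To get a rough sense of whether digests would actually save memory (a sketch; exact sizes vary by Python version and platform), compare a long url string with its fixed 16-byte md5 digest:

```python
import hashlib  # new in Python 2.5; the older md5 module works too
import sys      # sys.getsizeof is new in Python 2.6

url = 'http://example.com/some/long/path?' + 'x' * 200
digest = hashlib.md5(url.encode('utf-8')).digest()

print(sys.getsizeof(url))     # grows with the url length
print(sys.getsizeof(digest))  # fixed: 16 payload bytes plus overhead
```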



Re: don't need dictionary's keys - hash table?

2006-07-12 Thread kdotsky
[EMAIL PROTECTED] wrote:
 Hello,
 I am using some very large dictionaries with keys that are long strings
 (urls).  For a large dictionary these keys start to take up a
 significant amount of memory.  I do not need access to these keys -- I
 only need to be able to retrieve the value associated with a certain
 key, so I do not want to have the keys stored in memory.  Could I just
 hash() the url strings first and use the resulting integer as the key?
 I think what I'm after here is more like a traditional hash table.  If I
 do it this way am I going to get the memory savings I am after?  Will
 the hash function always generate unique keys?  Also, would the same
 technique work for a set?


I just realized that of course the hash is not always going to be
unique, so this wouldn't really work.  And it seems a hash table would
still need to store the keys (as strings) so that string comparisons
can be done when a collision occurs.  I guess there's no avoiding
storing the keys?
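If a rare collision is acceptable, one way to avoid storing the full keys (a sketch; the class name is mine) is to key the dict on a fixed-size digest rather than on Python's hash():

```python
import hashlib  # new in Python 2.5; the md5 module works on 2.4


class DigestDict(object):
    """Dict keyed by the 16-byte md5 digest of a long string key.

    Saves memory on long keys, but a digest collision silently
    merges two keys, so use only where that is tolerable; the
    original keys are unrecoverable afterwards.
    """

    def __init__(self):
        self._data = {}

    def _digest(self, key):
        return hashlib.md5(key.encode('utf-8')).digest()

    def __setitem__(self, key, value):
        self._data[self._digest(key)] = value

    def __getitem__(self, key):
        return self._data[self._digest(key)]

    def __contains__(self, key):
        return self._digest(key) in self._data
```

The same trick covers the set case: keep a plain set of digests instead of a set of urls.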



Re: don't need dictionary's keys - hash table?

2006-07-12 Thread kdotsky
 depending on your application, a bloom filter might be good enough:

  http://en.wikipedia.org/wiki/Bloom_filter


Thanks (everyone) for the comments.  I like the idea of the bloom
filter or using an md5 hash, since a rare collision will not be a
show-stopper in my case.
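For the membership-only (set) case, a minimal Bloom filter can be sketched in a few lines (the sizes and the md5-derived hash positions here are illustrative, not tuned for any particular workload):

```python
import hashlib


class BloomFilter(object):
    """Minimal Bloom filter: k bit positions derived from salted md5.

    Never gives false negatives; the false-positive rate depends on
    num_bits, num_hashes, and how many keys are added.  bytearray is
    Python 2.6+; on older versions array('B', ...) would do.
    """

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes independent positions by salting md5.
        for i in range(self.num_hashes):
            h = hashlib.md5(('%d:%s' % (i, key)).encode('utf-8')).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```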
