Re: hash() yields different results for different platforms
> "Kerry, Richard" <[EMAIL PROTECTED]> (KR) wrote: >KR> The hash is not expected to be unique, it just provides a starting point >KR> for another search (usually linear ?). >KR> See http://en.wikipedia.org/wiki/Hash_function That only contains a definition of a hash function. I know what a hash function is. But the OP wanted to use the hash as a unique key. -- Piet van Oostrum <[EMAIL PROTECTED]> URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4] Private email: [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
On 2006-07-12, Qiangning Hong <[EMAIL PROTECTED]> wrote: > Grant Edwards wrote: >> On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: >> > However, when I come to Python's builtin hash() function, I >> > found it produces different values in my two computers! In a >> > pentium4, hash('a') -> -468864544; in a amd64, hash('a') -> >> > 12416037344. Does hash function depend on machine's word >> > length? >> >> Apparently. :) >> >> The low 32 bits match, so perhaps you should just use that >> portion of the returned hash? >> >> >>> hex(12416037344) >> '0x2E40DB1E0L' >> >>> hex(-468864544 & 0x) >> '0xE40DB1E0L' >> >> >>> hex(12416037344 & 0x) >> '0xE40DB1E0L' >> >>> hex(-468864544 & 0x) >> '0xE40DB1E0L' > > Is this relationship (same low 32 bits) guaranteed? No, I don't believe so. > Will it change in the future version? It may. -- Grant Edwards grante Yow! Is this an out-take at from the "BRADY BUNCH"? visi.com -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
On 2006-07-12, Carl Banks <[EMAIL PROTECTED]> wrote: > Grant Edwards wrote: >> On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: >> >> > I'm writing a spider. I have millions of urls in a table (mysql) to >> > check if a url has already been fetched. To check fast, I am >> > considering to add a "hash" column in the table, make it a unique key, >> > and use the following sql statement: >> > insert ignore into urls (url, hash) values (newurl, hash_of_newurl) >> > to add new url. >> > >> > I believe this will be faster than making the "url" column unique key >> > and doing string comparation. Right? >> >> I doubt it will be significantly faster. Comparing two strings >> and hashing a string are both O(N). > > Playing Devil's Advocate: The hash would be a one-time operation during > database insertion, whereas string comparison would happen every > search. Good point. > Conceivably, comparing hash strings (which is O(1)) could > result in a big savings compared to comparing regular strings; Still, I doubt that the URLs are long enough so that there's a significant difference. > but I expect most decent sql implementations already hash data > internally, so rolling your own hash would be useless at best. Precisely. DB designers and implementers have been working on this problem for 30 years. I doubt the OP is going to be able to best them with a few minutes work. > If the OP's database is lacking, md5 is probably fine. Perhaps > using a subset of the md5 (the low 32 bits, say) could speed > up comparisons at risk of more collisions. Probably a good > trade off unless the DB is humungous. My advice: do it the simple way first (let the DB handle it). Don't try to fix a problem until you know it exists. Premature optimization -- Grant Edwards grante Yow! It's strange, but I'm at only TRULY ALIVE when I'm visi.comcovered in POLKA DOTS and TACO SAUCE... -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
"Kerry, Richard" <[EMAIL PROTECTED]> writes: > The hash is not expected to be unique, it just provides a starting point > for another search (usually linear ?). The database is good at organizing indexes and searching in them. Why not let the database do what it's good at. -- http://mail.python.org/mailman/listinfo/python-list
RE: hash() yields different results for different platforms
The hash is not expected to be unique, it just provides a starting point for another search (usually linear ?). See http://en.wikipedia.org/wiki/Hash_function Helpfully, Maybe, Richard. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Piet van Oostrum Sent: 12 July 2006 10:56 To: python-list@python.org Subject: Re: hash() yields different results for different platforms >>>>> Grant Edwards <[EMAIL PROTECTED]> (GE) wrote: >GE> The low 32 bits match, so perhaps you should just use that >GE> portion of the returned hash? If the hashed should be unique, 32 bits is much too low if you have millions of entries. -- Piet van Oostrum <[EMAIL PROTECTED]> URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4] Private email: [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
> Grant Edwards <[EMAIL PROTECTED]> (GE) wrote: >GE> The low 32 bits match, so perhaps you should just use that >GE> portion of the returned hash? If the hashed should be unique, 32 bits is much too low if you have millions of entries. -- Piet van Oostrum <[EMAIL PROTECTED]> URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4] Private email: [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
Using Python's hash as column in the table might not be a good idea. You just found out why. So you could instead just use the base url and create an index based on that so next time just quickly get all urls from same base address then do a linear search for a specific one, or even easier, implement your own hashes without using any of the Python's built-in hash() functions. For example, transform each character to an int and multply them all mod 2^32-1 or something like that. Even better I think someone already posted the Python's way of generating hashes for string, well, just re-implement it in Python such that your version will yield the same hash on any platform. Hope this helps, Nick V. Qiangning Hong wrote: > I'm writing a spider. I have millions of urls in a table (mysql) to > check if a url has already been fetched. To check fast, I am > considering to add a "hash" column in the table, make it a unique key, > and use the following sql statement: > insert ignore into urls (url, hash) values (newurl, hash_of_newurl) > to add new url. > > I believe this will be faster than making the "url" column unique key > and doing string comparation. Right? > > However, when I come to Python's builtin hash() function, I found it > produces different values in my two computers! In a pentium4, > hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does > hash function depend on machine's word length? > > If it does, I must consider another hash algorithm because the spider > will run concurrently in several computers, some are 32-bit, some are > 64-bit. Is md5 a good choice? Will it be too slow that I have no > performance gain than using the "url" column directly as the unique > key? > > I will do some benchmarking to find it out. But while making my hands > dirty, I would like to hear some advice from experts here. :) -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
Qiangning Hong wrote: > /.../ add a "hash" column in the table, make it a unique key at this point, you should have slapped yourself on the forehead, and gone back to the drawing board. -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
[Grant Edwards] >> ... >> The low 32 bits match, so perhaps you should just use that >> portion of the returned hash? >> >> >>> hex(12416037344) >> '0x2E40DB1E0L' >> >>> hex(-468864544 & 0x) >> '0xE40DB1E0L' >> >> >>> hex(12416037344 & 0x) >> '0xE40DB1E0L' >> >>> hex(-468864544 & 0x) >> '0xE40DB1E0L' [Qiangning Hong] > Is this relationship (same low 32 bits) guaranteed? No. Nothing about hashes is guaranteed, except that when x and y are of a hashable type, and x == y, then hash(x) == hash(y) too. > Will it change in the future version? That's possible, but not planned. Note that the guts of string hashing in CPython today is implemented via while (--len >= 0) x = (103*x) ^ *p++; where x is C type "long", and the C language doesn't even define what that does (behavior when signed multiplication overflows isn't defined in C). -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
Grant Edwards wrote: > On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: > > However, when I come to Python's builtin hash() function, I > > found it produces different values in my two computers! In a > > pentium4, hash('a') -> -468864544; in a amd64, hash('a') -> > > 12416037344. Does hash function depend on machine's word > > length? > > Apparently. :) > > The low 32 bits match, so perhaps you should just use that > portion of the returned hash? > > >>> hex(12416037344) > '0x2E40DB1E0L' > >>> hex(-468864544 & 0x) > '0xE40DB1E0L' > > >>> hex(12416037344 & 0x) > '0xE40DB1E0L' > >>> hex(-468864544 & 0x) > '0xE40DB1E0L' Is this relationship (same low 32 bits) guaranteed? Will it change in the future version? -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
Grant Edwards wrote: > On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: > > > I'm writing a spider. I have millions of urls in a table (mysql) to > > check if a url has already been fetched. To check fast, I am > > considering to add a "hash" column in the table, make it a unique key, > > and use the following sql statement: > > insert ignore into urls (url, hash) values (newurl, hash_of_newurl) > > to add new url. > > > > I believe this will be faster than making the "url" column unique key > > and doing string comparation. Right? > > I doubt it will be significantly faster. Comparing two strings > and hashing a string are both O(N). Playing Devil's Advocate: The hash would be a one-time operation during database insertion, whereas string comparison would happen every search. Conceivably, comparing hash strings (which is O(1)) could result in a big savings compared to comparing regular strings; but I expect most decent sql implementations already hash data internally, so rolling your own hash would be useless at best. If the OP's database is lacking, md5 is probably fine. Perhaps using a subset of the md5 (the low 32 bits, say) could speed up comparisons at risk of more collisions. Probably a good trade off unless the DB is humungous. Carl Banks -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: > I'm writing a spider. I have millions of urls in a table (mysql) to > check if a url has already been fetched. To check fast, I am > considering to add a "hash" column in the table, make it a unique key, > and use the following sql statement: > insert ignore into urls (url, hash) values (newurl, hash_of_newurl) > to add new url. > > I believe this will be faster than making the "url" column unique key > and doing string comparation. Right? I doubt it will be significantly faster. Comparing two strings and hashing a string are both O(N). > However, when I come to Python's builtin hash() function, I > found it produces different values in my two computers! In a > pentium4, hash('a') -> -468864544; in a amd64, hash('a') -> > 12416037344. Does hash function depend on machine's word > length? Apparently. :) The low 32 bits match, so perhaps you should just use that portion of the returned hash? >>> hex(12416037344) '0x2E40DB1E0L' >>> hex(-468864544 & 0x) '0xE40DB1E0L' >>> hex(12416037344 & 0x) '0xE40DB1E0L' >>> hex(-468864544 & 0x) '0xE40DB1E0L' -- Grant Edwards grante Yow! Uh-oh!! I forgot at to submit to COMPULSORY visi.comURINALYSIS! -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
"Qiangning Hong" <[EMAIL PROTECTED]> writes: > However, when I come to Python's builtin hash() function, I found it > produces different values in my two computers! In a pentium4, > hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does > hash function depend on machine's word length? The hash function is unspecified and can depend on anything the implementers feel like. It may(?) even be permitted to differ from one run of the interpreter to another (I haven't checked the spec for this). Don't count on it being consistent from machine to machine. > If it does, I must consider another hash algorithm because the spider > will run concurrently in several computers, some are 32-bit, some are > 64-bit. Is md5 a good choice? Will it be too slow that I have no > performance gain than using the "url" column directly as the unique key? If you're going to accept the overhead of an SQL database you might as well enjoy the use of the abstraction it gives you, instead of trying to implement what amounts to your own form of indexing instead of letting the db take care of it. But md5(url) is certainly very fast compared with processing the outgoing http connection that you presumably plan to open for each url. > I will do some benchmarking to find it out. That's the right way to answer questions like this. -- http://mail.python.org/mailman/listinfo/python-list