Re: [Python-Dev] Why not using the hash when comparing strings?
On 10/19/2012 03:22 AM, Benjamin Peterson wrote: It would be interesting to see how common it is for strings which have their hash computed to be compared. Since all identifier-like strings mentioned in Python are interned, and therefore have had their hash computed, I would imagine comparing them to be fairly common. After all, strings are often used as makeshift enums in Python. On the flip side, those strings are typically small, so a measurable overall speed improvement brought by such a change seems unlikely. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
Hrvoje Niksic hrvoje.nik...@avl.com wrote: On 10/19/2012 03:22 AM, Benjamin Peterson wrote: It would be interesting to see how common it is for strings which have their hash computed to be compared. Since all identifier-like strings mentioned in Python are interned, and therefore have had their hash computed, I would imagine comparing them to be fairly common. After all, strings are often used as makeshift enums in Python. On the flip side, those strings are typically small, so a measurable overall speed improvement brought by such a change seems unlikely. I'm pretty sure it would result in a small slowdown. Many (most?) of the comparisons against interned identifiers will be done by dictionary lookups and the dictionary lookup code only tries the string comparison after it has determined that the hashes match. The only time dictionary key strings contents are actually compared is when the hash matches but the pointers are different; it is already the case that if the hashes don't match the strings are never compared. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
2012/10/19 Benjamin Peterson benja...@python.org: It would be interesting to see how common it is for strings which have their hash computed to be compared. I implemented a quick hack. When running ./python -m test test_os: Python calls PyUnicode_RichCompare() 15206 times with Py_EQ or Py_NE operator. In 41.4% (6295 calls), the hash of the two operands is known. In 41.2% (6262 times on 15206), the hash of the two operands are known *and are different*! The hit rate may depend since when the process was started. For example, in a fresh interpreter: the hit rate is only 7% (189 hit / 2703 calls). When running the test suite, the hit rate is around 80% (hashs are known in 90%) after running 70 tests. At the same time, the average of string length is 4.1 characters and quite all strings are pure ASCII. I create the issue http://bugs.python.org/issue16286 to discuss this optimization. Victor ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
2012/10/19 Duncan Booth duncan.bo...@suttoncourtenay.org.uk: Hrvoje Niksic hrvoje.nik...@avl.com wrote: On 10/19/2012 03:22 AM, Benjamin Peterson wrote: It would be interesting to see how common it is for strings which have their hash computed to be compared. Since all identifier-like strings mentioned in Python are interned, and therefore have had their hash computed, I would imagine comparing them to be fairly common. After all, strings are often used as makeshift enums in Python. On the flip side, those strings are typically small, so a measurable overall speed improvement brought by such a change seems unlikely. I'm pretty sure it would result in a small slowdown. Dictionary code has its own special path for string comparisons, so this is not an issue. -- Regards, Benjamin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
On 19 October 2012 11:02, Duncan Booth duncan.bo...@suttoncourtenay.org.uk wrote: Hrvoje Niksic hrvoje.nik...@avl.com wrote: On 10/19/2012 03:22 AM, Benjamin Peterson wrote: It would be interesting to see how common it is for strings which have their hash computed to be compared. Since all identifier-like strings mentioned in Python are interned, and therefore have had their hash computed, I would imagine comparing them to be fairly common. After all, strings are often used as makeshift enums in Python. On the flip side, those strings are typically small, so a measurable overall speed improvement brought by such a change seems unlikely. I'm pretty sure it would result in a small slowdown. Many (most?) of the comparisons against interned identifiers will be done by dictionary lookups and the dictionary lookup code only tries the string comparison after it has determined that the hashes match. The only time dictionary key strings contents are actually compared is when the hash matches but the pointers are different; it is already the case that if the hashes don't match the strings are never compared. My understanding is that: The first part of string comparison is an identity check (pointer comparison) which catches the case of comparing two equal interned strings. What is not checked for is comparing two unequal interned strings (check both strings are interned and then conclude that unequal pointers means unequal strings). The hashes are also not checked during a standard string comparison. The lengths are also not checked since the unicode_compare function is a generic cmp() function for a rich comparison rather than a specific function for equal/unequal comparison: http://hg.python.org/cpython/file/321414874b26/Objects/unicodeobject.c#l6114 Most string comparisons occur during dict lookup in which case the hash is (implicitly?) checked before the strings are compared. A second hash comparison in this case would often be redundant. My *opinion* is that: On average character by character string comparison stops very quickly after comparing 1 or 2 characters. If this is assumed to be the case then it might be wasteful to check whether the strings have hashes and then compare the hashes (likewise for checking whether they are interned). This was the subject of lengthy debate in the recent python-list thread comparing strings from the back: http://mail.python.org/pipermail/python-list/2012-September/629866.html Consider the case where the strings are not interned, non-empty, and differ in the first character: def current(s1, s2): if s1 is s2: return True elif s1.len 0 and s2.len 0 and s1[0] != s2[0]: return False # Most of the time we don't reach this point ... def proposed_hashcheck(s1, s2): if s1 is s2: return True elif s1.hash is not None and s2.hash is not None and s1.hash != s2.hash: return False # I think we often reach this point when not comparing within a dict elif s1.len 0 and s2.len 0 and s1[0] != s2[0]: return False ... When comparing within a dict the hash check (modulo dict size) is already implicitly likely-affirmative and I assume (I'm not sure) that the full hashes are explicitly checked before comparing the strings. Either way I think that the additional hash check within the string comparison will normally be redundant. When not comparing within a dict, it's hard to say how often the strings have hashes but if they usually don't then the extra hash-check will normally be redundant in that case as well. If the character by character comparison fails at the first character it may be quicker to optimise for that case than checking if both hashes are present and then comparing the hashes. Other potential optimisations include comparing length before comparing characters: def proposed_lencheck(s1, s2): if s1 is s2: return True elif s1.len != s2.len: return False elif s1.len 0 and s2.len 0 and s1[0] != s2[0]: return False ... In the above I'm trying to represent the fast path when comparing strings that differ in the first character. One case that isn't helped by either of length or hash checking is comparing equal strings: you always need to check all the characters (unless the identity check succeeds). Depending on how often unequal interned strings are compared another method could be: def proposed_interncheck(s1, s2): if s1 is s2: return True elif s1.interned and s2.interned: return False elif s1.len 0 and s2.len 0 and s1[0] != s2[0]: return False ... All of these solutions add complexity over the current method and require the string comparison functions to be refactored from the current generic rich compare method. None of them (I think) would help much when
Re: [Python-Dev] Why not using the hash when comparing strings?
On Fri, Oct 19, 2012 at 8:36 AM, Victor Stinner victor.stin...@gmail.comwrote: 2012/10/19 Benjamin Peterson benja...@python.org: It would be interesting to see how common it is for strings which have their hash computed to be compared. I implemented a quick hack. When running ./python -m test test_os: Python calls PyUnicode_RichCompare() 15206 times with Py_EQ or Py_NE operator. In 41.4% (6295 calls), the hash of the two operands is known. In 41.2% (6262 times on 15206), the hash of the two operands are known *and are different*! The hit rate may depend since when the process was started. For example, in a fresh interpreter: the hit rate is only 7% (189 hit / 2703 calls). When running the test suite, the hit rate is around 80% (hashs are known in 90%) after running 70 tests. At the same time, the average of string length is 4.1 characters and quite all strings are pure ASCII. I create the issue http://bugs.python.org/issue16286 to discuss this optimization. If you want to measure the performance impact compared to a clean build then you can use the unladen benchmarks as it contains several Python 3-compatible benchmarks now. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
On 19/10/12 12:03, Victor Stinner wrote: Hi, I would like to know if there a reason for not using the hash of (bytes or unicode) strings when comparing two objects and the hash of the two objects was already been computed. Using the hash would speed up comparaison of long strings when the two strings are different. Assuming the hash has already been compared, then I imagine it would be faster. Something like: if ((op == Py_EQ || op == Py_NE) a-ob_shash != -1 b-ob_shash != -1 a-ob_shash != b-ob_shash) { /* strings are not equal */ } There are hash collision, so a-ob_shash == b-ob_shash doesn't mean that the two strings are equal. But if the two hashs are different, the two strings are different. Isn't it? I would certainly hope so :) -- Steven ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
2012/10/18 Victor Stinner victor.stin...@gmail.com: Hi, I would like to know if there a reason for not using the hash of (bytes or unicode) strings when comparing two objects and the hash of the two objects was already been computed. Using the hash would speed up comparaison of long strings when the two strings are different. Something like: if ((op == Py_EQ || op == Py_NE) a-ob_shash != -1 b-ob_shash != -1 a-ob_shash != b-ob_shash) { /* strings are not equal */ } There are hash collision, so a-ob_shash == b-ob_shash doesn't mean that the two strings are equal. But if the two hashs are different, the two strings are different. Isn't it? It would be interesting to see how common it is for strings which have their hash computed to be compared. -- Regards, Benjamin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Why not using the hash when comparing strings?
On 2012-10-19 02:03, Victor Stinner wrote: Hi, I would like to know if there a reason for not using the hash of (bytes or unicode) strings when comparing two objects and the hash of the two objects was already been computed. Using the hash would speed up comparaison of long strings when the two strings are different. Something like: if ((op == Py_EQ || op == Py_NE) a-ob_shash != -1 b-ob_shash != -1 a-ob_shash != b-ob_shash) { /* strings are not equal */ } There are hash collision, so a-ob_shash == b-ob_shash doesn't mean that the two strings are equal. But if the two hashs are different, the two strings are different. Isn't it? Correct. It's true for any hashable type. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com