Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-19 Thread Hrvoje Niksic

On 10/19/2012 03:22 AM, Benjamin Peterson wrote:

It would be interesting to see how common it is for strings which have
their hash computed to be compared.


Since all identifier-like strings mentioned in Python are interned, and 
therefore have had their hash computed, I would imagine comparing them 
to be fairly common. After all, strings are often used as makeshift 
enums in Python.


On the flip side, those strings are typically small, so a measurable 
overall speed improvement brought by such a change seems unlikely.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-19 Thread Duncan Booth
Hrvoje Niksic hrvoje.nik...@avl.com wrote:

 On 10/19/2012 03:22 AM, Benjamin Peterson wrote:
 It would be interesting to see how common it is for strings which have
 their hash computed to be compared.
 
 Since all identifier-like strings mentioned in Python are interned, and 
 therefore have had their hash computed, I would imagine comparing them 
 to be fairly common. After all, strings are often used as makeshift 
 enums in Python.
 
 On the flip side, those strings are typically small, so a measurable 
 overall speed improvement brought by such a change seems unlikely.

I'm pretty sure it would result in a small slowdown.

Many (most?) of the comparisons against interned identifiers will be done 
by dictionary lookups and the dictionary lookup code only tries the string 
comparison after it has determined that the hashes match. The only time 
dictionary key strings contents are actually compared is when the hash 
matches but the pointers are different; it is already the case that if the 
hashes don't match the strings are never compared.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-19 Thread Victor Stinner
2012/10/19 Benjamin Peterson benja...@python.org:
 It would be interesting to see how common it is for strings which have
 their hash computed to be compared.

I implemented a quick hack. When running ./python -m test test_os:
Python calls PyUnicode_RichCompare() 15206 times with Py_EQ or Py_NE
operator. In 41.4% (6295 calls), the hash of the two operands is
known. In 41.2% (6262 times on 15206), the hash of the two operands
are known *and are different*!

The hit rate may depend since when the process was started. For
example, in a fresh interpreter: the hit rate is only 7% (189 hit /
2703 calls).

When running the test suite, the hit rate is around 80% (hashs are
known in 90%) after running 70 tests. At the same time, the average of
string length is 4.1 characters and quite all strings are pure ASCII.

I create the issue http://bugs.python.org/issue16286 to discuss this
optimization.

Victor
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-19 Thread Benjamin Peterson
2012/10/19 Duncan Booth duncan.bo...@suttoncourtenay.org.uk:
 Hrvoje Niksic hrvoje.nik...@avl.com wrote:

 On 10/19/2012 03:22 AM, Benjamin Peterson wrote:
 It would be interesting to see how common it is for strings which have
 their hash computed to be compared.

 Since all identifier-like strings mentioned in Python are interned, and
 therefore have had their hash computed, I would imagine comparing them
 to be fairly common. After all, strings are often used as makeshift
 enums in Python.

 On the flip side, those strings are typically small, so a measurable
 overall speed improvement brought by such a change seems unlikely.

 I'm pretty sure it would result in a small slowdown.

Dictionary code has its own special path for string comparisons, so
this is not an issue.



-- 
Regards,
Benjamin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-19 Thread Oscar Benjamin
On 19 October 2012 11:02, Duncan Booth
duncan.bo...@suttoncourtenay.org.uk wrote:
 Hrvoje Niksic hrvoje.nik...@avl.com wrote:

 On 10/19/2012 03:22 AM, Benjamin Peterson wrote:
 It would be interesting to see how common it is for strings which have
 their hash computed to be compared.

 Since all identifier-like strings mentioned in Python are interned, and
 therefore have had their hash computed, I would imagine comparing them
 to be fairly common. After all, strings are often used as makeshift
 enums in Python.

 On the flip side, those strings are typically small, so a measurable
 overall speed improvement brought by such a change seems unlikely.

 I'm pretty sure it would result in a small slowdown.

 Many (most?) of the comparisons against interned identifiers will be done
 by dictionary lookups and the dictionary lookup code only tries the string
 comparison after it has determined that the hashes match. The only time
 dictionary key strings contents are actually compared is when the hash
 matches but the pointers are different; it is already the case that if the
 hashes don't match the strings are never compared.

My understanding is that:

The first part of string comparison is an identity check (pointer
comparison) which catches the case of comparing two equal interned
strings. What is not checked for is comparing two unequal interned
strings (check both strings are interned and then conclude that
unequal pointers means unequal strings). The hashes are also not
checked during a standard string comparison. The lengths are also not
checked since the unicode_compare function is a generic cmp() function
for a rich comparison rather than a specific function for
equal/unequal comparison:
http://hg.python.org/cpython/file/321414874b26/Objects/unicodeobject.c#l6114

Most string comparisons occur during dict lookup in which case the
hash is (implicitly?) checked before the strings are compared. A
second hash comparison in this case would often be redundant.


My *opinion* is that:

On average character by character string comparison stops very quickly
after comparing 1 or 2 characters. If this is assumed to be the case
then it might be wasteful to check whether the strings have hashes and
then compare the hashes (likewise for checking whether they are
interned).

This was the subject of lengthy debate in the recent python-list
thread comparing strings from the back:
http://mail.python.org/pipermail/python-list/2012-September/629866.html

Consider the case where the strings are not interned, non-empty, and
differ in the first character:

def current(s1, s2):
if s1 is s2:
return True
elif s1.len  0  and s2.len  0 and s1[0] != s2[0]:
return False
# Most of the time we don't reach this point
...

def proposed_hashcheck(s1, s2):
if s1 is s2:
return True
elif s1.hash is not None and s2.hash is not None and s1.hash != s2.hash:
return False
# I think we often reach this point when not comparing within a dict
elif s1.len  0  and s2.len  0 and s1[0] != s2[0]:
return False
...

When comparing within a dict the hash check (modulo dict size) is
already implicitly likely-affirmative and I assume (I'm not sure) that
the full hashes are explicitly checked before comparing the strings.
Either way I think that the additional hash check within the string
comparison will normally be redundant.

When not comparing within a dict, it's hard to say how often the
strings have hashes but if they usually don't then the extra
hash-check will normally be redundant in that case as well. If the
character by character comparison fails at the first character it may
be quicker to optimise for that case than checking if both hashes are
present and then comparing the hashes.

Other potential optimisations include comparing length before
comparing characters:

def proposed_lencheck(s1, s2):
if s1 is s2:
return True
elif s1.len != s2.len:
return False
elif s1.len  0  and s2.len  0 and s1[0] != s2[0]:
return False
...

In the above I'm trying to represent the fast path when comparing
strings that differ in the first character. One case that isn't helped
by either of length or hash checking is comparing equal strings: you
always need to check all the characters (unless the identity check
succeeds).

Depending on how often unequal interned strings are compared another
method could be:

def proposed_interncheck(s1, s2):
if s1 is s2:
return True
elif s1.interned and s2.interned:
return False
elif s1.len  0  and s2.len  0 and s1[0] != s2[0]:
return False
...

All of these solutions add complexity over the current method and
require the string comparison functions to be refactored from the
current generic rich compare method. None of them (I think) would help
much when 

Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-19 Thread Brett Cannon
On Fri, Oct 19, 2012 at 8:36 AM, Victor Stinner victor.stin...@gmail.comwrote:

 2012/10/19 Benjamin Peterson benja...@python.org:
  It would be interesting to see how common it is for strings which have
  their hash computed to be compared.

 I implemented a quick hack. When running ./python -m test test_os:
 Python calls PyUnicode_RichCompare() 15206 times with Py_EQ or Py_NE
 operator. In 41.4% (6295 calls), the hash of the two operands is
 known. In 41.2% (6262 times on 15206), the hash of the two operands
 are known *and are different*!

 The hit rate may depend since when the process was started. For
 example, in a fresh interpreter: the hit rate is only 7% (189 hit /
 2703 calls).

 When running the test suite, the hit rate is around 80% (hashs are
 known in 90%) after running 70 tests. At the same time, the average of
 string length is 4.1 characters and quite all strings are pure ASCII.

 I create the issue http://bugs.python.org/issue16286 to discuss this
 optimization.


If you want to measure the performance impact compared to a clean build
then you can use the unladen benchmarks as it contains several Python
3-compatible benchmarks now.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-18 Thread Steven D'Aprano

On 19/10/12 12:03, Victor Stinner wrote:

Hi,

I would like to know if there a reason for not using the hash of
(bytes or unicode) strings when comparing two objects and the hash of
the two objects was already been computed. Using the hash would speed
up comparaison of long strings when the two strings are different.


Assuming the hash has already been compared, then I imagine it would be
faster.


Something like:

 if ((op == Py_EQ || op == Py_NE)
   a-ob_shash != -1
   b-ob_shash != -1
   a-ob_shash != b-ob_shash) {
 /* strings are not equal */
 }

There are hash collision, so a-ob_shash == b-ob_shash doesn't mean
that the two strings are equal. But if the two hashs are different,
the two strings are different. Isn't it?


I would certainly hope so :)


--
Steven
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-18 Thread Benjamin Peterson
2012/10/18 Victor Stinner victor.stin...@gmail.com:
 Hi,

 I would like to know if there a reason for not using the hash of
 (bytes or unicode) strings when comparing two objects and the hash of
 the two objects was already been computed. Using the hash would speed
 up comparaison of long strings when the two strings are different.

 Something like:

 if ((op == Py_EQ || op == Py_NE)
  a-ob_shash != -1
  b-ob_shash != -1
  a-ob_shash != b-ob_shash) {
 /* strings are not equal */
 }

 There are hash collision, so a-ob_shash == b-ob_shash doesn't mean
 that the two strings are equal. But if the two hashs are different,
 the two strings are different. Isn't it?

It would be interesting to see how common it is for strings which have
their hash computed to be compared.



-- 
Regards,
Benjamin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Why not using the hash when comparing strings?

2012-10-18 Thread MRAB

On 2012-10-19 02:03, Victor Stinner wrote:

Hi,

I would like to know if there a reason for not using the hash of
(bytes or unicode) strings when comparing two objects and the hash of
the two objects was already been computed. Using the hash would speed
up comparaison of long strings when the two strings are different.

Something like:

 if ((op == Py_EQ || op == Py_NE)
  a-ob_shash != -1
  b-ob_shash != -1
  a-ob_shash != b-ob_shash) {
 /* strings are not equal */
 }

There are hash collision, so a-ob_shash == b-ob_shash doesn't mean
that the two strings are equal. But if the two hashs are different,
the two strings are different. Isn't it?


Correct. It's true for any hashable type.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com