[issue1772851] Decimal and long hash, compatibly and efficiently

2007-09-19 Thread Facundo Batista

Facundo Batista added the comment:

Applied the patches long_hash.patch (rev 58208) and decimal_hash_v2.patch
(rev 58211).

--
resolution:  -> fixed
status: open -> closed

_
Tracker <[EMAIL PROTECTED]>

_



[issue1772851] Decimal and long hash, compatibly and efficiently

2007-09-14 Thread Mark Dickinson


Mark Dickinson added the comment:

And here's the updated decimal.__hash__ patch that goes with the 
long.__hash__ patch.

_
Tracker <[EMAIL PROTECTED]>

_
Index: Lib/decimal.py
===================================================================
--- Lib/decimal.py  (revision 58158)
+++ Lib/decimal.py  (working copy)
@@ -770,10 +770,17 @@
             if self._isnan():
                 raise TypeError('Cannot hash a NaN value.')
             return hash(str(self))
-        i = int(self)
-        if self == Decimal(i):
-            return hash(i)
-        assert self.__nonzero__()   # '-0' handled by integer case
+        if not self:
+            return 0
+        if self._isinteger():
+            op = _WorkRep(self.to_integral_value())
+            # to make computation feasible for Decimals with large
+            # exponent, we use the fact that hash(n) == hash(m) for
+            # any two nonzero integers n and m such that (i) n and m
+            # have the same sign, and (ii) n is congruent to m modulo
+            # 2**64-1.  So we can replace hash((-1)**s*c*10**e) with
+            # hash((-1)**s*c*pow(10, e, 2**64-1)).
+            return hash((-1)**op.sign*op.int*pow(10, op.exp, 2**64-1))
         return hash(str(self.normalize()))
 
     def as_tuple(self):
Index: Lib/test/test_decimal.py
===================================================================
--- Lib/test/test_decimal.py  (revision 58158)
+++ Lib/test/test_decimal.py  (working copy)
@@ -910,6 +910,38 @@
     def test_hash_method(self):
         #just that it's hashable
         hash(Decimal(23))
+
+        test_values = [Decimal(sign*(2**m + n))
+                       for m in [0, 14, 15, 16, 17, 30, 31,
+                                 32, 33, 62, 63, 64, 65, 66]
+                       for n in range(-10, 10)
+                       for sign in [-1, 1]]
+        test_values.extend([
+            Decimal("-0"), # zeros
+            Decimal("0.00"),
+            Decimal("-0.000"),
+            Decimal("0E10"),
+            Decimal("-0E12"),
+            Decimal("10.0"), # negative exponent
+            Decimal("-23.0"),
+            Decimal("1230E100"), # positive exponent
+            Decimal("-4.5678E50"),
+            # a value for which hash(n) != hash(n % (2**64-1))
+            # in Python pre-2.6
+            Decimal(2**64 + 2**32 - 1),
+            # selection of values which fail with the old (before
+            # version 2.6) long.__hash__
+            Decimal("1.634E100"),
+            Decimal("90.697E100"),
+            Decimal("188.83E100"),
+            Decimal("1652.9E100"),
+            Decimal("56531E100"),
+            ])
+
+        # check that hash(d) == hash(int(d)) for integral values
+        for value in test_values:
+            self.assertEqual(hash(value), hash(int(value)))
+
         #the same hash that to an int
         self.assertEqual(hash(Decimal(23)), hash(23))
         self.assertRaises(TypeError, hash, Decimal('NaN'))
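
As a quick illustration of the identity the patch relies on (a sketch,
assuming a long.__hash__ that is periodic with period 2**64 - 1; the
names M, s, c and e exist only for this example):

    # c*10**e and c*pow(10, e, M) are congruent modulo M = 2**64 - 1,
    # so a hash with period M cannot tell them apart
    M = 2**64 - 1
    s, c, e = 0, 1, 100                  # Decimal("1E100"): sign, coefficient, exponent
    big   = (-1)**s * c * 10**e          # the full integer value
    small = (-1)**s * c * pow(10, e, M)  # the cheap replacement
    assert big % M == small % M          # congruent, same sign, both nonzero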



[issue1772851] Decimal and long hash, compatibly and efficiently

2007-09-14 Thread Mark Dickinson


Mark Dickinson added the comment:

Here's a revised patch that only touches longobject.c and does the minimum
necessary to alter long.__hash__ from being *almost* completely predictable
to being completely predictable.  To summarize the effects of this patch:

 - the current long.__hash__ is predictable except for a very small number
of (unpredictable) inputs; this patch makes it predictable for all inputs,
thereby making it possible to define an efficient Decimal.__hash__.  To be
precise, the patched long.__hash__ has the property that it's periodic with
period ULONG_MAX on each of the ranges [1, infinity) and (-infinity, -1].

 - the value of hash(n) is unchanged for any n that's small enough to be
representable as an int, and also unchanged for the vast majority of long
integers n of reasonable size.  (For integers up to 450 digits long, the
new hash value differs from the old for around 1 in every 2500 integers on
a 32-bit system; on a 64-bit system the figure is 1 in every 1 billion.)

 - On my test system (Dual Xeon 2.8GHz/SuSE Linux 10.2/gcc 4.1), using
timeit.timeit('hash(n)') there's no measurable slowdown for small integers.
Unfortunately there *is* a slowdown for larger integers: around 20% extra
time for 100-digit integers and around 70% extra time for 1000-digit
integers, though in all cases the times are fractions of a microsecond.
It doesn't help that gcc doesn't optimize the inner loop perfectly: with
the -O3 flag, the addition and subsequent correction are compiled to (with
x in eax and v->ob_digit[i] in edx):

    addl    %eax, %edx
    cmpl    %eax, %edx
    adcl    $0, %edx

and the cmpl here is entirely redundant.  Maybe there's some way of writing
the C code that makes it easier for gcc to pick up on this optimization?
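
For reference, a small Python model of the patched inner-loop step (an
illustration of the compensation described above, not the C code itself),
checking that unsigned wraparound plus the increment preserves the value
modulo ULONG_MAX:

    # one inner-loop step, modelled on a system with a 64-bit unsigned long
    ULONG_MAX = 2**64 - 1

    def add_digit(x, digit):
        s = (x + digit) % 2**64   # C unsigned addition wraps modulo 2**64
        if s < digit:             # the addition overflowed,
            s += 1                # so increment to stay congruent mod ULONG_MAX
        return s

    x = ULONG_MAX - 3
    assert add_digit(x, 10) == (x + 10) % ULONG_MAX == 7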

_
Tracker <[EMAIL PROTECTED]>

_
Index: Objects/longobject.c
===================================================================
--- Objects/longobject.c  (revision 58158)
+++ Objects/longobject.c  (working copy)
@@ -1937,10 +1937,18 @@
                 i = -(i);
         }
 #define LONG_BIT_SHIFT (8*sizeof(long) - SHIFT)
+        /* The following loop produces a C long x such that (unsigned long)x
+           is congruent to the absolute value of v modulo ULONG_MAX.  The
+           resulting x is nonzero if and only if v is. */
         while (--i >= 0) {
                 /* Force a native long #-bits (32 or 64) circular shift */
                 x = ((x << SHIFT) & ~MASK) | ((x >> LONG_BIT_SHIFT) & MASK);
                 x += v->ob_digit[i];
+                /* If the addition above overflowed (thinking of x as
+                   unsigned), we compensate by incrementing.  This preserves
+                   the value modulo ULONG_MAX. */
+                if ((unsigned long)x < v->ob_digit[i])
+                        x++;
         }
 #undef LONG_BIT_SHIFT
         x = x * sign;
Index: Lib/test/test_hash.py
===================================================================
--- Lib/test/test_hash.py  (revision 58158)
+++ Lib/test/test_hash.py  (working copy)
@@ -18,10 +18,21 @@
 
     def test_numeric_literals(self):
         self.same_hash(1, 1L, 1.0, 1.0+0.0j)
+        self.same_hash(0, 0L, 0.0, 0.0+0.0j)
+        self.same_hash(-1, -1L, -1.0, -1.0+0.0j)
+        self.same_hash(-2, -2L, -2.0, -2.0+0.0j)
 
     def test_coerced_integers(self):
         self.same_hash(int(1), long(1), float(1), complex(1),
                        int('1'), float('1.0'))
+        self.same_hash(int(-2**31), long(-2**31), float(-2**31))
+        self.same_hash(int(1-2**31), long(1-2**31), float(1-2**31))
+        self.same_hash(int(2**31-1), long(2**31-1), float(2**31-1))
+        # for 64-bit platforms
+        self.same_hash(int(2**31), long(2**31), float(2**31))
+        self.same_hash(int(-2**63), long(-2**63), float(-2**63))
+        self.same_hash(int(1-2**63), long(1-2**63))
+        self.same_hash(int(2**63-1), long(2**63-1))
 
     def test_coerced_floats(self):
         self.same_hash(long(1.23e300), float(1.23e300))



[issue1772851] Decimal and long hash, compatibly and efficiently

2007-09-12 Thread ajaksu

ajaksu added the comment:

Thanks a lot for the explanation. I still think it could be a valuable
(C?)py3k change (but now for the predictability, as you say :)
Regarding the performance of Decimal.__hash__:

 I believe (int) hash performance would be a moot issue if
Decimal.__int__ was (much) faster. For the last three days I've been
trying to convert decimal.py to use longs instead of tuples for
Decimal._int. Here are some (hand-picked) results (on a Celeron M 1.46GHz):

In [906]: D = int_decimal.Decimal
(...)
In [914]: g = D("1E100")
(...)
In [916]: g
Out[916]: Decimal("1E+100")
(...)
In [918]: %timeit k = int(g)
10 loops, best of 3: 3.22 s per loop
(...)
In [920]: %timeit k = hash(int(g))
10 loops, best of 3: 3.28 s per loop
(...)
In [922]: log(int(g))
Out[922]: 2302585.0929940455

In [923]: log(10**100)
Out[923]: 2302585.0929940455

 I'm not very competent in many of the skills needed for the job AND I'm
working between field trips (the next starts tomorrow), so, unfortunately,
the results are crude and awfully buggy so far. Most of the "(...)"
elisions above were exceptions for simple things like "hash(g)", and
there are lots of these issues.

The conversion is based on using divisions and multiplications to
truncate/extend Decimal._int, from the notion that the bottleneck is the
tuple->str->int conversion. I predict lots of currently fast things
getting slower and a higher memory footprint, but it might still be
worth it and I'm willing to test the idea.
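
A rough sketch of that idea (the helper names are invented for the
example, this isn't the actual work in progress): with the coefficient
held as an int instead of a tuple of digits, extending and truncating
become plain arithmetic:

    def extend(coeff, n):
        return coeff * 10**n      # append n zero digits

    def truncate(coeff, n):
        return coeff // 10**n     # drop the last n digits

    assert extend(123, 2) == 12300
    assert truncate(12345, 3) == 12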

 As I told Facundo in a private email (and another will arrive today or
tomorrow), the simple cases seem workable to me, but I'm not able to tell
right now whether this has any real chance of succeeding (because I'm bad
at maths :/). I've not even looked at __pow__, __mul__ and __div__ yet!

Anyway, I'll try to tidy things up as much as I can and email Facundo
(and yourself, if you are interested) the results of this large ugly
hack before now + 24h.

_
Tracker <[EMAIL PROTECTED]>

_



[issue1772851] Decimal and long hash, compatibly and efficiently

2007-09-12 Thread Mark Dickinson


Mark Dickinson added the comment:

To help explain what's going on, here's some Python code.  The Python
function long_hash1 below has the properties that:

(1) long_hash1(n) == hash(n) for almost all integers n, with a very
small set of exceptions.  (The first positive exception is 2**33-1, and
the next after that is 3*2**33 - 1.  The negative exceptions have the
same absolute value as the positive ones, so the first negative
exception is n = 1-2**33).

(2) long_hash1(n) has a simple closed form, making it possible to
compute an equivalent Decimal hash efficiently.

(3) The current long_hash in Objects/longobject.c can be changed, by
adding a single line of C code, so that hash(n) == long_hash1(n) always.
That line is:

  if ((unsigned long)x < v->ob_digit[i]) x++;

added directly after the addition.

Explanation: the exceptions in (1) arise precisely when the addition

 x += v->ob_digit[i]

in the long_hash code overflows (in the *unsigned* sense---equivalently,
when the addition turns a negative x into a nonnegative one).  Since
ob_digit[i] only has 15 significant bits, and x has 32 (or 64), such
overflow is going to be rare---it'll occur for roughly one addition in
every 2**18 (or about 4 additions in every million), for `random' n.  So
for `most' n, hash(n) and long_hash1(n) are equal.
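
For concreteness, here's a Python model of the *unpatched* long_hash
(assuming a 32-bit long and SHIFT = 15), enough to exhibit the first
positive exception, n = 2**33 - 1:

    def old_long_hash(n):
        SHIFT, MASK = 15, 2**15 - 1
        sign = -1 if n < 0 else 1
        n = abs(n)
        digits = []
        while n:                      # split into 15-bit digits, LSB first
            digits.append(n & MASK)
            n >>= SHIFT
        x = 0
        for d in reversed(digits):    # most significant digit first
            # 32-bit circular shift, then uncompensated unsigned addition
            x = ((x << SHIFT) | (x >> (32 - SHIFT))) % 2**32
            x = (x + d) % 2**32
        if x >= 2**31:                # reinterpret as a signed 32-bit C long
            x -= 2**32
        x *= sign
        return -2 if x == -1 else x

    # the final addition wraps to 0, while the compensated loop
    # (long_hash1 below) gives 1
    assert old_long_hash(2**33 - 1) == 0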

So what about long_hash2(n)?  This is what the patched long_hash
is actually equivalent to;  it's essentially the same as long_hash1(n),
but fixed up to be genuinely periodic;  that is, long_hash2(n+ULONG_MAX)
== long_hash2(n) for any integer n.  long_hash1 and long_hash2 are equal
almost always for positive n (the exceptions being multiples of
ULONG_MAX), equal about half the time for negative n, and off-by-one
from each other about half the time.

I don't really know whether the periodicity is that useful---the
predictability is really what matters, so the 1-line change to produce a
hash function equivalent to long_hash1 would do just fine as far as
making Decimal work.

For what it's worth, I regard this patch as an ugly hack.  I don't like
the idea of changing something as fundamental as long.__hash__, and I
don't like having Decimal.__hash__ depend on knowing exactly how
long.__hash__ gets its values.   But there's a real problem here, namely
that, for example,

>>> set([Decimal("1E100")])

takes several minutes to complete.  (Try it!)  And I can't think of any
other easy way to solve this.  Alternative suggestions welcomed!



LONG_BITS = 32
W = 2**LONG_BITS
HW = W // 2
ULONG_MAX = W - 1

def long_hash1(n):
    if n == 0:
        h = 0
    else:
        h = (1 + (abs(n)-1) % ULONG_MAX) * (1 if n > 0 else -1)

    # reduce mod W to lie in the usual range for a (signed) C long
    h = (h + HW) % W - HW
    if h == -1:
        h = -2
    return int(h)

def long_hash2(n):
    h = n % ULONG_MAX

    h = (h + HW) % W - HW
    if h == -1:
        h = -2
    return int(h)
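
A few quick checks of the claims above (run after the definitions just
given, with LONG_BITS = 32):

    n = 12345
    assert long_hash2(n + ULONG_MAX) == long_hash2(n)  # long_hash2 is periodic
    assert long_hash1(n) == long_hash2(n)              # equal for most positive n
    assert long_hash1(ULONG_MAX) == -2                 # but they differ at multiples
    assert long_hash2(ULONG_MAX) == 0                  #   of ULONG_MAX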

_
Tracker <[EMAIL PROTECTED]>

_



[issue1772851] Decimal and long hash, compatibly and efficiently

2007-09-12 Thread ajaksu

ajaksu added the comment:

IMHO this patch should be considered for (at least) py3k, in which long
becomes the new int. As there is current interest in long/int
performance[1], this seems like a good time to raise this kind of issue.

 Mark, could you please outline the semantic changes (especially at
wraparound)? Would the hash value changes happen only in the "wraparound
(= subtraction of 2**(bits_in_long)) by incrementing" cases?
Would hash(-1) become -1?

Sorry to ask all that, but my C (and maths) is (are) *very* limited. I
guess an invalid hash value is the reason we currently have hash(-1) ==
hash(-2) == -2; if so, the period would have to skip that, right? I
don't see the need for "y" and would chain some IFs (if x == -1 OR
abs(x) == ULONG_MAX), but those are weak guesses.


 Maybe it's worth discussing this patch without linking its most
important change (long hash becoming periodic) to Decimal: it would have
many impacts beyond Decimal and is rather a core change (as opposed to
Lib). Also, a recent discussion on hashing[2] shows that sometimes the
current behavior was well planned :)

1 -  http://mail.python.org/pipermail/python-3000/2007-September/010374.html
2 - http://mail.python.org/pipermail/python-3000/2007-September/010327.html

--
nosy: +ajaksu2

_
Tracker <[EMAIL PROTECTED]>

_