Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2014-03-10 Thread David Naylor
On Friday, 26 August 2011 06:37:30 Armin Rigo wrote:
> Hi David,
> 
> On Thu, Aug 25, 2011 at 9:44 PM, David Naylor wrote:
> > Below is the patch, and results, for my proposed hash methods for
> > datetime.datetime (and easily adaptable to include tzinfo and the other
> > datetime objects).  I tried to make the hash safe for both 32bit and 64bit
> > systems, and beyond.
> 
> Yes, the patch looks good to me.  I can definitely see how it can be a
> huge improvement in performance :-)
> 
> If you can also "fix" the other __hash__ methods in the same way, it
> would be great.

To follow up on a very old email, the latest results are:

# python2.7 iforkey.py 
ifdict: [2.110611915588379, 2.12678599357605, 2.1126320362091064]
keydict: [2.1322460174560547, 2.098900079727173, 2.0998198986053467]
defaultdict: [3.184070110321045, 3.2007319927215576, 3.188380002975464]
# pypy2.2 iforkey.py
ifdict: [0.510915994644165, 0.23750996589660645, 0.2241990566253662]
keydict: [0.23270201683044434, 0.18279695510864258, 0.18002104759216309]
defaultdict: [3.4535930156707764, 3.3697848320007324, 3.392897129058838]

And using the latest datetime.py:
# pypy iforkey.py
ifdict: [0.2814958095550537, 0.23425602912902832, 0.2206539916992]
keydict: [0.23637700080871582, 0.18506789207458496, 0.1831810474395752]
defaultdict: [2.8174121379852295, 2.74626088142395, 2.7308008670806885]
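
Since the iforkey.py script itself is not reproduced in the archive, here is
a rough sketch of the three strategies it compares (ifdict, keydict,
defaultdict), reconstructed from the discussion further down in the thread;
the data set and every detail below are assumed, so treat this as an
illustration rather than the original benchmark.

import datetime

items = [datetime.datetime(2011, 8, (i % 28) + 1) for i in range(100000)]

def ifdict():
    # Guard each insertion with an explicit membership test.
    cache, b = {}, []
    for i in items:
        if i.day not in cache:
            cache[i.day] = i.date()
        b.append(cache[i.day])
    return b

def keydict():
    # Look up first; compute the value only when the key is missing.
    cache, b = {}, []
    for i in items:
        try:
            b.append(cache[i.day])
        except KeyError:
            value = cache[i.day] = i.date()
            b.append(value)
    return b

def defaultdict():
    # dict.setdefault: note that i.date() is evaluated on every pass.
    cache, b = {}, []
    for i in items:
        b.append(cache.setdefault(i.day, i.date()))
    return b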

Excellent, thank you :-)

Regards



Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-25 Thread Armin Rigo
Hi David,

On Thu, Aug 25, 2011 at 9:44 PM, David Naylor  wrote:
> Below is the patch, and results, for my proposed hash methods for
> datetime.datetime (and easily adaptable to include tzinfo and the other
> datetime objects).  I tried to make the hash safe for both 32bit and 64bit
> systems, and beyond.

Yes, the patch looks good to me.  I can definitely see how it can be a
huge improvement in performance :-)

If you can also "fix" the other __hash__ methods in the same way, it
would be great.


A bientôt,

Armin.


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-25 Thread David Naylor
On Tuesday, 16 August 2011 15:27:30 Armin Rigo wrote:
> Hi David,
> 
> On Mon, Aug 15, 2011 at 6:20 PM, David Naylor wrote:
> > For me the performance of datetime objects' hashing is sufficient, but I
> > think the python code could use some performance improvements.  Is my
> > approach using a direct computation to type long acceptable (in
> > principle)?  If so, I can refine it and submit a patch.
> 
> Yes, replacing the hash with a faster-to-compute one is fine.  It's
> best performance-wise if you can avoid using Python longs.  As far as
> I know it just needs some random-looking xor-ing and shifting of the
> fields.  Note, of course, that you must carefully satisfy the property
> that for any objects x and y, if "x == y" then "hash(x) == hash(y)".

Below is the patch, and results, for my proposed hash methods for 
datetime.datetime (and easily adaptable to include tzinfo and the other 
datetime objects).  I tried to make the hash safe for both 32bit and 64bit 
systems, and beyond.  

The results are:
# python datetest.py (datetime.py)
hash_unity: 35.83 seconds
hash_unity: 44.60 seconds
hash_datetime: 65.58 seconds
hash_datetime: 53.95 seconds

# python datetest.py
hash_unity: 5.70 seconds
hash_unity: 5.69 seconds
hash_datetime: 4.88 seconds
hash_datetime: 4.90 seconds

# pypy datetest.py
hash_unity: 0.74 seconds
hash_unity: 0.63 seconds
hash_datetime: 11.74 seconds
hash_datetime: 11.47 seconds

# pypy datetest.py (patched datetime.py)
hash_unity: 0.73 seconds
hash_unity: 0.62 seconds
hash_datetime: 0.76 seconds
hash_datetime: 0.64 seconds

So, based on my patch, there is a 7.7x improvement over python and a 17.9x 
improvement over the previous pypy implementation.

If the above approach is acceptable, shall I complete the patch?

Regards
--- /usr/local/pypy-1.6/lib_pypy/datetime.py	2011-08-11 20:39:53.0 +0200
+++ datetime.py	2011-08-25 21:31:56.0 +0200
@@ -17,11 +17,14 @@
 """
 
 import time as _time
+import sys as _sys
 import math as _math
 
 MINYEAR = 1
 MAXYEAR = 9999
 
+_BITS = int(_math.log(_sys.maxint, 2) + 1)
+
 # Utility functions, adapted from Python's Demo/classes/Dates.py, which
 # also assumes the current Gregorian calendar indefinitely extended in
 # both directions.  Difference:  Dates.py calls January 1 of year 0 day
@@ -1796,12 +1799,14 @@
         return base + timedelta(minutes = otoff-myoff)
 
     def __hash__(self):
-        tzoff = self._utcoffset()
-        if tzoff is None:
-            return hash(self.__getstate()[0])
-        days = _ymd2ord(self.year, self.month, self.day)
-        seconds = self.hour * 3600 + (self.minute - tzoff) * 60 + self.second
-        return hash(timedelta(days, seconds, self.microsecond))
+        if _BITS >= 64:
+            return (self.microsecond ^ (self.second << 20) ^
+                    (self.minute << 26) ^ (self.hour << 32) ^ (self.day << 37) ^
+                    (self.month << 42) ^ (self.year << 46))
+        else:
+            return ((self.microsecond ^ (self.hour << 20) ^ (self.day << 25)) ^
+                    (self.second ^ (self.minute << 6) ^ (self.month << 12) ^
+                     (self.year << 19)))
 
     # Pickle support.
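
As a quick sanity check of the property Armin quotes above (if x == y then
hash(x) == hash(y)), something along these lines can be run against the
patched datetime.py; this is a sketch, not part of the patch.  The attachment
that follows is the datetest.py driver used for the timings above.

import datetime
import random

# Spot-check the hash invariant: equal datetimes built independently
# from the same fields must compare equal and hash equal.
for _ in range(1000):
    fields = (random.randint(datetime.MINYEAR, datetime.MAXYEAR),
              random.randint(1, 12), random.randint(1, 28),
              random.randint(0, 23), random.randint(0, 59),
              random.randint(0, 59), random.randint(0, 999999))
    x = datetime.datetime(*fields)
    y = datetime.datetime(*fields)
    assert x == y and hash(x) == hash(y)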
 
import datetime
import random
import time

dates = [
    (random.randint(datetime.MINYEAR, datetime.MAXYEAR),
     random.randint(1, 12),
     random.randint(1, 28),
     random.randint(1, 23),
     random.randint(0, 59),
     random.randint(0, 59),
     random.randint(0, 99)) for i in range(1000)]

def bench(fn):
    def inner(*args, **kwds):
        start = time.clock()
        res = fn(*args, **kwds)
        end = time.clock()
        try:
            name = fn.__name__
        except AttributeError:
            name = repr(fn)
        #
        print '%s: %.2f seconds' % (name, end-start)
        return res
    return inner

class unity(datetime.datetime):
    def __hash__(self):
        return 0

@bench
def hash_unity():
    result = 0
    for date in dates:
        result ^= hash(unity(*date))
    return result

@bench
def hash_datetime():
    result = 0
    for date in dates:
        result ^= hash(datetime.datetime(*date))
    return result

hash_unity()
hash_unity()
hash_datetime()
hash_datetime()




Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-16 Thread Armin Rigo
Hi David,

On Mon, Aug 15, 2011 at 6:20 PM, David Naylor  wrote:
> For me the performance of datetime objects' hashing is sufficient, but I think
> the python code could use some performance improvements.  Is my approach using
> a direct computation to type long acceptable (in principle)?  If so, I can
> refine it and submit a patch.

Yes, replacing the hash with a faster-to-compute one is fine.  It's
best performance-wise if you can avoid using Python longs.  As far as
I know it just needs some random-looking xor-ing and shifting of the
fields.  Note, of course, that you must carefully satisfy the property
that for any objects x and y, if "x == y" then "hash(x) == hash(y)".


A bientôt,

Armin.


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-15 Thread Antonio Cuni

On 15/08/11 15:36, Alex Gaynor wrote:

I'd like to express that, unless we have a very compelling reason, we should
try to keep more stuff in pure python, as opposed to RPython.  Mostly because
it speeds up translation ;) (also easier to test, easier to write, etc.).


or, on the other hand, we should try to improve the JIT so that pure python 
code can be as fast as rpython :-)



Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-15 Thread David Naylor
On Monday, 15 August 2011 09:05:22 Armin Rigo wrote:
> Hi David,
> 
> On Sat, Aug 13, 2011 at 8:14 PM, David Naylor wrote:
> > So, it appears pypy is failing to speed up this contrived example...
> 
> I think that it is expected, because the hash is computed entirely as
> pure Python code in the case of PyPy (doing integer arithmetic with
> overflow checks) whereas it is written in C in the case of RPython.  I
> find it rather good that we get overall close to the speed of CPython,
> given that we just have datetime.py...  Try using datetime.py on top
> of CPython too :-)
> 
> More seriously, if we want to be really better with datetime objects,
> as opposed to "good enough for now", then I fear we need to rewrite
> datetime.py in RPython, like CPython did rewrite it in C.  Or some
> intermediate solution is possible too, having an internal class which
> implements only basic operations (like hash) and which is subclassed
> in pure Python.  But I would happily leave that task to someone with
> an absolutely real need for high-performance datetime objects; for me
> "good enough for now" is good enough, for now...

For me the performance of datetime objects' hashing is sufficient, but I think 
the python code could use some performance improvements.  Is my approach using 
a direct computation to type long acceptable (in principle)?  If so, I can 
refine it and submit a patch.




Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-15 Thread Armin Rigo
Hi Alex,

On Mon, Aug 15, 2011 at 3:36 PM, Alex Gaynor  wrote:
> I'd like to express that, unless we have a very compelling reason, we should
> try to keep more stuff in pure python, as opposed to RPython.  Mostly
> because it speeds up translation ;) (also easier to test, easier to write,
> etc.).

I agree.  I was really talking about someone needing absolutely best
performance for datetime objects.  I'm sure that getting "only
average" performance is more than enough for all use cases apart from
microbenchmarks.


A bientôt,

Armin.


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-15 Thread Alex Gaynor
On Mon, Aug 15, 2011 at 2:05 AM, Armin Rigo  wrote:

> Hi David,
>
> On Sat, Aug 13, 2011 at 8:14 PM, David Naylor wrote:
> > So, it appears pypy is failing to speed up this contrived example...
>
> I think that it is expected, because the hash is computed entirely as
> pure Python code in the case of PyPy (doing integer arithmetic with
> overflow checks) whereas it is written in C in the case of RPython.  I
> find it rather good that we get overall close to the speed of CPython,
> given that we just have datetime.py...  Try using datetime.py on top
> of CPython too :-)
>
> More seriously, if we want to be really better with datetime objects,
> as opposed to "good enough for now", then I fear we need to rewrite
> datetime.py in RPython, like CPython did rewrite it in C.  Or some
> intermediate solution is possible too, having an internal class which
> implements only basic operations (like hash) and which is subclassed
> in pure Python.  But I would happily leave that task to someone with
> an absolutely real need for high-performance datetime objects; for me
> "good enough for now" is good enough, for now...
>
>
> A bientôt,
>
> Armin.

I'd like to express that, unless we have a very compelling reason, we should
try to keep more stuff in pure python, as opposed to RPython.  Mostly
because it speeds up translation ;) (also easier to test, easier to write,
etc.).

Alex

-- 
"I disapprove of what you say, but I will defend to the death your right to
say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-15 Thread Armin Rigo
Hi David,

On Sat, Aug 13, 2011 at 8:14 PM, David Naylor  wrote:
> So, it appears pypy is failing to speed up this contrived example...

I think that it is expected, because the hash is computed entirely as
pure Python code in the case of PyPy (doing integer arithmetic with
overflow checks) whereas it is written in C in the case of RPython.  I
find it rather good that we get overall close to the speed of CPython,
given that we just have datetime.py...  Try using datetime.py on top
of CPython too :-)

More seriously, if we want to be really better with datetime objects,
as opposed to "good enough for now", then I fear we need to rewrite
datetime.py in RPython, like CPython did rewrite it in C.  Or some
intermediate solution is possible too, having an internal class which
implements only basic operations (like hash) and which is subclassed
in pure Python.  But I would happily leave that task to someone with
an absolutely real need for high-performance datetime objects; for me
"good enough for now" is good enough, for now...


A bientôt,

Armin.


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-13 Thread David Naylor
On Saturday, 13 August 2011 18:32:58 Antonio Cuni wrote:
> On 12/08/11 17:49, David Naylor wrote:
> > Would it not be a simple matter of changing the __(get|set)state method
> > to use a tuple or even an int(long)?
> 
> yes, I think it should be enough. I'm going on vacation soon and I won't
> have a look at it right now, so if anybody wants to work on it, he's very
> welcome (hint, hint :-)).

See attached for my naive attempt (I did not run any unit tests on the 
code).  It provides between a 4.5x and 13.4x improvement in hash speed.  If 
method 1 is acceptable I could implement it properly.

If you look at the __hash__ method for datetime you will notice three return 
statements.  The performance of those statements is as follows, 
based on:

@bench.bench
def hashdate():
    res = 0
    for i in range(1000):
        now = datetime.datetime(i // 1 + 1, (i % 1) % 12 + 1, (i % 100) % 28 + 1)
        res ^= hash(now)
    return res

hashdate()

Method 1 (direct integer compute):
hashdate: 0.70 seconds

Method 2 (hash of __getstate()):
hashdate: 2.39 seconds

Method 3 (unity):
hashdate: 0.68 seconds

Method 4 (original):
hashdate: 10.93 seconds (python: 12.60 seconds)

And back to my original "benchmark" with the change of `key = i`:

# python iforkey.py
ifdict: [2.8676719665527344, 2.872897148132324, 2.8396730422973633]
keydict: [2.3266799449920654, 2.3431849479675293, 2.3421859741210938]
defaultdict: [3.706634044647217, 3.6940698623657227, 3.7520179748535156]

# pypy iforkey.py (original)
ifdict: [29.201794147491455, 29.047310829162598, 29.34461998939514]
keydict: [14.939809083938599, 15.250468015670776, 15.542209148406982]
defaultdict: [15.11891484260559, 15.064191102981567, 14.94817304611206]

# pypy iforkey.py (method 1)
ifdict: [7.455403804779053, 7.376722097396851, 7.447360038757324]
keydict: [3.9056499004364014, 3.833178997039795, 3.8482401371002197]
defaultdict: [3.9568910598754883, 3.8757669925689697, 3.8843569742]

# pypy iforkey.py (method 2)
ifdict: [11.993246078491211, 11.865861892700195, 11.916783094406128]
keydict: [6.141685962677002, 6.092236042022705, 6.082683086395264]
defaultdict: [6.376708030700684, 6.337490081787109, 6.361854791641235]

So, it appears pypy is failing to speed up this contrived example...
--- datetime.py	2011-08-11 20:39:53.0 +0200
+++ /home/DragonSA/datetime.py	2011-08-13 19:48:49.0 +0200
@@ -13,7 +13,7 @@
 Sources for time zone and DST data: http://www.twinsun.com/tz/tz-link.htm
 
 This was originally copied from the sandbox of the CPython CVS repository.
-Thanks to Tim Peters for suggesting using it. 
+Thanks to Tim Peters for suggesting using it.
 """
 
 import time as _time
@@ -742,11 +742,6 @@
 
         year, month, day (required, base 1)
         """
-        if isinstance(year, str):
-            # Pickle support
-            self = object.__new__(cls)
-            self.__setstate(year)
-            return self
         _check_date_fields(year, month, day)
         self = object.__new__(cls)
         self.__year = year
@@ -986,14 +981,7 @@
     __safe_for_unpickling__ = True  # For Python 2.2
 
     def __getstate(self):
-        yhi, ylo = divmod(self.__year, 256)
-        return ("%c%c%c%c" % (yhi, ylo, self.__month, self.__day), )
-
-    def __setstate(self, string):
-        if len(string) != 4 or not (1 <= ord(string[2]) <= 12):
-            raise TypeError("not enough arguments")
-        yhi, ylo, self.__month, self.__day = map(ord, string)
-        self.__year = yhi * 256 + ylo
+        return (self.__year, self.__month, self.__day)
 
     def __reduce__(self):
         return (self.__class__, self.__getstate())
@@ -1112,10 +1100,6 @@
         tzinfo (default to None)
         """
         self = object.__new__(cls)
-        if isinstance(hour, str):
-            # Pickle support
-            self.__setstate(hour, minute or None)
-            return self
         _check_tzinfo_arg(tzinfo)
         _check_time_fields(hour, minute, second, microsecond)
         self.__hour = hour
@@ -1201,13 +1185,7 @@
 
     def __hash__(self):
         """Hash."""
-        tzoff = self._utcoffset()
-        if not tzoff: # zero or None
-            return hash(self.__getstate()[0])
-        h, m = divmod(self.hour * 60 + self.minute - tzoff, 60)
-        if 0 <= h < 24:
-            return hash(time(h, m, self.second, self.microsecond))
-        return hash((h, m, self.second, self.microsecond))
+        return hash(self.__getstate())
 
     # Conversion to string
 
@@ -1351,22 +1329,7 @@
     __safe_for_unpickling__ = True  # For Python 2.2
 
     def __getstate(self):
-        us2, us3 = divmod(self.__microsecond, 256)
-        us1, us2 = divmod(us2, 256)
-        basestate = ("%c" * 6) % (self.__hour, self.__minute, self.__second,
-                                  us1, us2, us3)
-        if self._tzinfo is None:
-            return (basestate,)
-        else:
-            return (basestate, self._tzinfo)
-
-    def __set

Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-13 Thread Antonio Cuni

On 12/08/11 17:49, David Naylor wrote:


Would it not be a simple matter of changing the __(get|set)state method to use
a tuple or even an int(long)?


yes, I think it should be enough. I'm going on vacation soon and I won't have 
a look at it right now, so if anybody wants to work on it, he's very welcome 
(hint, hint :-)).


ciao,
Anto


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-12 Thread David Naylor
On Friday, 12 August 2011 14:51:36 Antonio Cuni wrote:
> Hello David,
> 
> On 10/08/11 21:27, David Naylor wrote:
> > Hi,
> > 
> > I needed to create a cache of date and time objects and I wondered what
> > was the best way to handle the cache.  For comparison I put together
> > the following test:
> [cut]
> 
> > Pypy displays significant slowdown in the defaultdict function, otherwise
> > displays its usual speedup.  To check what is the cause I replaced
> > i.date() with i.day and found no major difference in times.  It appears
> > dict.setdefault (or its interaction with the jit) is causing a slowdown.
> 
> I don't think that setdefault is the culprit here, as shown by this
> benchmark:

I made a mistake in the script: I intended to use the day number as the key 
(as in the other tests).  Changing the critical line to 
"b.append(cache.setdefault(i.day, i.date()))" gives results comparable to 
python (factoring in the speedup).



> as you can see, in PyPy there is almost no difference between using a
> try/except or using setdefault.

I would argue that with an expensive value operation the try/except will be 
much faster than setdefault (whenever there are cache hits).  Using my fixed 
script I get:

keydict: [0.3298788070678711, 0.28450703620910645, 0.28931379318237305]
defaultdict: [0.47180604934692383, 0.4183311462402344, 0.4172670841217041]
 
indicating setdefault is about 40% slower (a factor of 1.4), which I 
attribute to the cost of i.date().
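
The underlying reason is that dict.setdefault, like any call, evaluates its
second argument before the lookup, so the value is computed even on a cache
hit, while try/except computes it only on a miss.  A minimal illustration,
with a hypothetical expensive() standing in for i.date():

calls = [0]

def expensive(i):
    # Stand-in for i.date(); counts how often the value is computed.
    calls[0] += 1
    return i * i

keys = [1, 2, 1, 2, 1, 2]

cache = {}
for i in keys:
    cache.setdefault(i, expensive(i))
print(calls[0])   # 6: evaluated on every iteration, hits included

calls[0] = 0
cache = {}
for i in keys:
    try:
        cache[i]
    except KeyError:
        cache[i] = expensive(i)
print(calls[0])   # 2: evaluated only on the two misses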

> I had a quick look at the code (which in PyPy is written at applevel) and
> it does a lot of nonsense.  In particular, __hash__ calls __getstate which
> formats a dynamically created string, just to call hash() on it.  I
> suppose that this code can (and should) be optimized a lot.  I may try to
> look at it but it's unclear when, since I'm about to go on vacation.

Would it not be a simple matter of changing the __(get|set)state method to use 
a tuple or even an int(long)?  




Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-12 Thread Antonio Cuni

On 12/08/11 14:51, Antonio Cuni wrote:

@bench.bench


for reference, here is the implementation of the "bench" decorator:
https://bitbucket.org/antocuni/env/src/1b11491fab79/pypath/bench.py


Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]

2011-08-12 Thread Antonio Cuni

Hello David,

On 10/08/11 21:27, David Naylor wrote:

Hi,

I needed to create a cache of date and time objects and I wondered what was the 
best way to handle the cache.  For comparison I put together
the following test:


[cut]

Pypy displays significant slowdown in the defaultdict function, otherwise 
displays its usual speedup.  To check what is the cause I replaced i.date()
with i.day and found no major difference in times.  It appears dict.setdefault 
(or its interaction with the jit) is causing a slowdown.


I don't think that setdefault is the culprit here, as shown by this benchmark:

@bench.bench
def setdef():
    d = {}
    for i in range(1000):
        d.setdefault(i, i)
    return d

@bench.bench
def tryexcept():
    d = {}
    for i in range(1000):
        try:
            d[i]
        except KeyError:
            d[i] = i
    return d

setdef()
tryexcept()

$ python dictbench.py
setdef: 2.03 seconds
tryexcept: 8.54 seconds

tmp $ pypy-c dictbench.py
setdef: 1.31 seconds
tryexcept: 1.37 seconds

as you can see, in PyPy there is almost no difference between using a 
try/except or using setdefault.



What is very slow on PyPy seems to be hashing datetime objects:

import datetime

@bench.bench
def hashdate():
    res = 0
    for i in range(10):
        now = datetime.datetime.now()
        res ^= hash(now)
    return res

hashdate()

$ pypy-c dictbench.py
hashdate: 0.83 seconds

$ python dictbench.py
hashdate: 0.22 seconds

I had a quick look at the code (which in PyPy is written at applevel) and it 
does a lot of nonsense.  In particular, __hash__ calls __getstate which 
formats a dynamically created string, just to call hash() on it.  I suppose 
that this code can (and should) be optimized a lot.  I may try to look at it 
but it's unclear when, since I'm about to go on vacation.
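
To make the overhead concrete, here is a rough micro-comparison of the two
styles (a sketch, not the actual datetime.py code): formatting a throwaway
state string from the fields and hashing it, versus mixing the integer
fields directly.

import timeit

setup = "y, m, d = 2011, 8, 12"
# Old style: build a fresh string from the fields, then hash it.
pack = "hash('%c%c%c%c' % (y // 256, y % 256, m, d))"
# Direct style: combine the fields with shifts and xors.
mix = "y ^ (m << 16) ^ (d << 21)"

print(timeit.timeit(pack, setup))
print(timeit.timeit(mix, setup))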


ciao,
Anto