Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On Friday, 26 August 2011 06:37:30 Armin Rigo wrote: > Hi David, > > On Thu, Aug 25, 2011 at 9:44 PM, David Naylor wrote: > > Below is the patch, and results, for my proposed hash methods for > > datetime.datetime (and easily adaptable to include tzinfo and the other > > datetime objects). I tried to make the hash safe for both 32bit and 64bit > > systems, and beyond. > > Yes, the patch looks good to me. I can definitely see how it can be a > huge improvement in performance :-) > > If you can also "fix" the other __hash__ methods in the same way, it > would be great. To follow up on a very old email. The latest results are: # python2.7 iforkey.py ifdict: [2.110611915588379, 2.12678599357605, 2.1126320362091064] keydict: [2.1322460174560547, 2.098900079727173, 2.0998198986053467] defaultdict: [3.184070110321045, 3.2007319927215576, 3.188380002975464] # pypy2.2 iforkey.py ifdict: [0.510915994644165, 0.23750996589660645, 0.2241990566253662] keydict: [0.23270201683044434, 0.18279695510864258, 0.18002104759216309] defaultdict: [3.4535930156707764, 3.3697848320007324, 3.392897129058838] And using the latest datetime.py: pypy iforkey.py ifdict: [0.2814958095550537, 0.23425602912902832, 0.2206539916992] keydict: [0.23637700080871582, 0.18506789207458496, 0.1831810474395752] defaultdict: [2.8174121379852295, 2.74626088142395, 2.7308008670806885] Excellent, thank you :-) Regards signature.asc Description: This is a digitally signed message part. ___ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
Hi David, On Thu, Aug 25, 2011 at 9:44 PM, David Naylor wrote: > Below is the patch, and results, for my proposed hash methods for > datetime.datetime (and easily adaptable to include tzinfo and the other > datetime objects). I tried to make the hash safe for both 32bit and 64bit > systems, and beyond. Yes, the patch looks good to me. I can definitely see how it can be a huge improvement in performance :-) If you can also "fix" the other __hash__ methods in the same way, it would be great. A bientôt, Armin. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On Tuesday, 16 August 2011 15:27:30 Armin Rigo wrote: > Hi David, > > On Mon, Aug 15, 2011 at 6:20 PM, David Naylor wrote: > > For me the performance of datetime object's hashing is sufficient but I > > think the python code could use some performance improvements. Is my > > approach using a direct computation to type long acceptable (in > > principle). If so I can refine it and submit a patch. > > Yes, replacing the hash with a faster-to-compute one is fine. It's > best performance-wise if you can avoid using Python longs. As far as > I know it just needs some random-looking xor-ing and shifting of the > fields. Note, of course, that you must carefully satisfy the property > that for any objects x and y, if "x == y" then "hash(x) == hash(y)". Below is the patch, and results, for my proposed hash methods for datetime.datetime (and easily adaptable to include tzinfo and the other datetime objects). I tried to make the hash safe for both 32bit and 64bit systems, and beyond. The results are: # python datetest.py (datetime.py) hash_unity: 35.83 seconds hash_unity: 44.60 seconds hash_datetime: 65.58 seconds hash_datetime: 53.95 seconds # python datetest.py hash_unity: 5.70 seconds hash_unity: 5.69 seconds hash_datetime: 4.88 seconds hash_datetime: 4.90 seconds # pypy datetest.py hash_unity: 0.74 seconds hash_unity: 0.63 seconds hash_datetime: 11.74 seconds hash_datetime: 11.47 seconds # pypy datetest.py (patched datetime.py) hash_unity: 0.73 seconds hash_unity: 0.62 seconds hash_datetime: 0.76 seconds hash_datetime: 0.64 seconds So, based on my patch there is a 7.7x improvement over python and a 17.9x improvement over the previous pypy implementation. If the above approach is acceptable I will complete the patch? Regards --- /usr/local/pypy-1.6/lib_pypy/datetime.py 2011-08-11 20:39:53.0 +0200 +++ datetime.py 2011-08-25 21:31:56.0 +0200 @@ -17,11 +17,14 @@ """ import time as _time +import sys as _sys import math as _math MINYEAR = 1 MAXYEAR = +_BITS = int(_math.log(_sys.maxint, 2) + 1) + # Utility functions, adapted from Python's Demo/classes/Dates.py, which # also assumes the current Gregorian calendar indefinitely extended in # both directions. Difference: Dates.py calls January 1 of year 0 day @@ -1796,12 +1799,14 @@ return base + timedelta(minutes = otoff-myoff) def __hash__(self): -tzoff = self._utcoffset() -if tzoff is None: -return hash(self.__getstate()[0]) -days = _ymd2ord(self.year, self.month, self.day) -seconds = self.hour * 3600 + (self.minute - tzoff) * 60 + self.second -return hash(timedelta(days, seconds, self.microsecond)) +if _BITS >= 64: +return (self.microsecond ^ (self.second << 20) ^ +(self.minute << 26) ^ (self.hour << 32) ^ (self.day << 37) ^ +(self.month << 42) ^ (self.year << 46)) +else: +return ((self.microsecond ^ (self.hour << 20) ^ (self.day << 25)) ^ +(self.second ^ (self.minute << 6) ^ (self.month << 12) ^ + (self.year << 19))) # Pickle support. import datetime import random import time dates = [ (random.randint(datetime.MINYEAR, datetime.MAXYEAR), random.randint(1, 12), random.randint(1, 28), random.randint(1, 23), random.randint(0, 59), random.randint(0, 59), random.randint(0, 99)) for i in range(1000)] def bench(fn): def inner(*args, **kwds): start = time.clock() res = fn(*args, **kwds) end = time.clock() try: name = fn.__name__ except AttributeError: name = repr(fn) # print '%s: %.2f seconds' % (name, end-start) return res return inner class unity(datetime.datetime): def __hash__(self): return 0 @bench def hash_unity(): result = 0 for date in dates: result ^= hash(unity(*date)) return result @bench def hash_datetime(): result = 0 for date in dates: result ^= hash(datetime.datetime(*date)) return result hash_unity() hash_unity() hash_datetime() hash_datetime() signature.asc Description: This is a digitally signed message part. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
Hi David, On Mon, Aug 15, 2011 at 6:20 PM, David Naylor wrote: > For me the performance of datetime object's hashing is sufficient but I think > the python code could use some performance improvements. Is my approach using > a direct computation to type long acceptable (in principle). If so I can > refine it and submit a patch. Yes, replacing the hash with a faster-to-compute one is fine. It's best performance-wise if you can avoid using Python longs. As far as I know it just needs some random-looking xor-ing and shifting of the fields. Note, of course, that you must carefully satisfy the property that for any objects x and y, if "x == y" then "hash(x) == hash(y)". A bientôt, Armin. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On 15/08/11 15:36, Alex Gaynor wrote: I'd like to express that, unless we have a very compelling reason, we should try to keep more stuff in pure python, as opposed to RPython. Mostly because it speeds up translation ;) (also easier to test, easier to write, etc.). or, on the other hand, we should try to improve the JIT so that pure python code can be as fast as rpython :-) ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On Monday, 15 August 2011 09:05:22 Armin Rigo wrote: > Hi David, > > On Sat, Aug 13, 2011 at 8:14 PM, David Naylor wrote: > > So, it appears pypy is failing to speed up this contrived example... > > I think that it is expected, because the hash is computed entirely as > pure Python code in the case of PyPy (doing integer arithmetic with > overflow checks) whereas it is written in C in the case of RPython. I > find it rather good that we get overall close to the speed of CPython, > given that we just have datetime.py... Try using datetime.py on top > of CPython too :-) > > More seriously, if we want to be really better with datetime objects, > as opposed to "good enough for now", then I fear we need to rewrite > datetime.py in RPython, like CPython did rewrite it in C. Or some > intermediate solution is possible too, having an internal class which > implements only basic operations (like hash) and which is subclassed > in pure Python. But I would happily leave that task to someone with > an absolutely real need for high-performance datetime objects; for me > "good enough for now" is good enough, for now... For me the performance of datetime object's hashing is sufficient but I think the python code could use some performance improvements. Is my approach using a direct computation to type long acceptable (in principle). If so I can refine it and submit a patch. signature.asc Description: This is a digitally signed message part. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
Hi Alex, On Mon, Aug 15, 2011 at 3:36 PM, Alex Gaynor wrote: > I'd like to express that, unless we have a very compelling reason, we should > try to keep more stuff in pure python, as opposed to RPython. Mostly > because it speeds up translation ;) (also easier to test, easier to write, > etc.). I agree. I was really talking about someone needing absolutely best performance for datetime objects. I'm sure that getting "only average" performance is more than enough for all use cases apart from microbenchmarks. A bientôt, Armin. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On Mon, Aug 15, 2011 at 2:05 AM, Armin Rigo wrote: > Hi David, > > On Sat, Aug 13, 2011 at 8:14 PM, David Naylor > wrote: > > So, it appears pypy is failing to speed up this contrived example... > > I think that it is expected, because the hash is computed entirely as > pure Python code in the case of PyPy (doing integer arithmetic with > overflow checks) whereas it is written in C in the case of RPython. I > find it rather good that we get overall close to the speed of CPython, > given that we just have datetime.py... Try using datetime.py on top > of CPython too :-) > > More seriously, if we want to be really better with datetime objects, > as opposed to "good enough for now", then I fear we need to rewrite > datetime.py in RPython, like CPython did rewrite it in C. Or some > intermediate solution is possible too, having an internal class which > implements only basic operations (like hash) and which is subclassed > in pure Python. But I would happily leave that task to someone with > an absolutely real need for high-performance datetime objects; for me > "good enough for now" is good enough, for now... > > > A bientôt, > > Armin. > ___ > pypy-dev mailing list > pypy-dev@python.org > http://mail.python.org/mailman/listinfo/pypy-dev > I'd like to express that, unless we have a very compelling reason, we should try to keep more stuff in pure python, as opposed to RPython. Mostly because it speeds up translation ;) (also easier to test, easier to write, etc.). Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
Hi David, On Sat, Aug 13, 2011 at 8:14 PM, David Naylor wrote: > So, it appears pypy is failing to speed up this contrived example... I think that it is expected, because the hash is computed entirely as pure Python code in the case of PyPy (doing integer arithmetic with overflow checks) whereas it is written in C in the case of RPython. I find it rather good that we get overall close to the speed of CPython, given that we just have datetime.py... Try using datetime.py on top of CPython too :-) More seriously, if we want to be really better with datetime objects, as opposed to "good enough for now", then I fear we need to rewrite datetime.py in RPython, like CPython did rewrite it in C. Or some intermediate solution is possible too, having an internal class which implements only basic operations (like hash) and which is subclassed in pure Python. But I would happily leave that task to someone with an absolutely real need for high-performance datetime objects; for me "good enough for now" is good enough, for now... A bientôt, Armin. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On Saturday, 13 August 2011 18:32:58 Antonio Cuni wrote: > On 12/08/11 17:49, David Naylor wrote: > > Would it not be a simple matter of changing the __(get|set)state method > > to use a tuple or even an int(long)? > > yes, I think it should be enough. I'm going on vacation soon and I won't > have a look at it right now, so if anybody wants to work on it, he's very > welcome (hint, hint :-)). See attached for my naive attempt (and I did not run any unit tests on the code). It provides between 4.5x to 13.4x improvement in hash speed. If method 1 is acceptable I could properly implement it. If you look at the __hash__ method for datetime you will notice three return statements. The performance of those statements are as follows, based on: @bench.bench def hashdate(): res = 0 for i in range(1000): now = datetime.datetime(i // 1 + 1, (i % 1) % 12 + 1, (i % 100) % 28 + 1) res ^= hash(now) return res hashdate() Method 1 (direct integer compute): hashdate: 0.70 seconds Method 2 (hash of __getstate()): hashdate: 2.39 seconds Method 3 (unity): hashdate: 0.68 seconds Method 4 (original): hashdate: 10.93 seconds (python: 12.60 seconds) And back to my original "benchmark" with the change of `key = i`: # python iforkey.py ifdict: [2.8676719665527344, 2.872897148132324, 2.8396730422973633] keydict: [2.3266799449920654, 2.3431849479675293, 2.3421859741210938] defaultdict: [3.706634044647217, 3.6940698623657227, 3.7520179748535156] # pypy iforkey.py (original) ifdict: [29.201794147491455, 29.047310829162598, 29.34461998939514] keydict: [14.939809083938599, 15.250468015670776, 15.542209148406982] defaultdict: [15.11891484260559, 15.064191102981567, 14.94817304611206] # pypy iforkey (method 1) ifdict: [7.455403804779053, 7.376722097396851, 7.447360038757324] keydict: [3.9056499004364014, 3.833178997039795, 3.8482401371002197] defaultdict: [3.9568910598754883, 3.8757669925689697, 3.8843569742] # pypy iforkey.py (method 2) ifdict: [11.993246078491211, 11.865861892700195, 11.916783094406128] keydict: [6.141685962677002, 6.092236042022705, 6.082683086395264] defaultdict: [6.376708030700684, 6.337490081787109, 6.361854791641235] So, it appears pypy is failing to speed up this contrived example... --- datetime.py 2011-08-11 20:39:53.0 +0200 +++ /home/DragonSA/datetime.py 2011-08-13 19:48:49.0 +0200 @@ -13,7 +13,7 @@ Sources for time zone and DST data: http://www.twinsun.com/tz/tz-link.htm This was originally copied from the sandbox of the CPython CVS repository. -Thanks to Tim Peters for suggesting using it. +Thanks to Tim Peters for suggesting using it. """ import time as _time @@ -742,11 +742,6 @@ year, month, day (required, base 1) """ -if isinstance(year, str): -# Pickle support -self = object.__new__(cls) -self.__setstate(year) -return self _check_date_fields(year, month, day) self = object.__new__(cls) self.__year = year @@ -986,14 +981,7 @@ __safe_for_unpickling__ = True # For Python 2.2 def __getstate(self): -yhi, ylo = divmod(self.__year, 256) -return ("%c%c%c%c" % (yhi, ylo, self.__month, self.__day), ) - -def __setstate(self, string): -if len(string) != 4 or not (1 <= ord(string[2]) <= 12): -raise TypeError("not enough arguments") -yhi, ylo, self.__month, self.__day = map(ord, string) -self.__year = yhi * 256 + ylo +return (self.__year, self.__month, self.__day) def __reduce__(self): return (self.__class__, self.__getstate()) @@ -1112,10 +1100,6 @@ tzinfo (default to None) """ self = object.__new__(cls) -if isinstance(hour, str): -# Pickle support -self.__setstate(hour, minute or None) -return self _check_tzinfo_arg(tzinfo) _check_time_fields(hour, minute, second, microsecond) self.__hour = hour @@ -1201,13 +1185,7 @@ def __hash__(self): """Hash.""" -tzoff = self._utcoffset() -if not tzoff: # zero or None -return hash(self.__getstate()[0]) -h, m = divmod(self.hour * 60 + self.minute - tzoff, 60) -if 0 <= h < 24: -return hash(time(h, m, self.second, self.microsecond)) -return hash((h, m, self.second, self.microsecond)) +return hash(self.__getstate()) # Conversion to string @@ -1351,22 +1329,7 @@ __safe_for_unpickling__ = True # For Python 2.2 def __getstate(self): -us2, us3 = divmod(self.__microsecond, 256) -us1, us2 = divmod(us2, 256) -basestate = ("%c" * 6) % (self.__hour, self.__minute, self.__second, - us1, us2, us3) -if self._tzinfo is None: -return (basestate,) -else: -return (basestate, self._tzinfo) - -def __set
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On 12/08/11 17:49, David Naylor wrote: Would it not be a simple matter of changing the __(get|set)state method to use a tuple or even an int(long)? yes, I think it should be enough. I'm going on vacation soon and I won't have a look at it right now, so if anybody wants to work on it, he's very welcome (hint, hint :-)). ciao, Anto ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On Friday, 12 August 2011 14:51:36 Antonio Cuni wrote: > Hello David, > > On 10/08/11 21:27, David Naylor wrote: > > Hi, > > > > I needed to create a cache of date and time objects and I wondered what > > was the best way to handle the cache. For comparison I put together > > > the following test: > [cut] > > > Pypy displays significant slowdown in the defaultdict function, otherwise > > displays its usual speedup. To check what is the cause I replaced > > i.date() with i.day and found no major difference in times. It appears > > dict.setdefault (or it's interaction with jit) is causing a slow down. > > I don't think that setdefault is the culprit here, as shown by this > benchmark: I made a mistake in the script, I intended to use the day number as the key (as was in the case of the other tests). Changing the critical line to "b.append(cache.setdefault(i.day, i.date()))" gives comparable results to python (factoring in the speed up). > as you can see, in PyPy there is almost no difference between using a > try/except or using setdefault. I would argue that with an expensive value operation the try/except will be much faster than setdefault (where cache hits are above 0). Using my fixed script I get: keydict: [0.3298788070678711, 0.28450703620910645, 0.28931379318237305] defaultdict: [0.47180604934692383, 0.4183311462402344, 0.4172670841217041] indicating setdefault is about 40% slower (1.4 times slower), which I attribute to the cost of i.date(). > I had a quick look at the code (which in PyPy is written at applevel) and > it does a lot of nonsense. In particular, __hash__ calls __getstate which > formats a dynamically created string, just to call hash() on it. I > suppose that this code can (and should) be optimized a lot. I may try to > look at it but it's unclear when, since I'm about to go on vacation. Would it not be a simple matter of changing the __(get|set)state method to use a tuple or even an int(long)? signature.asc Description: This is a digitally signed message part. ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
On 12/08/11 14:51, Antonio Cuni wrote: @bench.bench for reference, here is the implementation of the "bench" decorator: https://bitbucket.org/antocuni/env/src/1b11491fab79/pypath/bench.py ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
Re: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault]
Hello David, On 10/08/11 21:27, David Naylor wrote: Hi, I needed to create a cache of date and time objects and I wondered what was the best way to handle the cache. For comparison I put together the following test: [cut] Pypy displays significant slowdown in the defaultdict function, otherwise displays its usual speedup. To check what is the cause I replaced i.date() with i.day and found no major difference in times. It appears dict.setdefault (or it's interaction with jit) is causing a slow down. I don't think that setdefault is the culprit here, as shown by this benchmark: @bench.bench def setdef(): d = {} for i in range(1000): d.setdefault(i, i) return d @bench.bench def tryexcept(): d = {} for i in range(1000): try: d[i] except KeyError: d[i] = i return d setdef() tryexcept() $ python dictbench.py setdef: 2.03 seconds tryexcept: 8.54 seconds tmp $ pypy-c dictbench.py setdef: 1.31 seconds tryexcept: 1.37 seconds as you can see, in PyPy there is almost no difference between using a try/except or using setdefault. What is very slow on PyPy seems to be hashing datetime objects: import datetime @bench.bench def hashdate(): res = 0 for i in range(10): now = datetime.datetime.now() res ^= hash(now) return res hashdate() $ pypy-c dictbench.py hashdate: 0.83 seconds $ python dictbench.py hashdate: 0.22 seconds I had a quick look at the code (which in PyPy is written at applevel) and it does a lot of nonsense. In particular, __hash__ calls __getstate which formats a dynamically created string, just to call hash() on it. I suppose that this code can (and should) be optimized a lot. I may try to look at it but it's unclear when, since I'm about to go on vacation. ciao, Anto ___ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev