Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
> On 8 Oct 2022, at 11:50, Weatherby,Gerard wrote:
>
> Logging does support passing a callable, if indirectly. It only calls
> __str__ on the object passed if debugging is enabled.
>
> class Defer:
>
>     def __init__(self, fn):
>         self.fn = fn
>
>     def __str__(self):
>         return self.fn()
>
> def some_expensive_function():
>     return "hello"
>
> logging.basicConfig()
> logging.debug(Defer(some_expensive_function))

Oh what a clever hack. Took a few minutes of code reading to see why this
works. You are exploiting the str(msg) in class LogRecord's getMessage():

```
def getMessage(self):
    """
    Return the message for this LogRecord.

    Return the message for this LogRecord after merging any user-supplied
    arguments with the message.
    """
    msg = str(self.msg)
    if self.args:
        msg = msg % self.args
    return msg
```

Barry
--
https://mail.python.org/mailman/listinfo/python-list
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
Logging does support passing a callable, if indirectly. It only calls
__str__ on the object passed if debugging is enabled.

class Defer:

    def __init__(self, fn):
        self.fn = fn

    def __str__(self):
        return self.fn()

def some_expensive_function():
    return "hello"

logging.basicConfig()
logging.debug(Defer(some_expensive_function))
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
On Fri, 7 Oct 2022 18:28:06 +0100, Barry wrote:

>> Letting you pass in a callable to call might help because then you
>> could use lambda.
>
> Yep, that's the obvious way to avoid expensive log data generation.
> Would need the logging module to support that use case.

I have some logging code that uses eval() to evaluate expressions using
locals and globals in a parent stack frame, together with a parser to find
`{...}` items in a string. I guess this constitutes a (basic) runtime
implementation of f-strings. As such it can avoid expensive
evaluation/parsing when disabled, though it's probably slow when enabled
compared to native f-strings.

It seems to work quite well in practice, and it also allows one to add some
extra formatting features. For details see:

https://git.ghostscript.com/?p=mupdf.git;a=blob;f=scripts/jlib.py;h=e85e9f2c4;hb=HEAD#l41

- Jules

--
http://op59.net
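The idea Jules describes can be sketched in a few lines. This is a minimal illustration of the technique, not the jlib.py code: `lazy_debug` is a hypothetical name, and the regex handles only simple `{expr}` items.

```python
import inspect
import logging
import re

log = logging.getLogger(__name__)

def lazy_debug(fmt):
    """Log fmt at DEBUG level, expanding {expr} items via eval() in the
    caller's stack frame -- but only when DEBUG is enabled, so disabled
    logging costs almost nothing. Returns the expanded message, or None
    when logging is disabled."""
    if not log.isEnabledFor(logging.DEBUG):
        return None                       # skip all parsing/evaluation
    frame = inspect.currentframe().f_back  # the caller's frame
    try:
        def repl(match):
            # Evaluate the expression with the caller's names visible.
            return str(eval(match.group(1), frame.f_globals, frame.f_locals))
        message = re.sub(r"\{([^{}]+)\}", repl, fmt)
    finally:
        del frame                          # avoid a reference cycle
    log.debug(message)
    return message
```

When the level is disabled, neither the parse nor any `{expr}` is evaluated, which is the whole point.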
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
> On 7 Oct 2022, at 19:09, Weatherby,Gerard wrote:
>
> The obvious way to avoid log generation is:
>
>     if logger.isEnabledFor(logging.DEBUG):
>         logger.debug( expensive processing )
>
> Of course, having logging alter program flow could lead to hard to debug
> bugs.

Altered flow is less of an issue than the verbosity of the above.

We discussed ways to improve this pattern a few years ago. That led to no
changes.

What I have used is a class that defines __bool__ to report whether logging
is enabled and __call__ to log. Then you can do this:

    log_debug = logger_from(DEBUG)

    log_debug and log_debug('expensive %s' % (complex(),))

Barry
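Barry's `__bool__`/`__call__` class might look like the following sketch. The class name and constructor are assumptions (his `logger_from` factory isn't shown in the thread):

```python
import logging

class LazyLog:
    """Sketch of the pattern Barry describes: truthy only when the level
    is enabled, callable to emit a record. Names are assumed, not his."""
    def __init__(self, logger, level):
        self.logger = logger
        self.level = level
    def __bool__(self):
        # True only if a record at self.level would actually be processed.
        return self.logger.isEnabledFor(self.level)
    def __call__(self, msg, *args):
        self.logger.log(self.level, msg, *args)

logging.basicConfig(level=logging.INFO)
log_debug = LazyLog(logging.getLogger(__name__), logging.DEBUG)

# Short-circuit: the right-hand side never runs while DEBUG is disabled.
log_debug and log_debug('expensive %s' % (sum(range(10000)),))
```

The `and` short-circuit is what keeps the expensive expression from being evaluated at all when the level is off.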
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
The obvious way to avoid log generation is:

    if logger.isEnabledFor(logging.DEBUG):
        logger.debug( expensive processing )

Of course, having logging alter program flow could lead to hard to debug
bugs.
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
> On 7 Oct 2022, at 18:16, MRAB wrote:
>
> Letting you pass in a callable to call might help because then you could
> use lambda.

Yep, that's the obvious way to avoid expensive log data generation.
Would need the logging module to support that use case.

Barry
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
On 2022-10-07 16:45, Skip Montanaro wrote:
> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames wrote:
>> 1. The culprit was me. As lazy as I am, I have used f-strings all over
>> the place in calls to `logging.logger.debug()` and friends, evaluating
>> all arguments regardless of whether the logger was enabled or not.
>
> I thought there was some discussion about whether and how to efficiently
> admit f-strings to the logging package. I'm guessing that's not gone
> anywhere (yet).

Letting you pass in a callable to call might help because then you could
use lambda.
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
> On 7 Oct 2022, at 16:48, Skip Montanaro wrote:
>
> I thought there was some discussion about whether and how to efficiently
> admit f-strings to the logging package. I'm guessing that's not gone
> anywhere (yet).

That cannot be done, as the f-string is computed before the log call.

Maybe you are thinking of the lazy-expression idea for this. That idea
seems to have gone nowhere, as it's not clear how to implement it without
performance issues.

Barry
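Barry's point is easy to demonstrate: the f-string is an ordinary expression, fully evaluated before `logging.debug()` even receives it (the counter below is added to make that visible):

```python
import logging

logging.basicConfig(level=logging.INFO)   # DEBUG records are discarded

count = 0

def expensive():
    global count
    count += 1   # instrumentation for this demonstration
    return "data"

# The f-string is evaluated *before* logging.debug() is entered, so
# expensive() runs even though the resulting record is thrown away:
logging.debug(f"value: {expensive()}")
```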
Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
Dang autocorrect. Subject first word was supposed to be "f-strings" not
"ref-strings." Sorry about that.

S
Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)
On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames wrote:

> 1. The culprit was me. As lazy as I am, I have used f-strings all over
> the place in calls to `logging.logger.debug()` and friends, evaluating
> all arguments regardless of whether the logger was enabled or not.

I thought there was some discussion about whether and how to efficiently
admit f-strings to the logging package. I'm guessing that's not gone
anywhere (yet).

Skip
Re: Performance issue with CPython 3.10 + Cython
Answering to myself, just for the record:

1. The culprit was me. As lazy as I am, I have used f-strings all over the
place in calls to `logging.logger.debug()` and friends, evaluating all
arguments regardless of whether the logger was enabled or not. Replacing
these f-strings with regular printf-like format strings solved the issue.
Now the application bowls along happily, consistently below 0.02 seconds
per second of application time.

2. Valgrind + callgrind is an awesome toolchain for spotting performance
issues, even on VMs.

On Tue, 4 Oct 2022 at 11:05, Andreas Ames <andreas.0815.qwe...@gmail.com>
wrote:

> Hi all,
> [...]
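The before/after of Andreas's fix can be sketched as follows. Note the caveat: %-style logging still evaluates the *arguments*; what it defers is the formatting (`msg % args`), which only happens if a handler emits the record. The `CountingRepr` class is invented here to make that visible:

```python
import logging

logging.basicConfig(level=logging.INFO)   # DEBUG disabled
logger = logging.getLogger(__name__)

class CountingRepr:
    """Counts repr() calls so we can see when formatting happens
    (a demonstration helper, not part of the original code)."""
    def __init__(self):
        self.n = 0
    def __repr__(self):
        self.n += 1
        return "<state>"

state = CountingRepr()

# Eager: the f-string calls repr() right away, enabled or not.
logger.debug(f"state = {state!r}")

# Deferred: logging stores the template and argument on the LogRecord and
# only runs `msg % args` if a handler actually emits the record, so the
# repr() below is never taken while DEBUG is off.
logger.debug("state = %r", state)
```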
Performance issue with CPython 3.10 + Cython
Hi all,

I am wrapping an embedded application (which does not use any dynamic
memory management) using Cython to call it from CPython. The wrapped
application uses a cyclic executive, i.e. everything is done in the
input-logic-output design typical for some real-time related domains.
Consequently, every application cycle executes more or less the very same
code. As I am still at a prototyping stage, the wrapped process is
completely CPU-bound, i.e. except for some logging output there is no I/O
whatsoever.

During one second of "application time", I am doing exactly 120 calls into
the application through three Cython-wrapped API functions. The application
uses some platform-dependent APIs, which I have also wrapped with Cython,
so that there are numerous callbacks into the Python realm per call into
the application. What I am observing now is that the performance per
"application second" decreases (remember: executing code that does the same
thing on every cycle), and extending the number of loop iterations does not
seem to put any bound on this decrease. In the log output below, you can
see the GC counts, which look innocent to me. The "real time" is measured
using "time.time()". The "top" utility does not suggest any memory leak
either.

I am developing on WSL2, but I have checked that this performance effect
also happens on physical machines. Right now, I am staring at
"kcachegrind", but I have no idea how to determine time series for the
performance of functions (I am not looking for those functions which need
the most time, but for those which consume more and more execution time).

One more thing to look for could be memory fragmentation, but before that I
would like to ask the experts here for their ideas and experiences and/or
for tools which could help to find the culprit.

> 2022-10-04 10:40:50|INFO|__main__ |Execution loop 0 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.06862711906433105 seconds real time.
> 2022-10-04 10:40:51|INFO|__main__ |Execution loop 100 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.08224177360534668 seconds real time.
> 2022-10-04 10:40:52|INFO|__main__ |Execution loop 200 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.08225250244140625 seconds real time.
> 2022-10-04 10:40:53|INFO|__main__ |Execution loop 300 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.10176873207092285 seconds real time.
> 2022-10-04 10:40:54|INFO|__main__ |Execution loop 400 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.10900592803955078 seconds real time.
> 2022-10-04 10:40:55|INFO|__main__ |Execution loop 500 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.12233948707580566 seconds real time.
> 2022-10-04 10:40:56|INFO|__main__ |Execution loop 600 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.14058256149291992 seconds real time.
> 2022-10-04 10:40:58|INFO|__main__ |Execution loop 700 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.14777183532714844 seconds real time.
> 2022-10-04 10:40:59|INFO|__main__ |Execution loop 800 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.15729451179504395 seconds real time.
> 2022-10-04 10:41:01|INFO|__main__ |Execution loop 900 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.17365813255310059 seconds real time.
> 2022-10-04 10:41:03|INFO|__main__ |Execution loop 1000 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.17772984504699707 seconds real time.
> 2022-10-04 10:41:05|INFO|__main__ |Execution loop 1100 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.1955263614654541 seconds real time.
> 2022-10-04 10:41:07|INFO|__main__ |Execution loop 1200 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.20046710968017578 seconds real time.
> 2022-10-04 10:41:09|INFO|__main__ |Execution loop 1300 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.22513842582702637 seconds real time.
> 2022-10-04 10:41:11|INFO|__main__ |Execution loop 1400 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.23578548431396484 seconds real time.
> 2022-10-04 10:41:13|INFO|__main__ |Execution loop 1500 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.24581527709960938 seconds real time.
> 2022-10-04 10:41:16|INFO|__main__ |Execution loop 1600 done. GC counts = (381, 9, 3); 1 second of application time corresponds to 0.2541334629058838 seconds real time.
> 2022-10-04 10:41:19|INFO|__mai
Re: A performance issue when using default value
On Sun, 31 Jan 2010 20:58:50 -0800, keakon wrote:

> I've found strange performance issue when using default value, the test
> code is list below:
> [...]
> I tested it with Python 2.5.4, 2.6.4 and 3.1.1 on Windows XP, and get
> almost the same result:
> 0.00449247041174
> 0.00439608944712
> 0.00455867994396
> 0.00327471787615
> 0.00791581052899
> 0.00684919452053
> 0.00734311204357
> 0.30974942346

You've only run those results once, haven't you? That is not trustworthy.
In a multi-tasking operating system, including Windows, some other process
might have run for some milliseconds, throwing the results off.

More importantly, if you had run it multiple times, you'd see this:

>>> for i in range(7):
...     Timer('h2()', 'from __main__ import f2, g2, h2').timeit(TIMES)
...
0.0190200805664
0.315117120743
0.941411972046
1.56618499756
2.18553495407
2.79832291603
3.44865894318

h2 (and probably h, but I haven't tested it) gets slower and slower and
slower every time you run it. Why?
I'm pretty sure this is in the FAQs, but default values are created once,
not every time the function is called. So you create a function with the
default value of []. Call the function, and that list gets 1 appended to
it, so the default value is now [1]. Call the function again, and it
becomes [1, 1]. And so on, and so on, and so on.

The other functions create a new empty list each time the function is
called, so you're comparing the time it takes to grow an ever-bigger list
with the time it takes to create a tiny list.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
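Steven's explanation is directly observable with keakon's own `h2`: the default list is one object, created at `def` time and mutated on every defaulted call (Python 3 exposes it as `h2.__defaults__`):

```python
def h2(x=[]):
    y = x            # y is the SAME object as the function's default list
    y.append(1)
    return y + []    # the concatenation copies the ever-growing list

# Each defaulted call appends to the one shared default list:
first = h2()         # [1]
second = h2()        # [1, 1] -- the previous 1 is still there
```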
Re: A performance issue when using default value
keakon wrote:
> The default value is mutable, and is reused across calls. So each call
> appends 1 to the default value; that's very different from C++.

Being different from C++ is one of the many reasons some of us choose
Python ;)

This tends to bite most newcomers, so it's mentioned in the FAQ:
http://www.python.org/doc/faq/general/#id52
Re: A performance issue when using default value
On Feb 1, 1:20 pm, alex23 wrote:
> Sorry, I meant to suggest trying the following instead:
>
> def h2(x=None):
>     if x is None: x = []
>     x.append(1)
>     return x + []
>
> It's a common idiom to use None as a sentinel for this situation, esp.
> where you _don't_ want a default mutable object to be reused.

Thank you, I got it.

The default value is mutable, and is reused across calls. So each call
appends 1 to the default value; that's very different from C++.
Re: A performance issue when using default value
alex23 wrote:
> keakon wrote:
>> def h2(x=[]):
>>     y = x
>>     y.append(1)
>>     return y + []
>
> Are you aware that 'y = x' _doesn't_ make a copy of [], that it
> actually points to the same list as x?

Sorry, I meant to suggest trying the following instead:

def h2(x=None):
    if x is None: x = []
    x.append(1)
    return x + []

It's a common idiom to use None as a sentinel for this situation, esp.
where you _don't_ want a default mutable object to be reused.
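With alex23's None-sentinel version, repeated defaulted calls no longer share state, which is exactly what the timing anomaly was measuring:

```python
def h2(x=None):
    if x is None:
        x = []       # a fresh list on every defaulted call: no shared state
    x.append(1)
    return x + []
```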
Re: A performance issue when using default value
On Sun, Jan 31, 2010 at 8:58 PM, keakon wrote:
> I've found strange performance issue when using default value, the
> test code is list below:
> [...]
> h2() is about 42 times slower than h2([]), but h() is a little faster
> than h([]).
>
> If I change TIMES to 2, the other results are 2 times what they were
> before, but h2() takes 4 times as long (about 1.2 sec).
>
> Is there any trick in it?

Are you aware of the following pitfall?:

>>> def foo(a=[]):
...     a.append(7)
...     return a
...
>>> print foo()
[7]
>>> print foo()
[7, 7]
>>> print foo()
[7, 7, 7]

i.e. the default argument is only evaluated once (when the function is
defined) and is then reused on every call where the caller doesn't provide
a value.

Cheers,
Chris
--
http://blog.rebertia.com
Re: A performance issue when using default value
keakon wrote:
> def h2(x=[]):
>     y = x
>     y.append(1)
>     return y + []
>
> h2() is about 42 times slower than h2([]), but h() is a little faster
> than h([]).

Are you aware that 'y = x' _doesn't_ make a copy of [], that it actually
points to the same list as x?

My guess is that the slowdown you're seeing is due to the increasing size
of x _per timing iteration_. h2([]) will pass a new list each time, while
h2() will append to the same list _every_ time.

The difference between h & h2 is due to the concatenation of a new list to
the built one: the longer the default list grows, the longer this will
take, as extending a list takes O(k) time, with k being the number of
elements.
A performance issue when using default value
I've found a strange performance issue when using a default value; the test code is listed below:

from timeit import Timer

def f(x):
    y = x
    y.append(1)
    return y

def g(x=[]):
    y = []
    y.append(1)
    return y

def h(x=[]):
    y = x
    y.append(1)
    return y

def f2(x):
    y = x
    y.append(1)
    return y + []

def g2(x=[]):
    y = []
    y.append(1)
    return y + []

def h2(x=[]):
    y = x
    y.append(1)
    return y + []

TIMES = 1
print Timer('f([])','from __main__ import f, g, h').timeit(TIMES)
print Timer('g()','from __main__ import f, g, h').timeit(TIMES)
print Timer('h([])','from __main__ import f, g, h').timeit(TIMES)
print Timer('h()','from __main__ import f, g, h').timeit(TIMES)
print Timer('f2([])','from __main__ import f2, g2, h2').timeit(TIMES)
print Timer('g2()','from __main__ import f2, g2, h2').timeit(TIMES)
print Timer('h2([])','from __main__ import f2, g2, h2').timeit(TIMES)
print Timer('h2()','from __main__ import f2, g2, h2').timeit(TIMES)

I tested it with Python 2.5.4, 2.6.4 and 3.1.1 on Windows XP, and got almost the same result:

0.00449247041174
0.00439608944712
0.00455867994396
0.00327471787615
0.00791581052899
0.00684919452053
0.00734311204357
0.30974942346

h2() is about 42 times slower than h2([]), but h() is a little faster than h([]).

If I change TIMES to 2, the other results take about twice as long as before, but h2() takes about 4 times as long (about 1.2 sec).

Are there any tricks in it?
Re: Strange performance issue
Ulrich Eckhardt wrote: Dan Stromberg wrote: My new version formats an SD card and preallocates some file space in about 3 minutes with "Optimize Performance" selected, and in about 30 minutes with "Optimize for Quick Removal" selected. Needless to say, I don't like the 27 minute penalty much. For performance, the driver will probably use delayed write operations, buffering etc. For quick removal, it will probably make all operations atomic, i.e. perform a write operation and not return until the data has really reached the SD card. The driver may delay writes, but I'd think it more likely that the filesystem or buffer cache would be doing so. But the old version of the program formats and preallocates in 3 minutes irrespective of whether the drive is in "optimize performance" or "optimize for quick removal". Somehow, I guess the new version does many small writes while the old one doesn't. When optimizing for quick removal, those operations add up, otherwise their overhead is negligible. Nope, same block size. one_meg = 'a'*2**20 That's one mib, not one meg. ;) You're aware that a lot of people are ignoring the new units? file = open('remove-me', 'w') Try adding the 'b' flag, too. I wouldn't expect it to affect the IO speed, but it is one factor that adds up. This appears to have wholly accounted for the problem! Thank you very much. Adding a "b" to my open sped up the writes by a factor of about 15. Otherwise, I looked over your program and couldn't find anything that would explain a problem. Just wondering, what transfer speed do you get with the two versions? What is the theoretical maximum for the SD card? I don't know what the theoretical max write speeds of the USB bus, card reader and card are, but I was getting over 10x faster using the same card and card reader on an ancient Thinkpad running Ubuntu. With the "b" specified, the XP system is now faster than the Thinkpad. Interesting that the "b" would make such a performance difference. 
You'd think it'd just be, in C, looking for newlines and adding a carriage return after them into a buffer of potentially double the size, then writing as usual. The data I've been writing so far -has- no newlines. Thanks!
Re: Strange performance issue
Steven D'Aprano wrote: On Mon, 05 Oct 2009 22:31:05 -0700, Dan Stromberg wrote: I'm rewriting 3 programs as one program - from Python with Tkinter to Python with pygtk, both on Windows XP. My new version formats an SD card and preallocates some file space in about 3 minutes with "Optimize Performance" selected, and in about 30 minutes with "Optimize for Quick Removal" selected. Needless to say, I don't like the 27 minute penalty much. I don't understand what that means. How are you formatting the SD card? The FormatEx function via ctypes. I'm guessing that Optimize for Quick Removal means that every write is immediately synced to disk. That will probably be slow, no matter what. Yes, presumably Optimize for Quick Removal is a write-through cache (synchronous) and "Optimize for Performance" is a write-back cache (asynchronous). But the old version of the program formats and preallocates in 3 minutes irrespective of whether the drive is in "optimize performance" or "optimize for quick removal". I suspect that if there was no performance penalty in the old version, it was because you inadvertently were always using "Optimize Performance" no matter what. I'm pretty confident that unless Windows was lying to me, that I did some tests with Optimize for Performance and some tests without. BTW, if you want to pre-allocate some space, and you don't care what is in the file, you *may* be able to use file.truncate() to do so quickly. Despite the name, it doesn't just truncate files, it can also be used to extend them. (At least on POSIX systems.) On a POSIX system, I'd normally just seek and write a single null to get a file with holes. But this is not only Windows, but FAT32 - and the data will later be read without going through the filesystem. It seems prudent to actually write the blocks this time. Thanks, and see my other message (to be sent momentarily) for an apparent solution. 
>>> f = open('empty', 'w')
>>> f.seek(1000)
>>> f.truncate()
>>> f.close()
>>> f = open('empty', 'r')
>>> len(f.read())
1000
>>> f.close()
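The POSIX trick Dan mentions (seek past the end and write a single null to get a file with holes) can be sketched as follows. On filesystems that support sparse files the unwritten region consumes no disk blocks, which is exactly why it is unsuitable here, where the FAT32 blocks must really exist:

```python
import os
import tempfile

size = 10 * 1024 * 1024  # 10 MiB logical size

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
with open(tmp.name, "wb") as f:
    f.seek(size - 1)     # jump past the region we never actually write
    f.write(b"\0")       # one real byte at the end; the rest is a "hole"

print(os.path.getsize(tmp.name))  # 10485760 -- the logical size, instantly
os.remove(tmp.name)
```

`os.path.getsize` reports the logical length; on Linux, `os.stat().st_blocks` would reveal that almost no blocks are allocated.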
Re: Strange performance issue
Dan Stromberg wrote: > My new version formats an SD card and preallocates some file space in > about 3 minutes with "Optimize Performance" selected, and in about 30 > minutes with "Optimize for Quick Removal" selected. Needless to say, I > don't like the 27 minute penalty much. For performance, the driver will probably use delayed write operations, buffering etc. For quick removal, it will probably make all operations atomic, i.e. perform a write operation and not return until the data has really reached the SD card. > But the old version of the program formats and preallocates in 3 minutes > irrespective of whether the drive is in "optimize performance" or > "optimize for quick removal". Somehow, I guess the new version does many small writes while the old one doesn't. When optimizing for quick removal, those operations add up, otherwise their overhead is negligible. > one_meg = 'a'*2**20 That's one mib, not one meg. ;) > file = open('remove-me', 'w') Try adding the 'b' flag, too. I wouldn't expect it to affect the IO speed, but it is one factor that adds up. Otherwise, I looked over your program and couldn't find anything that would explain a problem. Just wondering, what transfer speed do you get with the two versions? What is the theoretical maximum for the SD card? Uli -- Sator Laser GmbH Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932 -- http://mail.python.org/mailman/listinfo/python-list
Re: Strange performance issue
On Mon, 05 Oct 2009 22:31:05 -0700, Dan Stromberg wrote: > I'm rewriting 3 programs as one program - from Python with Tkinter to > Python with pygtk, both on Windows XP. > > My new version formats an SD card and preallocates some file space in > about 3 minutes with "Optimize Performance" selected, and in about 30 > minutes with "Optimize for Quick Removal" selected. Needless to say, I > don't like the 27 minute penalty much. I don't understand what that means. How are you formatting the SD card? I'm guessing that Optimize for Quick Removal means that every write is immediately synced to disk. That will probably be slow, no matter what. > But the old version of the program formats and preallocates in 3 minutes > irrespective of whether the drive is in "optimize performance" or > "optimize for quick removal". I suspect that if there was no performance penalty in the old version, it was because you inadvertently were always using "Optimize Performance" no matter what. BTW, if you want to pre-allocate some space, and you don't care what is in the file, you *may* be able to use file.truncate() to do so quickly. Despite the name, it doesn't just truncate files, it can also be used to extend them. (At least on POSIX systems.) >>> f = open('empty', 'w') >>> f.seek(1000) >>> f.truncate() >>> f.close() >>> f = open('empty', 'r') >>> len(f.read()) 1000 >>> f.close() -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Strange performance issue
I'm rewriting 3 programs as one program - from Python with Tkinter to Python with pygtk, both on Windows XP.

My new version formats an SD card and preallocates some file space in about 3 minutes with "Optimize Performance" selected, and in about 30 minutes with "Optimize for Quick Removal" selected. Needless to say, I don't like the 27 minute penalty much.

But the old version of the program formats and preallocates in 3 minutes irrespective of whether the drive is in "optimize performance" or "optimize for quick removal".

I'm baffled. I've gone over the old code, and found nothing relevant looking, and I've never seen anything this puzzling in my years of work on *ix. Can someone please elucidate? (I've also written a pair of minimal programs in Python and C that exhibit the same behavior on this XP system - so it's not just my GUI app)

Thanks!

PS: Here's the minimal python test program I mentioned. It's dog slow unless I Optimize for Performance the drive.:

#!/cygdrive/c/Python26/python

import os
import time

one_meg = 'a'*2**20

t0 = time.time()
i = 0
file = open('remove-me', 'w')
while True:
    i += 1
    file.write(one_meg)
    t1 = time.time()
    bits_written = i*2**20*8
    duration = t1-t0
    print i, str(duration)+'s', bits_written/(duration*1024*1024), 'Mbps'
    if duration > 120:
        break
file.close()
os.unlink('remove-me')
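For reference, a hedged Python 3 sketch of the same benchmark: a bytes buffer with binary mode (the `'b'` fix that resolved the slowdown elsewhere in this thread), a bounded write count instead of a 120-second wall clock, and a temp file rather than a hard-coded name:

```python
import os
import tempfile
import time

one_meg = b"a" * 2**20           # bytes, since the file is opened in binary mode

path = tempfile.NamedTemporaryFile(delete=False).name
t0 = time.time()
written = 0
with open(path, "wb") as f:      # 'b' avoids text-mode newline handling on Windows
    for _ in range(16):          # write 16 MiB and stop, rather than timing out
        f.write(one_meg)
        written += len(one_meg)
duration = time.time() - t0
print(written, "bytes in", round(duration, 3), "s")
os.remove(path)
```

On a POSIX system text vs. binary mode makes no speed difference, which is why the slowdown only showed up on the XP box.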
Re: Levenshtein word comparison -performance issue
On Sat, Feb 14, 2009 at 3:01 PM, Peter Otten <__pete...@web.de> wrote: > Gabriel Genellina wrote: > > > En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva < > s.selvams...@gmail.com> > > escribió: > > > >> I need some help. > >> I tried to find top n(eg. 5) similar words for a given word, from a > >> dictionary of 50,000 words. > >> I used python-levenshtein module,and sample code is as follow. > >> > >> def foo(searchword): > >> disdict={} > >> for word in self.dictionary-words: > >>distance=Levenshtein.ratio(searchword,word) > >>disdict[word]=distance > >> """ > >> sort the disdict dictionary by values in descending order > >> """ > >> similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True) > >> > >> return similarwords[:5] > > > > You may replace the last steps (sort + slice top 5) by heapq.nlargest - > at > > least you won't waste time sorting 49995 irrelevant words... > > Anyway you should measure the time taken by the first part (Levenshtein), > > it may be the most demanding. I think there is a C extension for this, > > should be much faster than pure Python calculations. > > > > [I didn't see the original post] > > You can use the distance instead of the ratio and put the words into bins > of > the same length. Then if you find enough words with a distance <= 1 in the > bin with the same length as the search word you can stop looking. > > You might be able to generalize this approach to other properties that are > fast to calculate and guarantee a minimum distance, e. g. set(word). > > Peter > -- > http://mail.python.org/mailman/listinfo/python-list > Thank you all for your response, [sorry,I was away for a while.] I used functools,heapq modules but that helped me a little, then i categorized the words depending on the length and compares with a small set(each set 5/4=12,500), so now its taking quarter of time as compared to older method. Further, can i use Thread to achieve parallel comparison ?,as i have little knowledge on python-thread. 
Will the following code achieve parallelism?

thread1 = threading.Thread(target=self.findsimilar, args=("1", searchword, dic_word_set1))
thread2 = threading.Thread(target=self.findsimilar, args=("2", searchword, dic_word_set2))
thread3 = threading.Thread(target=self.findsimilar, args=("3", searchword, dic_word_set3))
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()

I would like to hear suggestions.

Note: The issue is that I am developing a spell checker for my local language; I may use more than 2.5 lakh (250,000) words, so I need the best way to find an alternative wordlist.

-- Yours, S.Selvam
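On the threading question: for CPU-bound pure-Python scoring, the GIL means threads will not run the comparisons in parallel (a C extension like python-Levenshtein may or may not release it; a ProcessPoolExecutor sidesteps the GIL entirely). A hedged sketch of the chunked approach, where `score` is a stdlib stand-in for `Levenshtein.ratio` and `findsimilar` is an assumed shape for the poster's method:

```python
from concurrent.futures import ThreadPoolExecutor
import difflib
import heapq

def score(searchword, word):
    # stand-in for Levenshtein.ratio, using only the standard library
    return difflib.SequenceMatcher(None, searchword, word).ratio()

def findsimilar(searchword, words, n=5):
    # top-n scan of one chunk; no dict, no full sort
    return heapq.nlargest(n, words, key=lambda w: score(searchword, w))

def parallel_findsimilar(searchword, wordsets, n=5):
    # one task per dictionary chunk; merge the per-chunk top-n lists
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda ws: findsimilar(searchword, ws, n),
                                 wordsets))
    merged = [w for part in partials for w in part]
    return heapq.nlargest(n, merged, key=lambda w: score(searchword, w))

words = ["hello", "hallo", "help", "world", "hull", "shell"]
chunks = [words[:3], words[3:]]
print(parallel_findsimilar("hell", chunks, n=3))
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` gives real parallelism for pure-Python scoring, at the cost of pickling each chunk across process boundaries.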
Re: Levenshtein word comparison -performance issue
Gabriel Genellina wrote: > En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva > escribió: > >> I need some help. >> I tried to find top n(eg. 5) similar words for a given word, from a >> dictionary of 50,000 words. >> I used python-levenshtein module,and sample code is as follow. >> >> def foo(searchword): >> disdict={} >> for word in self.dictionary-words: >>distance=Levenshtein.ratio(searchword,word) >>disdict[word]=distance >> """ >> sort the disdict dictionary by values in descending order >> """ >> similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True) >> >> return similarwords[:5] > > You may replace the last steps (sort + slice top 5) by heapq.nlargest - at > least you won't waste time sorting 49995 irrelevant words... > Anyway you should measure the time taken by the first part (Levenshtein), > it may be the most demanding. I think there is a C extension for this, > should be much faster than pure Python calculations. > [I didn't see the original post] You can use the distance instead of the ratio and put the words into bins of the same length. Then if you find enough words with a distance <= 1 in the bin with the same length as the search word you can stop looking. You might be able to generalize this approach to other properties that are fast to calculate and guarantee a minimum distance, e. g. set(word). Peter -- http://mail.python.org/mailman/listinfo/python-list
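Peter's binning idea can be sketched like this. `edit_distance` is a small pure-Python Levenshtein implementation standing in for the C module, and the early-exit cutoff (`want`) is an assumption about when "enough" close words have been found:

```python
from collections import defaultdict

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def build_bins(words):
    # bucket the dictionary by word length, once, up front
    bins = defaultdict(list)
    for w in words:
        bins[len(w)].append(w)
    return bins

def close_words(searchword, bins, max_dist=1, want=5):
    found = []
    # a word within distance d can differ in length by at most d,
    # so only bins of nearby lengths need to be scanned at all
    for length in range(len(searchword) - max_dist,
                        len(searchword) + max_dist + 1):
        for w in bins.get(length, ()):
            if edit_distance(searchword, w) <= max_dist:
                found.append(w)
                if len(found) >= want:
                    return found      # early exit: enough close words
    return found

bins = build_bins(["cat", "cot", "coat", "dog", "cart", "ca"])
print(close_words("cat", bins))
```

For a 50,000-word dictionary this skips every bin whose length differs by more than `max_dist`, which is where the savings come from.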
Re: Levenshtein word comparison -performance issue
On Feb 13, 5:42 am, "Gabriel Genellina" wrote: > You may replace the last steps (sort + slice top 5) by heapq.nlargest - at > least you won't waste time sorting 49995 irrelevant words... > Anyway you should measure the time taken by the first part (Levenshtein), > it may be the most demanding. I think there is a C extension for this, > should be much faster than pure Python calculations. subdist: http://pypi.python.org/pypi/subdist/0.2.1 It uses a modified "fuzzy" version of the Levenshtein algorithm, which I found more useful than the strict version. The only quirk to it is that it accepts nothing but unicode. Other than that, it's a keeper. It is extremely fast. Cheers, -Basilisk96 -- http://mail.python.org/mailman/listinfo/python-list
Re: Levenshtein word comparison -performance issue
Gabriel Genellina wrote: En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva escribió: I need some help. I tried to find top n(eg. 5) similar words for a given word, from a dictionary of 50,000 words. I used python-levenshtein module,and sample code is as follow. def foo(searchword): disdict={} for word in self.dictionary-words: distance=Levenshtein.ratio(searchword,word) disdict[word]=distance """ sort the disdict dictionary by values in descending order """ similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True) return similarwords[:5] You may replace the last steps (sort + slice top 5) by heapq.nlargest - at least you won't waste time sorting 49995 irrelevant words... There is also no need to build the 5 entry word-distance dictionary. import heapq, functools def foo(searchword, n): distance = functools.partial(Levenshtein.ratio, searchword) return heapq.nlargest(n, words, distance) If the distances are wanted along with the similar words, I strongly suspect that it would be faster to recalculate a small number than to generate the dict of 5 pairs. Anyway you should measure the time taken by the first part (Levenshtein), it may be the most demanding. I think there is a C extension for this, should be much faster than pure Python calculations. And such could be dropped into the code above. Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
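Terry's `functools.partial` + `heapq.nlargest` version runs as-is once a ratio function is supplied; here is a self-contained sketch using difflib's ratio as a stdlib stand-in for `Levenshtein.ratio`:

```python
import difflib
import functools
import heapq

def ratio(searchword, word):
    # stand-in for Levenshtein.ratio (a C extension in the thread)
    return difflib.SequenceMatcher(None, searchword, word).ratio()

def foo(searchword, words, n=5):
    similarity = functools.partial(ratio, searchword)
    # no intermediate dict and no full sort: one O(len(words)) pass
    return heapq.nlargest(n, words, key=similarity)

words = ["apple", "ample", "maple", "apply", "orange", "grape"]
print(foo("apple", words, n=3))
```

`nlargest` keeps only an n-element heap while scanning, so for n=5 out of 50,000 words it does far less comparison work than sorting the whole dictionary.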
Re: Levenshtein word comparison -performance issue
En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva escribió: I need some help. I tried to find top n(eg. 5) similar words for a given word, from a dictionary of 50,000 words. I used python-levenshtein module,and sample code is as follow. def foo(searchword): disdict={} for word in self.dictionary-words: distance=Levenshtein.ratio(searchword,word) disdict[word]=distance """ sort the disdict dictionary by values in descending order """ similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True) return similarwords[:5] You may replace the last steps (sort + slice top 5) by heapq.nlargest - at least you won't waste time sorting 49995 irrelevant words... Anyway you should measure the time taken by the first part (Levenshtein), it may be the most demanding. I think there is a C extension for this, should be much faster than pure Python calculations. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Levenshtein word comparison -performance issue
Hi all,

I need some help. I tried to find the top n (e.g. 5) similar words for a given word, from a dictionary of 50,000 words. I used the python-levenshtein module, and sample code is as follows:

def foo(searchword):
    disdict = {}
    for word in self.dictionary_words:
        distance = Levenshtein.ratio(searchword, word)
        disdict[word] = distance
    # sort the disdict dictionary by values in descending order
    similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
    return similarwords[:5]

foo() takes a search word and compares it with the dictionary of 50,000 words, assigning each word a value (between 0 and 1). After sorting in descending order, it returns the top 5 similar words. The problem is that it *takes a long time* to process (as I need to pass many search words within a loop); I guess the code could be improved to work more efficiently. Your suggestions are welcome...

-- Yours, S.Selvam
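Worth noting alongside the replies above: the standard library's difflib already offers a one-liner for exactly this "top n similar words" task. Its `SequenceMatcher` ratio is not the same measure as Levenshtein ratio, but it serves the same purpose:

```python
import difflib

dictionary = ["hello", "hallo", "help", "hull", "shell", "world"]

# n is the maximum number of matches; cutoff discards poor ones early
matches = difflib.get_close_matches("hell", dictionary, n=5, cutoff=0.6)
print(matches)
```

`get_close_matches` also short-circuits cheaply on length and character-count bounds before computing the full ratio, so it is often fast enough in practice for moderate dictionaries.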
Re: cgitb performance issue
Gabriel Genellina wrote: En Wed, 14 May 2008 13:51:40 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió: Gabriel Genellina wrote: En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió: I tried adding a form to our website for uploading large files. Personally, I dislike the forms that tell you you did something wrong and make you re-enter *all* your data again, so this one cycles and remembers your answers, and only prompts for the file once the rest of the entered data is good. However, the first time the form loads it can take up to 30 seconds... any ideas why? Hard to tell without looking at the code... And what has cgitb to do with this? H... excellent question, and the answer is -- Nothing. My apologies. I'm importing cgi (cgitb was while I was debugging it), as well as a bunch of others. Here's the source, with a bunch of the actual html stripped out. The -u as well as the last two lines were an attempt to eliminate the 30-second pause while it loads, as it seems to get all data transferred, then just waits for a while. Any ideas appreciated. My apologies for the ugly code. *When* do you see the 30-secs delay? After uploading the files? I see they're transferred using SMTP, but I think you're not talking about the time it takes to send the mail. I'd use the old-fashioned "print" statement to see *what* gets executed and how long it takes each step. The initial load of the page in question is a static html page (done as a crutch so there's *something* on screen while the rest of the thing loads). After five seconds the static html page redirects to the python cgi script -- and that's where the delay occurs. I just double checked, and it also occurs everytime it has to reload the script. I'll try the print statements and see if I can narrow down where the bottleneck is located. The sending of the file via smtp happens on our server after the file has been uploaded, the delays are happening long before that. Thanks for your help! 
-- Ethan -- http://mail.python.org/mailman/listinfo/python-list
Re: cgitb performance issue
En Wed, 14 May 2008 13:51:40 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió: Gabriel Genellina wrote: En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió: I tried adding a form to our website for uploading large files. Personally, I dislike the forms that tell you you did something wrong and make you re-enter *all* your data again, so this one cycles and remembers your answers, and only prompts for the file once the rest of the entered data is good. However, the first time the form loads it can take up to 30 seconds... any ideas why? Hard to tell without looking at the code... And what has cgitb to do with this? H... excellent question, and the answer is -- Nothing. My apologies. I'm importing cgi (cgitb was while I was debugging it), as well as a bunch of others. Here's the source, with a bunch of the actual html stripped out. The -u as well as the last two lines were an attempt to eliminate the 30-second pause while it loads, as it seems to get all data transferred, then just waits for a while. Any ideas appreciated. My apologies for the ugly code. *When* do you see the 30-secs delay? After uploading the files? I see they're transferred using SMTP, but I think you're not talking about the time it takes to send the mail. I'd use the old-fashioned "print" statement to see *what* gets executed and how long it takes each step. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Re: cgitb performance issue
Gabriel Genellina wrote: En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió: I tried adding a form to our website for uploading large files. Personally, I dislike the forms that tell you you did something wrong and make you re-enter *all* your data again, so this one cycles and remembers your answers, and only prompts for the file once the rest of the entered data is good. However, the first time the form loads it can take up to 30 seconds... any ideas why? Hard to tell without looking at the code... And what has cgitb to do with this? H... excellent question, and the answer is -- Nothing. My apologies. I'm importing cgi (cgitb was while I was debugging it), as well as a bunch of others. Here's the source, with a bunch of the actual html stripped out. The -u as well as the last two lines were an attempt to eliminate the 30-second pause while it loads, as it seems to get all data transferred, then just waits for a while. Any ideas appreciated. My apologies for the ugly code. 
-- Ethan #!/usr/local/bin/python -u import cgi #import cgitb; cgitb.enable() import smtplib import mimetypes import os import sys from email.Encoders import encode_base64 from email.MIMEBase import MIMEBase from email.MIMEText import MIMEText from email.MIMEMultipart import MIMEMultipart HEADER = """Content-type: text/html\n """ SELECTFILES = """ """ FILESRECEIVED = """ """ VERIFIED = """ """ FOOTER = """ """ REQUIRED = '*required' form = cgi.FieldStorage() clientname = clientemail = clientcsr = subject = body = "" csrs = {"frank":{'html':'selected="yes"', 'name':'', 'email':''}, \ "kathy":{'html':'', 'name':'', 'email':'removed>'}, \ "chris":{'html':'', 'name':'', 'email':'removed>'}, \ "tracy":{'html':'', 'name':'', 'email':'removed>'}, \ "susi": {'html':'', 'name':'', 'email':'removed>'}, \ "bill": {'html':'', 'name':'', 'email':'removed>'}} incomplete = False files = [] filelist = [] if form: if form.has_key("name"): clientname = form["name"].value if form.has_key("email"): clientemail = form["email"].value if form.has_key("CSR"): clientcsr = form["CSR"].value csrs["frank"]['html'] = '' # clear default csrs[clientcsr]['html'] = 'selected="yes"' if form.has_key("subject"): subject = form["subject"].value if form.has_key("body"): body = form["body"].value incomplete = not (clientname and clientemail and clientcsr and subject) if form.has_key("file1"): file1 = form["file1"] if file1.filename: files.append(file1.filename + "") filelist.append(file1) if form.has_key("file2"): file2 = form["file2"] if file2.filename: files.append(file2.filename + "") filelist.append(file2) if form.has_key("file3"): file3 = form["file3"] if file3.filename: files.append(file3.filename + "") filelist.append(file3) if form.has_key("file4"): file4 = form["file4"] if file4.filename: files.append(file4.filename + "") filelist.append(file4) if form.has_key("file5"): file5 = form["file5"] if file5.filename: files.append(file5.filename + "") filelist.append(file5) def csrform(): cn_present = 
'' ce_present = '' sb_present = '' if incomplete: if not clientname: cn_present = REQUIRED if not clientemail: ce_present = REQUIRED if not subject: sb_present = REQUIRED print """ method="POST" enctype="multipart/form-data"> Please fill in your name, e-mail address, subject, and select your CSR. You will then be able to upload files on the next screen. Any additional notes may go in the Message field. """ print """ Your namesize="40" maxlength="50" name="name" value="%s">%s""" \ % (clientname, cn_present) print """ Your e-mailsize="40" maxlength="100" name="email" value="%s">%s""" \ % (clientemail, ce_present) print """ Subject linesize="40" maxlength="100" name="subject" value="%s">%s""" \ % (subject, sb_present) print """ Your CSR Frank Kathy Chris Tracy Susi Bill """ \ % (csrs["frank"]['html'], csrs["kathy"]['html'], csrs["chris"]['html'], csrs["tracy"]['html'], csrs["susi"]['html'], csrs["bill"]['html']) print """ """ % (body) def uploadform(): try: msg = MIMEMultipart() msg['To'] =
Re: cgitb performance issue
En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió: > I tried adding a form to our website for uploading large files. > Personally, I dislike the forms that tell you you did something wrong > and make you re-enter *all* your data again, so this one cycles and > remembers your answers, and only prompts for the file once the rest of > the entered data is good. However, the first time the form loads it can > take up to 30 seconds... any ideas why? Hard to tell without looking at the code... And what has cgitb to do with this? -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
cgitb performance issue
Greetings! I tried adding a form to our website for uploading large files. Personally, I dislike the forms that tell you you did something wrong and make you re-enter *all* your data again, so this one cycles and remembers your answers, and only prompts for the file once the rest of the entered data is good. However, the first time the form loads it can take up to 30 seconds... any ideas why? My server is using Apache 2.0.53 and Python 2.4.1 on a FreeBSD 5.4 system. Thanks! -- Ethan -- http://mail.python.org/mailman/listinfo/python-list
Re: Random Drawing Simulation -- performance issue
David J. Braden wrote: > Travis E. Oliphant wrote: >> Brendon Towle wrote: >>> I need to simulate scenarios like the following: "You have a deck of >>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, >>> replace it, and repeat N times." >>> >> >> Thinking about the problem as drawing sample froms a discrete >> distribution defined by the population might help. >> >> For example, in SciPy you can define your own discrete random variable: >> >> var = scipy.stats.rv_discrete(name='sample', >> values=([0,1,2],[3/10.,5/10.,2/10.])) >> >> Then >> >> var.rvs(size=1) >> >> would quickly return an array of "draws" >> >> If you didn't want to install SciPy, but were willing to install >> NumPy, then the crux of the algorithm is to generate an entire array >> of uniform random variables: numpy.random.rand(count) and then map >> them through the inverse cumulative distribution function to generate >> your samples. The inverse cumulative distribution function is just a >> series of steps whose width depends on the probablity distribution. >> >> Thus, the population defines the draw boundaries. To make it easy, >> just multiply the uniform random number by the total number of cards >> and then the boundaries are on the integers of the number of each kind >> of card. >> >> Here is an implementation. I find this version to be 2x - 5x faster >> depending on how many draws are used. >> >> >> import numpy as N >> >> def randomDrawing_N(count, population): >> probs = [x[0] for x in population] >> res = [[0, item[1]] for item in population] >> nums = N.random.rand(count) >> cprobs = N.cumsum([0]+probs) >> nums *= cprobs[-1] >> for k, val in enumerate(probs): >> res[k][0] += ((nums > cprobs[k]) & (nums < cprobs[k+1])).sum() >> return res >> >> >> -Travis Oliphant >> > > In response to what Travis and Simon wrote - > (1) Where the heck can I find a description of scipy's stat functions? > Documentation on these seems sparse. 
> > (2) How does one set up a good timer for Python as implemented for > Windows? (i.e., how can I make API calls to Windows from Python?) > > Please bear with me, or not; I am /just/ starting off with Python. > > Thanks, > Dave Braden Oops, or rather, duh. Rereading Robert's post clarified for me where multinomial is, and that to get help I need to specify more than simply help(multinomial). Sheesh, I'm dim today. My question re timing code stands. TIA, DaveB -- http://mail.python.org/mailman/listinfo/python-list
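Travis's numpy version maps uniform random numbers through the inverse cumulative distribution. The same idea works in the pure standard library with `itertools.accumulate` plus `bisect`; this is an illustrative sketch of the technique, not his exact code:

```python
import bisect
import random
from itertools import accumulate

def random_drawing(count, population):
    # population: list of (num_cards, colour) pairs, as in the thread
    counts = [n for n, _ in population]
    colours = [c for _, c in population]
    cum = list(accumulate(counts))        # e.g. [3, 8, 10] for a 10-card deck
    total = cum[-1]
    drawn = [0] * len(population)
    for _ in range(count):
        u = random.random() * total       # uniform over [0, total)
        drawn[bisect.bisect(cum, u)] += 1  # inverse-CDF lookup: find the bin
    return list(zip(drawn, colours))

random.seed(0)
print(random_drawing(1000, [(3, "orange"), (5, "yellow"), (2, "blue")]))
```

Each draw costs O(log k) for k card colours, and no per-draw list of the whole deck is ever built.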
Re: Random Drawing Simulation -- performance issue
Travis E. Oliphant wrote: > Brendon Towle wrote: >> I need to simulate scenarios like the following: "You have a deck of >> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, >> replace it, and repeat N times." >> > > Thinking about the problem as drawing sample froms a discrete > distribution defined by the population might help. > > For example, in SciPy you can define your own discrete random variable: > > var = scipy.stats.rv_discrete(name='sample', > values=([0,1,2],[3/10.,5/10.,2/10.])) > > Then > > var.rvs(size=1) > > would quickly return an array of "draws" > > If you didn't want to install SciPy, but were willing to install NumPy, > then the crux of the algorithm is to generate an entire array of uniform > random variables: numpy.random.rand(count) and then map them through > the inverse cumulative distribution function to generate your samples. > The inverse cumulative distribution function is just a series of steps > whose width depends on the probablity distribution. > > Thus, the population defines the draw boundaries. To make it easy, just > multiply the uniform random number by the total number of cards and then > the boundaries are on the integers of the number of each kind of card. > > Here is an implementation. I find this version to be 2x - 5x faster > depending on how many draws are used. > > > import numpy as N > > def randomDrawing_N(count, population): > probs = [x[0] for x in population] > res = [[0, item[1]] for item in population] > nums = N.random.rand(count) > cprobs = N.cumsum([0]+probs) > nums *= cprobs[-1] > for k, val in enumerate(probs): > res[k][0] += ((nums > cprobs[k]) & (nums < cprobs[k+1])).sum() > return res > > > -Travis Oliphant > In response to what Travis and Simon wrote - (1) Where the heck can I find a description of scipy's stat functions? Documentation on these seems sparse. (2) How does one set up a good timer for Python as implemented for Windows? 
(i.e., how can I make API calls to Windows from Python?) Please bear with me, or not; I am /just/ starting off with Python. Thanks, Dave Braden -- http://mail.python.org/mailman/listinfo/python-list
Re: Random Drawing Simulation -- performance issue
On 13 Sep 2006, at 1:01 AM, [EMAIL PROTECTED] wrote: > Date: 12 Sep 2006 20:17:47 -0700 > From: Paul Rubin <http://[EMAIL PROTECTED]> > Subject: Re: Random Drawing Simulation -- performance issue > To: python-list@python.org > > "Travis E. Oliphant" <[EMAIL PROTECTED]> writes: >>> I need to simulate scenarios like the following: "You have a deck of >>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, >>> replace it, and repeat N times." >>> >> Thinking about the problem as drawing sample froms a discrete >> distribution defined by the population might help. > > Is there some important reason you want to do this as a simulation? > And is the real problem more complicated? If you draw from the > distribution 100,000 times with replacement and sum the results, per > the Central Limit Theorem you'll get something very close to a normal > distribution whose parameters you can determine analytically. There > is probably also some statistics formula to find the precise error. > So you can replace the 100,000 draws with a single draw. The real problem is not substantially more complicated. (The real code is, because it's embedded in a bunch of other stuff, but that's not the point.) I guess the essential reason that I want to do it as a simulation, and not as a statistics formula, is that I'd like the code to be readable (and modifiable) by a programmer who doesn't have a statistics background. I could dredge up enough of my college stats to do as you suggest (although I might not enjoy it), but I don't think I want to make that a requirement. On the other hand (quote somewhat snipped): > Date: Tue, 12 Sep 2006 22:46:04 -0500 > From: Robert Kern <[EMAIL PROTECTED]> > Subject: Re: Random Drawing Simulation -- performance issue > To: python-list@python.org > > Along the lines of what you're trying to get at, the problem that > the OP is > describing is one of sampling from a multinomial distribution. 
> > numpy has a function that will do the sampling for you: > > In [4]: numpy.random.multinomial? > Docstring: > Multinomial distribution. > > multinomial(n, pvals, size=None) -> random values > > pvals is a sequence of probabilities that should sum to 1 > (however, the > last element is always assumed to account for the remaining > probability > as long as sum(pvals[:-1]) <= 1). Here, I'm torn. I do want the code to be accessible to non-stats people, but this just might do the trick. Must ponder. Thanks, everyone, for your helpful suggestions! B. -- Brendon Towle, PhD Cognitive Scientist +1-412-690-2442x127 Carnegie Learning, Inc. The Cognitive Tutor Company ® Helping over 375,000 students in 1000 school districts succeed in math. -- http://mail.python.org/mailman/listinfo/python-list
Re: Random Drawing Simulation -- performance issue
On 12 Sep 2006, at 6:33 PM, [EMAIL PROTECTED] wrote: > Date: 12 Sep 2006 15:23:51 -0700 > From: "Simon Forman" <[EMAIL PROTECTED]> > Subject: Re: Random Drawing Simulation -- performance issue > > Brendon Towle wrote: >> I need to simulate scenarios like the following: "You have a deck of >> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, >> replace it, and repeat N times." >> [my original code snipped] > > I got nearly a 2x speed up with this variant: > > def randomDrawing3(count, population): > res = [[0, item[1]] for item in population] > mapping = [] > for i in xrange(len(population)): > mapping.extend([i]*population[i][0]) > > n = len(mapping) > for i in xrange(count): > index = int(n * random.random()) > res[mapping[index]][0] += 1 > > return res Excellent! For some reason, the speedup I get is only ~1.5x, but that's still non-trivial. Thanks much for the pointer- B. -- Brendon Towle, PhD Cognitive Scientist +1-412-690-2442x127 Carnegie Learning, Inc. The Cognitive Tutor Company ® Helping over 375,000 students in 1000 school districts succeed in math. -- http://mail.python.org/mailman/listinfo/python-list
Re: Random Drawing Simulation -- performance issue
Paul Rubin wrote: > "Travis E. Oliphant" <[EMAIL PROTECTED]> writes: >>> I need to simulate scenarios like the following: "You have a deck of >>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, >>> replace it, and repeat N times." >>> >> Thinking about the problem as drawing sample froms a discrete >> distribution defined by the population might help. > > Is there some important reason you want to do this as a simulation? > And is the real problem more complicated? If you draw from the > distribution 100,000 times with replacement and sum the results, per > the Central Limit Theorem you'll get something very close to a normal > distribution whose parameters you can determine analytically. There > is probably also some statistics formula to find the precise error. > So you can replace the 100,000 draws with a single draw. Along the lines of what you're trying to get at, the problem that the OP is describing is one of sampling from a multinomial distribution. http://en.wikipedia.org/wiki/Multinomial_distribution numpy has a function that will do the sampling for you: In [4]: numpy.random.multinomial? Type: builtin_function_or_method Base Class: String Form: Namespace: Interactive Docstring: Multinomial distribution. multinomial(n, pvals, size=None) -> random values pvals is a sequence of probabilities that should sum to 1 (however, the last element is always assumed to account for the remaining probability as long as sum(pvals[:-1]) <= 1). Sampling from the multinomial distribution is quite simply implemented given a binomial sampler. Unfortunately, the standard library's random module does not have one. If the number of samples is high enough, then one might be able to approximate the binomial distribution with a normal one, but you'd be better off just installing numpy. 
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list
Re: Random Drawing Simulation -- performance issue
"Travis E. Oliphant" <[EMAIL PROTECTED]> writes: > > I need to simulate scenarios like the following: "You have a deck of > > 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, > > replace it, and repeat N times." > > > Thinking about the problem as drawing sample froms a discrete > distribution defined by the population might help. Is there some important reason you want to do this as a simulation? And is the real problem more complicated? If you draw from the distribution 100,000 times with replacement and sum the results, per the Central Limit Theorem you'll get something very close to a normal distribution whose parameters you can determine analytically. There is probably also some statistics formula to find the precise error. So you can replace the 100,000 draws with a single draw. -- http://mail.python.org/mailman/listinfo/python-list
Re: Random Drawing Simulation -- performance issue
Brendon Towle wrote:
> I need to simulate scenarios like the following: "You have a deck of
> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
> replace it, and repeat N times."

Thinking about the problem as drawing samples from a discrete
distribution defined by the population might help.

For example, in SciPy you can define your own discrete random variable:

var = scipy.stats.rv_discrete(name='sample',
                              values=([0,1,2],[3/10.,5/10.,2/10.]))

Then

var.rvs(size=1)

would quickly return an array of "draws".

If you didn't want to install SciPy, but were willing to install NumPy,
then the crux of the algorithm is to generate an entire array of uniform
random variables: numpy.random.rand(count) and then map them through
the inverse cumulative distribution function to generate your samples.
The inverse cumulative distribution function is just a series of steps
whose width depends on the probability distribution.

Thus, the population defines the draw boundaries. To make it easy, just
multiply the uniform random number by the total number of cards and then
the boundaries are on the integers of the number of each kind of card.

Here is an implementation. I find this version to be 2x - 5x faster
depending on how many draws are used.

import numpy as N

def randomDrawing_N(count, population):
    probs = [x[0] for x in population]
    res = [[0, item[1]] for item in population]
    nums = N.random.rand(count)
    cprobs = N.cumsum([0]+probs)
    nums *= cprobs[-1]
    for k, val in enumerate(probs):
        res[k][0] += ((nums > cprobs[k]) & (nums < cprobs[k+1])).sum()
    return res

-Travis Oliphant

--
http://mail.python.org/mailman/listinfo/python-list
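[Editorial note] Travis's inverse-CDF idea also works without NumPy: build the cumulative card counts once, then locate each uniform draw with bisect, which is exactly the step-function inverse he describes. A stdlib-only sketch (mine, not from the thread):

```python
import bisect
import random

def random_drawing_cdf(count, population):
    # population: [[count1, type1], ...]; draw with replacement, count times.
    res = [[0, item[1]] for item in population]
    cum = []                 # cumulative counts, e.g. [3, 8, 10]
    total = 0
    for c, _ in population:
        total += c
        cum.append(total)
    for _ in range(count):
        u = random.random() * total              # uniform over [0, total)
        res[bisect.bisect_right(cum, u)][0] += 1  # invert the step CDF
    return res

random.seed(7)
print(random_drawing_cdf(10000, [[3, 'orange'], [5, 'yellow'], [2, 'blue']]))
```

The bisect call does per-draw what Travis's boolean-mask sums do in bulk; for a handful of categories both are O(count) overall.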
Re: Random Drawing Simulation -- performance issue
Brendon Towle wrote: > I need to simulate scenarios like the following: "You have a deck of > 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card, > replace it, and repeat N times." > > So, I wrote the following code, which works, but it seems quite slow > to me. Can anyone point out some obvious thing that I'm doing > inefficiently? Or, is this more or less as good as it gets? > > For reference, my typical numbers look like this: > >2 <= len(population) <= 7 >4 <= len(mapping) <= 50 >10 <= count <= 100 > > B. > > > #!/usr/bin/env python > > import random > > def randomDrawing(count, population): > """Simulates drawing items from , with > replacement. > population is a list of lists: [[count1, type1], [count2, > type2], ...] > > Typical examples: > >>>randomDrawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']]) > [[28, 'orange'], [57, 'yellow'], [15, 'blue']] > > >>>randomDrawing(10, [[3, 'orange'], [5, 'yellow'], [2, > 'blue']]) > [[29923, 'orange'], [50208, 'yellow'], [19869, 'blue']] > > """ > res = [[0, item[1]] for item in population] > mapping = [] > for i in xrange(len(population)): > mapping.extend([i]*population[i][0]) > for i in xrange(count): > index = random.choice(mapping) > res[index][0] += 1 > return res > > > > -- > Brendon Towle, PhD > Cognitive Scientist > +1-412-690-2442x127 > Carnegie Learning, Inc. > The Cognitive Tutor Company ® > Helping over 375,000 students in 1000 school districts succeed in math. 
I got nearly a 2x speed up with this variant:

def randomDrawing3(count, population):
    res = [[0, item[1]] for item in population]
    mapping = []
    for i in xrange(len(population)):
        mapping.extend([i]*population[i][0])

    n = len(mapping)
    for i in xrange(count):
        index = int(n * random.random())
        res[mapping[index]][0] += 1

    return res

Apparently "int(n * random.random())" is faster than random.choice() or
random.randint()

[EMAIL PROTECTED]:~ $ python -mtimeit -s'import random; n=10' 'int(n * random.random())'
100000 loops, best of 3: 3.22 usec per loop
[EMAIL PROTECTED]:~ $ python -mtimeit -s'import random; n=10' 'random.randint(1, n)'
100000 loops, best of 3: 13 usec per loop
[EMAIL PROTECTED]:~ $ python -mtimeit -s'import random; n=range(10)' 'random.choice(n)'
100000 loops, best of 3: 6.07 usec per loop

(Note that the first and last examples produce values 0..9 while the
middle one produces 1..10)

I don't know for sure, but I think the random, uh, spread or whatever
will be the same for random() as for choice(). If it's important, you
should verify that. ;-)

Peace,
~Simon

--
http://mail.python.org/mailman/listinfo/python-list
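[Editorial note] Simon's closing worry — whether int(n * random.random()) is as uniform as random.choice() — can be checked empirically. A quick seeded tally (my sketch, not from the thread) shows both index generators hit every bin at roughly equal rates:

```python
import random
from collections import Counter

def tally(gen, trials):
    # Count how often each generated index comes up.
    return Counter(gen() for _ in range(trials))

n = 10
random.seed(0)
via_mult = tally(lambda: int(n * random.random()), 100000)
via_choice = tally(lambda: random.choice(range(n)), 100000)

for t in (via_mult, via_choice):
    assert set(t) == set(range(n))                     # every bin is hit
    assert all(9000 < c < 11000 for c in t.values())   # roughly uniform
print(sorted(via_mult.items()))
```

(Both methods map the same underlying random() stream onto the indices, so the distributions agree; CPython's random.choice is itself built on random().)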
Re: Random Drawing Simulation -- performance issue
Brendon Towle wrote: > I need to simulate scenarios like the following: "You have a deck of 3 > orange cards, 5 yellow cards, and 2 blue cards. You draw a card, replace > it, and repeat N times." > > So, I wrote the following code, which works, but it seems quite slow to > me. Can anyone point out some obvious thing that I'm doing > inefficiently? Or, is this more or less as good as it gets? > > For reference, my typical numbers look like this: > > 2 <= len(population) <= 7 > 4 <= len(mapping) <= 50 > 10 <= count <= 100 > > B. > > > #!/usr/bin/env python > > import random > > def randomDrawing(count, population): > """Simulates drawing items from , with replacement. > population is a list of lists: [[count1, type1], [count2, type2], ...] > > Typical examples: > >>>randomDrawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']]) > [[28, 'orange'], [57, 'yellow'], [15, 'blue']] > > >>>randomDrawing(10, [[3, 'orange'], [5, 'yellow'], [2, 'blue']]) > [[29923, 'orange'], [50208, 'yellow'], [19869, 'blue']] > > """ > res = [[0, item[1]] for item in population] > mapping = [] > for i in xrange(len(population)): > mapping.extend([i]*population[i][0]) > for i in xrange(count): > index = random.choice(mapping) > res[index][0] += 1 > return res > > > > --Brendon Towle, PhD > Cognitive Scientist > +1-412-690-2442x127 > Carnegie Learning, Inc. > The Cognitive Tutor Company ® > Helping over 375,000 students in 1000 school districts succeed in math. > > Given the data structure, might it not be faster to generate binomial variates for n-1 types, and set nth count = #draws - sum(other outcomes)? And for a "large" count, could you get by with a normal approximation? If you *do* feel the need for exact binomial, there are short, somewhat sluggish routines in Devroye (1986 - on the web!, pg 525), and Random_binomial1, compiled by Alan Miller(AMrandom.zip0, available, xlated, at http://www.esbconsult.com/. 
Even though they are relatively slow, generating a few of these would seem to be a lot faster than your current approach. Sorry I can't help more - I'm just starting to learn Python, have yet to even get IDLE going. Regards, Dave Braden -- http://mail.python.org/mailman/listinfo/python-list
Random Drawing Simulation -- performance issue
I need to simulate scenarios like the following: "You have a deck of
3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
replace it, and repeat N times."

So, I wrote the following code, which works, but it seems quite slow
to me. Can anyone point out some obvious thing that I'm doing
inefficiently? Or, is this more or less as good as it gets?

For reference, my typical numbers look like this:

   2 <= len(population) <= 7
   4 <= len(mapping) <= 50
   10 <= count <= 100

B.

#!/usr/bin/env python

import random

def randomDrawing(count, population):
    """Simulates drawing items from <population>, with replacement.
    population is a list of lists: [[count1, type1], [count2, type2], ...]

    Typical examples:
    >>> randomDrawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
    [[28, 'orange'], [57, 'yellow'], [15, 'blue']]

    >>> randomDrawing(100000, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
    [[29923, 'orange'], [50208, 'yellow'], [19869, 'blue']]

    """
    res = [[0, item[1]] for item in population]
    mapping = []
    for i in xrange(len(population)):
        mapping.extend([i]*population[i][0])
    for i in xrange(count):
        index = random.choice(mapping)
        res[index][0] += 1
    return res

--
Brendon Towle, PhD
Cognitive Scientist
+1-412-690-2442x127
Carnegie Learning, Inc.
The Cognitive Tutor Company ®
Helping over 375,000 students in 1000 school districts succeed in math.

--
http://mail.python.org/mailman/listinfo/python-list
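[Editorial note] On a modern Python (3.6+) the mapping list above is unnecessary: random.choices accepts weights directly, so the card counts can serve as weights. A sketch of that rewrite (mine, not from the thread — note that xrange and raw_input are gone in Python 3):

```python
import random

def random_drawing(count, population):
    # population: [[count1, type1], ...]; draw with replacement, count times.
    res = [[0, name] for _, name in population]
    weights = [c for c, _ in population]
    for i in random.choices(range(len(population)), weights=weights, k=count):
        res[i][0] += 1
    return res

random.seed(0)
print(random_drawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']]))
```

random.choices does the cumulative-weight bisection internally, essentially the inverse-CDF approach suggested later in the thread.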
Re: Performance issue
Marc 'BlackJack' Rintsch wrote:

> def make_anagram_map(words):
>     anagram_map = dict()
>     for word in imap(lambda w: w.strip().lower(), words):
>         sorted_word = ''.join(sorted(list(word)))
>         anagram_map.setdefault(sorted_word, list()).append(word)
>     return dict(ifilter(lambda x: len(x[1]) > 1, anagram_map.iteritems()))

Or if you're afraid of map and filter like me, you can try:

def make_anagram_map(words):
    anagram_map = {}
    for word in (w.strip().lower() for w in words):
        anagram_map.setdefault(''.join(sorted(word)), []).append(word)
    return dict(sortedword_wordlist
                for sortedword_wordlist in anagram_map.iteritems()
                if len(sortedword_wordlist[1]) > 1)

py> make_anagram_map(['owers', 'pest', 'rowse', 'pets', 'sower', 'step'])
{'epst': ['pest', 'pets', 'step'], 'eorsw': ['owers', 'rowse', 'sower']}

STeVe

--
http://mail.python.org/mailman/listinfo/python-list
Re: Performance issue
On Sat, 02 Apr 2005 10:29:19 -0800, Shalabh Chaturvedi <[EMAIL PROTECTED]> wrote: >Tom Carrick wrote: >> Hi, >> >> In my attempted learning of python, I've decided to recode an old >> anagram solving program I made in C++. The C++ version runs in less >> than a second, while the python takes 30 seconds. I'm not willing to >> think it's just python being slow, so I was hoping someone could find >> a faster way of doing this. Also, I was wondering if there was a more >> builtin, or just nicer way of converting a string to a list (or using >> the sort function on a list) than making a function for it. > > > >Others have already given a good commentary and alternate suggestions. >Here is some more (and some disagreements): > >* Know your data structures (and how long different operations take). >Like don't do del L[0] unless required. This generally comes from >experience (and asking on this list). > >* list(any_sequence_here) will build a list from the sequence. There are >usually easy ways of converting built-in types - the Python docs will >help you here. > >* Try to write code as close to an english description of the problem as >possible. For example 'for word in words:' rather than using counters >and []. This is usually faster, clearer and IMO an important ingredient >of being 'Pythonic'. > >Anyway here's my rendition of your program: > >### >anagram = raw_input("Find anagrams of word: ") >lanagram = list(anagram) >lanagram.sort() >sorted_anagram = ''.join(lanagram).lower() > >f = open('/usr/share/dict/words', 'r') > >found = [] > >for word in f: > word = word.strip('\n') > if len(word)==len(sorted_anagram): > sorted_word = list(word) > sorted_word.sort() > sorted_word = ''.join(sorted_word) > if sorted_word == sorted_anagram: > found.append(word) > >print "Anagrams of %s:" % anagram > >for word in found: > print word >### > >Hopefully it is fast enough. 
>

If doing more than one anagram, I think making a set of anagram
dictionaries (one for each word length) and pickling them in separate
files for later re-use might help. E.g., (untested!!) to make a dict
(by wordlength) of anagram dicts, something like

d = {}
for word in open('/usr/share/dict/words'):
    word = word.strip().lower()
    d.setdefault(len(word), {}).setdefault(''.join(sorted(word)), []).append(word)

Then

for wlen, adic in d.items():
    pickle_file_name = 'anagram_%s' % wlen
    # pickle adic and write it out to the file ...

Then the anagram utility can look for the appropriate pickle file per
word length, (i.e., 'anagram_%s' % len(word.strip())) and just load it
to anadict and print anadict[''.join(sorted(word.strip().lower()))].

That's a sketch. Untested!! Gotta go ;-/

Regards,
Bengt Richter

--
http://mail.python.org/mailman/listinfo/python-list
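[Editorial note] Bengt's "untested" sketch can be made concrete. Here is a self-contained version (my code, not his) that builds the per-length anagram dicts from an in-memory word list and round-trips one of them through pickle, with BytesIO standing in for his per-length files:

```python
import io
import pickle

words = ['pest', 'pets', 'step', 'sower', 'rowse', 'cat', 'act']

# Build a dict keyed by word length, of anagram dicts
# ("sorted letters" -> list of words).
d = {}
for word in words:
    word = word.strip().lower()
    d.setdefault(len(word), {}).setdefault(''.join(sorted(word)), []).append(word)

# Round-trip one length's dict, standing in for a pickle file per length.
buf = io.BytesIO()
pickle.dump(d[4], buf)
buf.seek(0)
anadict = pickle.load(buf)

print(anadict[''.join(sorted('pest'))])
```

In real use you would pickle.dump each d[wlen] to its own file as Bengt suggests, and the lookup tool would only ever load the one file matching the query word's length.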
Re: Performance issue
In <[EMAIL PROTECTED]>, Tom Carrick wrote:

> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds. I'm not willing to
> think it's just python being slow, so I was hoping someone could find
> a faster way of doing this. Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.
>
> The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Here's my attempt which builds an anagram dictionary ("sorted word" ->
list of anagrams) for fast lookup of anagrams::

#!/usr/bin/env python2.4
from itertools import imap, ifilter

WORDS = '/usr/share/dict/words'

def make_anagram_map(words):
    anagram_map = dict()
    for word in imap(lambda w: w.strip().lower(), words):
        sorted_word = ''.join(sorted(list(word)))
        anagram_map.setdefault(sorted_word, list()).append(word)
    return dict(ifilter(lambda x: len(x[1]) > 1, anagram_map.iteritems()))

def main():
    words_file = open(WORDS, 'r')
    anagram_map = make_anagram_map(words_file)
    words_file.close()
    while True:
        word = raw_input('Find anagrams of word (just enter to end): ')
        if not word:
            break
        try:
            print anagram_map[''.join(sorted(list(word.strip().lower())))]
        except KeyError:
            print 'No anagrams found for %r' % word

    # Print all anagrams sorted by number of anagrams.
    # print '\n'.join(map(str, sorted(anagram_map.values(), key=len)))
    # print len(anagram_map)

if __name__ == '__main__':
    main()

Ciao,
Marc 'BlackJack' Rintsch

--
http://mail.python.org/mailman/listinfo/python-list
Re: Performance issue
Tom Carrick wrote:
> Hi,
>
> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds. I'm not willing to
> think it's just python being slow, so I was hoping someone could find
> a faster way of doing this. Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.

Others have already given a good commentary and alternate suggestions.
Here is some more (and some disagreements):

* Know your data structures (and how long different operations take).
Like don't do del L[0] unless required. This generally comes from
experience (and asking on this list).

* list(any_sequence_here) will build a list from the sequence. There are
usually easy ways of converting built-in types - the Python docs will
help you here.

* Try to write code as close to an english description of the problem as
possible. For example 'for word in words:' rather than using counters
and []. This is usually faster, clearer and IMO an important ingredient
of being 'Pythonic'.

Anyway here's my rendition of your program:

###
anagram = raw_input("Find anagrams of word: ")
lanagram = list(anagram)
lanagram.sort()
sorted_anagram = ''.join(lanagram).lower()

f = open('/usr/share/dict/words', 'r')

found = []

for word in f:
    word = word.strip('\n')
    if len(word)==len(sorted_anagram):
        sorted_word = list(word)
        sorted_word.sort()
        sorted_word = ''.join(sorted_word)
        if sorted_word == sorted_anagram:
            found.append(word)

print "Anagrams of %s:" % anagram

for word in found:
    print word
###

Hopefully it is fast enough.

Shalabh

--
http://mail.python.org/mailman/listinfo/python-list
Re: Performance issue
Tom Carrick <[EMAIL PROTECTED]> writes:
> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds.

Indeed, your program can be improved to run about ten times as fast,
which (on my system, with 96274 entries in /usr/share/dict/words) is
below a second. In general you should try to move the loops into C
code, i.e. use built-in functions instead of long 'for' blocks.

Some comments on your version:

> import string
>
> # Need a function to convert a string to a list to be
> # able to use the sort() function
> def string2list(s):
[snipped]

list() achieves the same thing a lot faster.

> words = []

You do not need to initialize 'words' here, as you're overwriting it a
few lines afterwards.

> found = []
>
> anagram = raw_input("Find anagrams of word: ")
>
> f = open('words.txt', 'r')
> file = f.read()
> f.close()
>
> words = file.splitlines()

Try to avoid assigning to the names of built-in functions if you can.
Names like 'file', 'list', 'dict', 'map' etc. are often an obvious
choice, but overwriting them means that you don't "just" know what a
later use refers to.

> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

Unless you *really* have to, don't use comparison functions with
sort(), as they slow the operation considerably. In this (as in most)
cases, a plain sorted_anagram.sort() does the trick, and in version 2.4
you can achieve custom sort orders with the optional 'key' argument.
The sorted() built-in also comes in handy here.

> while words:
>     if len(words[0]) == len(sorted_anagram):
>         wordlist = string2list(words[0])
>         wordlist.sort(lambda x, y: cmp(x, y))
>         sorted_wordlist = wordlist
>         if sorted_anagram == sorted_wordlist:
>             found.append(words[0])
>     del words[0]

Avoid this style of looping at all times! Removing the first element
of a list is O(n), so looping through the whole list as above is
O(n**2). In most cases you should use a for loop:

for word in words:
    # do something

which is O(n) of course. If you do have to loop destructively, pop()
from the end (which is the default) like so:

while words:
    word = words.pop()
    # do something

This is also O(n), because removing the *last* element of a list is
O(1) (amortized; I suppose the implementation will occasionally shrink
the underlying array at linear cost).

> print "Anagrams of " + anagram + ": "
> while found:
>     print found[0] + " "
>     del found[0]

I assume you meant not to print a newline between the words, which
'print' does by default. The best solution in that case is
" ".join(found).

A better version (2.4+ only):

-- 8< -- 8< --
anagram = raw_input("Find anagrams of word: ")
words = open('words.txt', 'r')
sorted_anagram = sorted(anagram.lower())
found = []
for word in words.read().splitlines():
    if len(word) == len(anagram) and sorted(word) == sorted_anagram:
        found.append(word)
print "Anagrams of %s: %s" % (anagram, ' '.join(found))
-- >8 -- >8 --

Interestingly, the length comparison makes quite a difference! I
removed it at first, thinking it was unnecessary. Here are some
timings:

* Your original version (for comparison):

$ time echo stop | python2.4 anagram_slow.py
[...]
real    0m9.090s
user    0m8.790s
sys     0m0.013s

* Your version, but with the O(n**2) loop replaced by an O(n) 'for':

$ time echo stop | python2.4 anagram_forloop.py
[...]
real    0m0.221s
user    0m0.134s
sys     0m0.014s

* My version but with the length comparison removed:

$ time echo stop | python2.4 anagram_no_lencmp.py
[...]
real    0m0.408s
user    0m0.353s
sys     0m0.010s

* My version as above:

$ time echo stop | python2.4 anagram_fast.py
[...]
real    0m0.144s
user    0m0.099s
sys     0m0.008s

Hope that helps :-)

- Thomas

--
If you want to reply by mail, substitute my first and last name for
'foo' and 'bar', respectively, and remove '.invalid'.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Performance issue
"Tom Carrick" <[EMAIL PROTECTED]> schrieb im Newsbeitrag news:[EMAIL PROTECTED] | Hi, | | In my attempted learning of python, I've decided to recode an old | anagram solving program I made in C++. The C++ version runs in less | than a second, while the python takes 30 seconds. I'm not willing to | think it's just python being slow, so I was hoping someone could find | a faster way of doing this. Also, I was wondering if there was a more | builtin, or just nicer way of converting a string to a list (or using | the sort function on a list) than making a function for it. | | The words.txt here is just a copy of FreeBSD's /usr/share/dict/words | | Anyway, the code: | | import string You're importing string, but never use it, so you can omit that line. | | # Need a function to convert a string to a list to be | # able to use the sort() function | def string2list(s): |l = [] |for i in range(0, len(s)): |l.append(s[i]) |return l No need to write your own function. list(s) already does the trick. | | words = [] | found = [] | | anagram = raw_input("Find anagrams of word: ") | | f = open('words.txt', 'r') | file = f.read() | f.close() I don't have a copy of words.txt, but here's what I would try (untested): anagram = raw_input("Find anagrams of word: ") sorted_anagram = list(sorted(anagram.lower())) # If you're Python is pre 2.4 ise # sorted_anagram = list(anagram.lower()) # sorted_anagram.sort() #--sort list in place found = [] # assuming "words.txt" contains a word per line # iterate over the lines of the file for line in open("/path/to/words.txt"): word = line[:-1]# Get rid of trailing newline sorted_word = list(sorted(word.lower())) if sorted_word == sorted_anagram: found.append(word) if found: print "Anagrams of %s:" % anagram for w in found: print w else: print "No anagrams for %s" % anagram -- Vincent Wehren -- http://mail.python.org/mailman/listinfo/python-list
Re: Performance issue
In <[EMAIL PROTECTED]>, Irmen de Jong wrote: >> words = file.splitlines() > > You can obtain this list without reading the file in its entirety, > by using the readlines method of file objects: > > words=open("words.txt").readlines() This leaves the newline characters at the end of each line while `str.splitlines()` removes them. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
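[Editorial note] The difference Marc points out is easy to see side by side. A quick sketch of mine, using an in-memory string in place of the file (str.splitlines(True) keeps line endings the way f.readlines() does):

```python
text = "pest\npets\nstep\n"

# readlines-style: the newline stays on each entry.
with_newlines = text.splitlines(True)   # mimics open(...).readlines()
# splitlines-style: the newlines are stripped.
without = text.splitlines()

print(with_newlines)
print(without)
```

This matters for the anagram program: words carrying a trailing '\n' never compare equal to the sorted letters of the input word unless they are stripped first.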
Re: Performance issue
In <[EMAIL PROTECTED]>, Tom Carrick wrote:

> […] Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.

Use the `list()` builtin on the string and *just* the `sort()` method::

In [2]: characters = list('hello')
In [3]: characters
Out[3]: ['h', 'e', 'l', 'l', 'o']
In [4]: characters.sort()
In [5]: characters
Out[5]: ['e', 'h', 'l', 'l', 'o']

> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

sorted_anagram = list(anagram.lower())
sorted_anagram.sort()

> while words:
>     if len(words[0]) == len(sorted_anagram):
>         wordlist = string2list(words[0])
>         wordlist.sort(lambda x, y: cmp(x, y))
>         sorted_wordlist = wordlist
>         if sorted_anagram == sorted_wordlist:
>             found.append(words[0])
>     del words[0]

And here's the performance issue. Deleting the first element of a list
results in moving all remaining elements one index down. Better iterate
over the words in a for loop::

for word in words:
    # use `word` instead of `words[0]` in the loop body.
    ...

Ciao,
Marc 'BlackJack' Rintsch

--
http://mail.python.org/mailman/listinfo/python-list
Re: Performance issue
Tom Carrick wrote:
> Hi,
>
> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds. I'm not willing to
> think it's just python being slow, so I was hoping someone could find
> a faster way of doing this.

I like your attitude, not thinking that it's just Python that is slow :-)

> Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.

String to list:  list("irmen")    # --> ['i','r','m','e','n']
Sorted list:     sorted("irmen")  # --> ['e', 'i', 'm', 'n', 'r']
(the latter works in Python 2.4+)

> The words.txt here is just a copy of FreeBSD's /usr/share/dict/words
>
> Anyway, the code:
>
> import string
>
> # Need a function to convert a string to a list to be
> # able to use the sort() function
> def string2list(s):
>     l = []
>     for i in range(0, len(s)):
>         l.append(s[i])
>     return l

... see above... just replace string2list(s) with sorted(s)

> words = []
> found = []
>
> anagram = raw_input("Find anagrams of word: ")
>
> f = open('words.txt', 'r')
> file = f.read()
> f.close()

Style: don't use 'file' as a variable name, you're hiding the builtin
'file' function

> words = file.splitlines()

You can obtain this list without reading the file in its entirety,
by using the readlines method of file objects:

words=open("words.txt").readlines()

> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

The lambda is optional and only slows it down :-)
But to get a sorted list of letters, just use sorted(s) if you're on
Python 2.4+

> while words:
>     if len(words[0]) == len(sorted_anagram):
>         wordlist = string2list(words[0])
>         wordlist.sort(lambda x, y: cmp(x, y))
>         sorted_wordlist = wordlist

(same here.. replacing this by sorted(words[0]) probably will speed it
up rather significantly, partly because it avoids the creation of those
temporary lists)

>         if sorted_anagram == sorted_wordlist:
>             found.append(words[0])
>     del words[0]
>
> print "Anagrams of " + anagram + ": "
> while found:
>     print found[0] + " "
>     del found[0]

print " ".join(found)

Cheers

--Irmen de Jong

--
http://mail.python.org/mailman/listinfo/python-list
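[Editorial note] Putting Irmen's suggestions together — sorted() on the string, a plain for loop, and " ".join() for output — the whole lookup collapses to a few lines on a modern Python. A sketch of mine in Python 3 syntax, not code from the thread:

```python
def find_anagrams(target, words):
    # Two words are anagrams iff their sorted letter lists are equal;
    # the length check is a cheap filter before the sort.
    key = sorted(target.lower())
    return [w for w in words
            if len(w) == len(target) and sorted(w.lower()) == key]

words = ['pots', 'stop', 'spot', 'post', 'potsy', 'opts']
print("Anagrams of tops: " + " ".join(find_anagrams('tops', words)))
```

In real use, words would come from open('/usr/share/dict/words').read().split() rather than a literal list.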
Performance issue
Hi,

In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds. I'm not willing to
think it's just python being slow, so I was hoping someone could find
a faster way of doing this. Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Anyway, the code:

import string

# Need a function to convert a string to a list to be
# able to use the sort() function
def string2list(s):
    l = []
    for i in range(0, len(s)):
        l.append(s[i])
    return l

words = []
found = []

anagram = raw_input("Find anagrams of word: ")

f = open('words.txt', 'r')
file = f.read()
f.close()

words = file.splitlines()

sorted_anagram = anagram.lower()
sorted_anagram = string2list(anagram)
sorted_anagram.sort(lambda x, y: cmp(x, y))

while words:
    if len(words[0]) == len(sorted_anagram):
        wordlist = string2list(words[0])
        wordlist.sort(lambda x, y: cmp(x, y))
        sorted_wordlist = wordlist
        if sorted_anagram == sorted_wordlist:
            found.append(words[0])
    del words[0]

print "Anagrams of " + anagram + ": "
while found:
    print found[0] + " "
    del found[0]

--
http://mail.python.org/mailman/listinfo/python-list