Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-24 Thread Barry Scott


> On 8 Oct 2022, at 11:50, Weatherby,Gerard  wrote:
> 
> Logging does support passing a callable, if indirectly. It only calls __str__ 
> on the object passed if debugging is enabled.
> 
> import logging
> 
> class Defer:
> 
>     def __init__(self, fn):
>         self.fn = fn
> 
>     def __str__(self):
>         return self.fn()
> 
> def some_expensive_function():
>     return "hello"
> 
> logging.basicConfig()
> logging.debug(Defer(some_expensive_function))

Oh what a clever hack. Took a few minutes of code reading to see why this works.
You are exploiting the str(msg) call in class LogRecord's getMessage().

```
def getMessage(self):
    """
    Return the message for this LogRecord.

    Return the message for this LogRecord after merging any user-supplied
    arguments with the message.
    """
    msg = str(self.msg)
    if self.args:
        msg = msg % self.args
    return msg
```

Barry


>  
>  
> From: Python-list <python-list-bounces+gweatherby=uchc@python.org> on behalf of
> Barry <ba...@barrys-emacs.org>
> Date: Friday, October 7, 2022 at 1:30 PM
> To: MRAB <pyt...@mrabarnett.plus.com>
> Cc: python-list@python.org
> Subject: Re: Ref-strings in logging messages (was: Performance issue with
> CPython 3.10 + Cython)
> 
> *** Attention: This is an external email. Use caution responding, opening 
> attachments or clicking on links. ***
> 
> > On 7 Oct 2022, at 18:16, MRAB  wrote:
> >
> > On 2022-10-07 16:45, Skip Montanaro wrote:
> >>> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
> >>> 
> >>> wrote:
> >>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
> >>> place in calls to `logging.logger.debug()` and friends, evaluating all
> >>> arguments regardless of whether the logger was enabled or not.
> >>>
> >> I thought there was some discussion about whether and how to efficiently
> >> admit f-strings to the logging package. I'm guessing that's not gone
> >> anywhere (yet).
> > Letting you pass in a callable might help because then you could use a lambda.
> 
> Yep, that’s the obvious way to avoid expensive log data generation.
> Would need logging module to support that use case.
> 
> Barry
> 
> > --
> > https://mail.python.org/mailman/listinfo/python-list
> >
> 
> --
> https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-08 Thread Weatherby,Gerard
Logging does support passing a callable, if indirectly. It only calls __str__ 
on the object passed if debugging is enabled.

import logging

class Defer:

    def __init__(self, fn):
        self.fn = fn

    def __str__(self):
        return self.fn()

def some_expensive_function():
    return "hello"

logging.basicConfig()
logging.debug(Defer(some_expensive_function))


From: Python-list  on 
behalf of Barry 
Date: Friday, October 7, 2022 at 1:30 PM
To: MRAB 
Cc: python-list@python.org 
Subject: Re: Ref-strings in logging messages (was: Performance issue with 
CPython 3.10 + Cython)

> On 7 Oct 2022, at 18:16, MRAB  wrote:
>
> On 2022-10-07 16:45, Skip Montanaro wrote:
>>> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
>>> wrote:
>>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
>>> place in calls to `logging.logger.debug()` and friends, evaluating all
>>> arguments regardless of whether the logger was enabled or not.
>>>
>> I thought there was some discussion about whether and how to efficiently
>> admit f-strings to the logging package. I'm guessing that's not gone
>> anywhere (yet).
> Letting you pass in a callable might help because then you could use a lambda.

Yep, that’s the obvious way to avoid expensive log data generation.
Would need logging module to support that use case.

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Julian Smith
On Fri, 7 Oct 2022 18:28:06 +0100
Barry  wrote:

> > On 7 Oct 2022, at 18:16, MRAB  wrote:
> > 
> > On 2022-10-07 16:45, Skip Montanaro wrote:  
> >>> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
> >>> 
> >>> wrote:
> >>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
> >>> place in calls to `logging.logger.debug()` and friends, evaluating all
> >>> arguments regardless of whether the logger was enabled or not.
> >>>   
> >> I thought there was some discussion about whether and how to efficiently
> >> admit f-strings to the logging package. I'm guessing that's not gone
> >> anywhere (yet).  
> > Letting you pass in a callable might help because then you could use a lambda.
> 
> Yep, that’s the obvious way to avoid expensive log data generation.
> Would need logging module to support that use case.

I have some logging code that uses eval() to evaluate expressions using
locals and globals in a parent stack frame, together with a parser to
find `{...}` items in a string.

I guess this constitutes a (basic) runtime implementation of f-strings.
As such it can avoid expensive evaluation/parsing when disabled, though
it's probably slow when enabled compared to native f-strings. It seems
to work quite well in practice, and also allows one to add some extra
formatting features.

For details see:


https://git.ghostscript.com/?p=mupdf.git;a=blob;f=scripts/jlib.py;h=e85e9f2c4;hb=HEAD#l41
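For readers who want the flavour of the idea without reading jlib.py, here is a minimal sketch of a runtime `{...}`-evaluator (the names `lazy_log` and its signature are mine, not the jlib.py API): expressions are evaluated with the caller's locals/globals, and all parsing is skipped when disabled.

```python
import inspect
import re

def lazy_log(enabled, fmt):
    """Evaluate {...} expressions in fmt using the caller's frame,
    but only when logging is enabled."""
    if not enabled:
        return None  # no parsing or evaluation cost when disabled
    frame = inspect.currentframe().f_back  # the caller's stack frame
    def repl(match):
        expr = match.group(1)
        return str(eval(expr, frame.f_globals, frame.f_locals))
    return re.sub(r"\{([^{}]+)\}", repl, fmt)

x = 6
print(lazy_log(True, "x squared is {x * x}"))  # → x squared is 36
```

Compared to native f-strings this is slow when enabled (a regex pass plus `eval` per field), but it costs almost nothing when disabled, which is the point.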

- Jules

-- 
http://op59.net
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Barry

> On 7 Oct 2022, at 19:09, Weatherby,Gerard  wrote:
> The obvious way to avoid log generation is:
> 
> if logger.isEnabledFor(logging.DEBUG):
>     logger.debug(expensive_processing())
> 
> 
> Of course, having logging alter program flow could lead to hard-to-debug bugs.

Altered flow is less of an issue than the verbosity of the above.
We discussed ways to improve this pattern a few years ago.
That led to no changes.

What I have used is a class that defines __bool__ to report if logging is 
enabled and __call__ to log. Then you can do this:

log_debug = logger_from(DEBUG)

log_debug and log_debug('expensive %s' % (complex(),))
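A sketch of such a class might look like this (a reconstruction of the pattern described, not the actual `logger_from` implementation; the class name is mine):

```python
import logging

class LevelLogger:
    """Truthy only when its level is enabled; callable to log at that level."""
    def __init__(self, logger, level):
        self._logger = logger
        self._level = level

    def __bool__(self):
        return self._logger.isEnabledFor(self._level)

    def __call__(self, msg, *args):
        self._logger.log(self._level, msg, *args)

logging.basicConfig(level=logging.INFO)
log_debug = LevelLogger(logging.getLogger(), logging.DEBUG)

# Short-circuit: with DEBUG disabled, the right-hand side never runs.
log_debug and log_debug("expensive %s", sum(range(10**6)))
```

The `and` short-circuits on the falsy `__bool__`, so the expensive argument expression is never evaluated when the level is disabled.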

Barry

> 
> From: Python-list  on 
> behalf of Barry 
> Date: Friday, October 7, 2022 at 1:30 PM
> To: MRAB 
> Cc: python-list@python.org 
> Subject: Re: Ref-strings in logging messages (was: Performance issue with 
> CPython 3.10 + Cython)
> 
>> On 7 Oct 2022, at 18:16, MRAB  wrote:
>> 
>> On 2022-10-07 16:45, Skip Montanaro wrote:
>>>> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
>>>> wrote:
>>>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
>>>> place in calls to `logging.logger.debug()` and friends, evaluating all
>>>> arguments regardless of whether the logger was enabled or not.
>>> I thought there was some discussion about whether and how to efficiently
>>> admit f-strings to the logging package. I'm guessing that's not gone
>>> anywhere (yet).
>> Letting you pass in a callable might help because then you could use a lambda.
> 
> Yep, that’s the obvious way to avoid expensive log data generation.
> Would need logging module to support that use case.
> 
> Barry
> 
>> --
>> https://mail.python.org/mailman/listinfo/python-list
> 
> --
> https://mail.python.org/mailman/listinfo/python-list
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Weatherby,Gerard
The obvious way to avoid log generation is:

if logger.isEnabledFor(logging.DEBUG):
    logger.debug(expensive_processing())


Of course, having logging alter program flow could lead to hard-to-debug bugs.

From: Python-list  on 
behalf of Barry 
Date: Friday, October 7, 2022 at 1:30 PM
To: MRAB 
Cc: python-list@python.org 
Subject: Re: Ref-strings in logging messages (was: Performance issue with 
CPython 3.10 + Cython)

> On 7 Oct 2022, at 18:16, MRAB  wrote:
>
> On 2022-10-07 16:45, Skip Montanaro wrote:
>>> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
>>> wrote:
>>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
>>> place in calls to `logging.logger.debug()` and friends, evaluating all
>>> arguments regardless of whether the logger was enabled or not.
>>>
>> I thought there was some discussion about whether and how to efficiently
>> admit f-strings to the logging package. I'm guessing that's not gone
>> anywhere (yet).
> Letting you pass in a callable might help because then you could use a lambda.

Yep, that’s the obvious way to avoid expensive log data generation.
Would need logging module to support that use case.

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Barry


> On 7 Oct 2022, at 18:16, MRAB  wrote:
> 
> On 2022-10-07 16:45, Skip Montanaro wrote:
>>> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
>>> wrote:
>>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
>>> place in calls to `logging.logger.debug()` and friends, evaluating all
>>> arguments regardless of whether the logger was enabled or not.
>>> 
>> I thought there was some discussion about whether and how to efficiently
>> admit f-strings to the logging package. I'm guessing that's not gone
>> anywhere (yet).
> Letting you pass in a callable might help because then you could use a lambda.

Yep, that’s the obvious way to avoid expensive log data generation.
Would need logging module to support that use case.

Barry

> -- 
> https://mail.python.org/mailman/listinfo/python-list
> 

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread MRAB

On 2022-10-07 16:45, Skip Montanaro wrote:

On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
wrote:


1. The culprit was me. As lazy as I am, I have used f-strings all over the
place in calls to `logging.logger.debug()` and friends, evaluating all
arguments regardless of whether the logger was enabled or not.



I thought there was some discussion about whether and how to efficiently
admit f-strings to the logging package. I'm guessing that's not gone
anywhere (yet).

Letting you pass in a callable might help because then you could use a lambda.

--
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Barry


> On 7 Oct 2022, at 16:48, Skip Montanaro  wrote:
> 
> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
> wrote:
> 
>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
>> place in calls to `logging.logger.debug()` and friends, evaluating all
>> arguments regardless of whether the logger was enabled or not.
>> 
> 
> I thought there was some discussion about whether and how to efficiently
> admit f-strings to the logging package. I'm guessing that's not gone
> anywhere (yet).

That cannot be done as the f-string is computed before the log call.

Maybe you are thinking of the lazy-expression idea for this. That idea
seems to have gone nowhere, as it's not clear how to implement it without
performance issues.

Barry

> 
> Skip
> -- 
> https://mail.python.org/mailman/listinfo/python-list
> 

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Skip Montanaro
Dang autocorrect. Subject first word was supposed to be "f-strings" not
"ref-strings." Sorry about that.

S

On Fri, Oct 7, 2022, 10:45 AM Skip Montanaro 
wrote:

>
>
> On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
> wrote:
>
>> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
>> place in calls to `logging.logger.debug()` and friends, evaluating all
>> arguments regardless of whether the logger was enabled or not.
>>
>
> I thought there was some discussion about whether and how to efficiently
> admit f-strings to the logging package. I'm guessing that's not gone
> anywhere (yet).
>
> Skip
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Ref-strings in logging messages (was: Performance issue with CPython 3.10 + Cython)

2022-10-07 Thread Skip Montanaro
On Fri, Oct 7, 2022 at 9:42 AM Andreas Ames 
wrote:

> 1. The culprit was me. As lazy as I am, I have used f-strings all over the
> place in calls to `logging.logger.debug()` and friends, evaluating all
> arguments regardless of whether the logger was enabled or not.
>

I thought there was some discussion about whether and how to efficiently
admit f-strings to the logging package. I'm guessing that's not gone
anywhere (yet).

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Performance issue with CPython 3.10 + Cython

2022-10-07 Thread Andreas Ames
Answering myself, just for the record:

1. The culprit was me. As lazy as I am, I have used f-strings all over the
place in calls to `logging.logger.debug()` and friends, evaluating all
arguments regardless of whether the logger was enabled or not.  Replacing
these f-strings with regular printf-like format strings solved the issue.
Now the application bowls happily along, consistently below 0.02 seconds
per second application time.
2. Valgrind + callgrind is an awesome toolchain to spot performance issues,
even on VMs.
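For the record, the difference is that an f-string renders its arguments before `debug()` is even called, while %-style arguments are stored on the LogRecord and only interpolated if the record is actually emitted. A small demonstration (the `Probe` class is mine, added for illustration):

```python
import logging

logging.basicConfig(level=logging.INFO)  # DEBUG is disabled
log = logging.getLogger(__name__)

class Probe:
    """Counts how often it is rendered to text."""
    renders = 0
    def __str__(self):
        Probe.renders += 1
        return "probe"

p = Probe()
log.debug(f"value: {p}")   # f-string formats p eagerly; record then dropped
log.debug("value: %s", p)  # argument stored unformatted; record dropped
print(Probe.renders)       # → 1 (only the f-string call paid the cost)
```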


On Tue, 4 Oct 2022 at 11:05, Andreas Ames <
andreas.0815.qwe...@gmail.com> wrote:

> Hi all,
>
> I am wrapping an embedded application (, which does not use any dynamic
> memory management,) using Cython to call it from CPython.  The wrapped
> application uses a cyclic executive, i.e. everything is done in the
> input-logic-output design, typical for some real-time related domains.
> Consequentially, every application cycle executes more or less the very
> same code.  As I am still in a prototyping stadium, the wrapped process is
> completely CPU-bound, i.e. except of some logging output there is no I/O
> whatsoever.
>
> During one second of "application time", I am doing exactly 120 calls into
> the application through three Cython-wrapped API functions.  The
> application uses some platform-dependent APIs, which I have also wrapped
> with Cython, so that there are numerous callbacks into the Python realm per
> call into the application. What I am observing now, is that the performance
> per "application second" decreases (remember: executing code that does the
> same thing on every cycle) and extending the number of loop iterations does
> not seem to cause any bound to this decrease.  In the log output below, you
> can see the GC counts, which look innocent to me.  The "real time" is
> measured using "time.time()". The "top" utility does not suggest any memory
> leak either.  I am developing on WSL2, but I have checked that this
> performance effect also happens on physical machines.  Right now, I am
> staring at "kcachegrind", but I have no idea, how to determine time series
> for the performance of functions (I am not looking for those functions,
> which need the most time, but for those, which consume more and more
> execution time).
>
> One more thing to look for could be memory fragmentation, but before that
> I would like to ask the experts here for their ideas and experiences and/or
> for tools, which could help to find the culprit.
>
> 2022-10-04 10:40:50|INFO|__main__   |Execution loop 0 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.06862711906433105 seconds real time.
>> 2022-10-04 10:40:51|INFO|__main__   |Execution loop 100 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.08224177360534668 seconds real time.
>> 2022-10-04 10:40:52|INFO|__main__   |Execution loop 200 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.08225250244140625 seconds real time.
>> 2022-10-04 10:40:53|INFO|__main__   |Execution loop 300 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.10176873207092285 seconds real time.
>> 2022-10-04 10:40:54|INFO|__main__   |Execution loop 400 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.10900592803955078 seconds real time.
>> 2022-10-04 10:40:55|INFO|__main__   |Execution loop 500 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.12233948707580566 seconds real time.
>> 2022-10-04 10:40:56|INFO|__main__   |Execution loop 600 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.14058256149291992 seconds real time.
>> 2022-10-04 10:40:58|INFO|__main__   |Execution loop 700 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.14777183532714844 seconds real time.
>> 2022-10-04 10:40:59|INFO|__main__   |Execution loop 800 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.15729451179504395 seconds real time.
>> 2022-10-04 10:41:01|INFO|__main__   |Execution loop 900 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.17365813255310059 seconds real time.
>> 2022-10-04 10:41:03|INFO|__main__   |Execution loop 1000 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.17772984504699707 seconds real time.
>> 2022-10-04 10:41:05|INFO|__main__   |Execution loop 1100 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.1955263614654541 seconds real time.
>> 2022-10-04 10:41:07|INFO|__main__   |Execution loop 1200 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.20046710968017578 seconds real time.
>> 2022-10-04 10:41:09|INFO|__main__   |Executio

Performance issue with CPython 3.10 + Cython

2022-10-04 Thread Andreas Ames
Hi all,

I am wrapping an embedded application (, which does not use any dynamic
memory management,) using Cython to call it from CPython.  The wrapped
application uses a cyclic executive, i.e. everything is done in the
input-logic-output design, typical for some real-time related domains.
Consequentially, every application cycle executes more or less the very
same code.  As I am still in a prototyping stadium, the wrapped process is
completely CPU-bound, i.e. except of some logging output there is no I/O
whatsoever.

During one second of "application time", I am doing exactly 120 calls into
the application through three Cython-wrapped API functions.  The
application uses some platform-dependent APIs, which I have also wrapped
with Cython, so that there are numerous callbacks into the Python realm per
call into the application. What I am observing now, is that the performance
per "application second" decreases (remember: executing code that does the
same thing on every cycle) and extending the number of loop iterations does
not seem to cause any bound to this decrease.  In the log output below, you
can see the GC counts, which look innocent to me.  The "real time" is
measured using "time.time()". The "top" utility does not suggest any memory
leak either.  I am developing on WSL2, but I have checked that this
performance effect also happens on physical machines.  Right now, I am
staring at "kcachegrind", but I have no idea how to determine time series
for the performance of functions (I am not looking for those functions,
which need the most time, but for those, which consume more and more
execution time).

One more thing to look for could be memory fragmentation, but before that I
would like to ask the experts here for their ideas and experiences and/or
for tools, which could help to find the culprit.
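For reference, a minimal sketch of how such a measurement loop could be structured (`run_cycle` is a placeholder I've introduced for the three wrapped API calls; `time.perf_counter()` is generally preferable to `time.time()` for interval timing):

```python
import gc
import time

def run_cycle():
    """Placeholder for one cyclic-executive step through the wrapped API."""
    pass

for loop in range(3):
    start = time.perf_counter()   # monotonic; better than time.time() here
    for _ in range(120):          # 120 calls per second of application time
        run_cycle()
    elapsed = time.perf_counter() - start
    print(f"Execution loop {loop} done. GC counts = {gc.get_count()}; "
          f"1 second of application time corresponds to {elapsed} "
          f"seconds real time.")
```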

2022-10-04 10:40:50|INFO|__main__   |Execution loop 0 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.06862711906433105 seconds real time.
> 2022-10-04 10:40:51|INFO|__main__   |Execution loop 100 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.08224177360534668 seconds real time.
> 2022-10-04 10:40:52|INFO|__main__   |Execution loop 200 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.08225250244140625 seconds real time.
> 2022-10-04 10:40:53|INFO|__main__   |Execution loop 300 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.10176873207092285 seconds real time.
> 2022-10-04 10:40:54|INFO|__main__   |Execution loop 400 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.10900592803955078 seconds real time.
> 2022-10-04 10:40:55|INFO|__main__   |Execution loop 500 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.12233948707580566 seconds real time.
> 2022-10-04 10:40:56|INFO|__main__   |Execution loop 600 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.14058256149291992 seconds real time.
> 2022-10-04 10:40:58|INFO|__main__   |Execution loop 700 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.14777183532714844 seconds real time.
> 2022-10-04 10:40:59|INFO|__main__   |Execution loop 800 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.15729451179504395 seconds real time.
> 2022-10-04 10:41:01|INFO|__main__   |Execution loop 900 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.17365813255310059 seconds real time.
> 2022-10-04 10:41:03|INFO|__main__   |Execution loop 1000 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.17772984504699707 seconds real time.
> 2022-10-04 10:41:05|INFO|__main__   |Execution loop 1100 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.1955263614654541 seconds real time.
> 2022-10-04 10:41:07|INFO|__main__   |Execution loop 1200 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.20046710968017578 seconds real time.
> 2022-10-04 10:41:09|INFO|__main__   |Execution loop 1300 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.22513842582702637 seconds real time.
> 2022-10-04 10:41:11|INFO|__main__   |Execution loop 1400 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.23578548431396484 seconds real time.
> 2022-10-04 10:41:13|INFO|__main__   |Execution loop 1500 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.24581527709960938 seconds real time.
> 2022-10-04 10:41:16|INFO|__main__   |Execution loop 1600 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.2541334629058838 seconds real time.
> 2022-10-04 10:41:19|INFO|__mai

Re: A performance issue when using default value

2010-01-31 Thread Steven D'Aprano
On Sun, 31 Jan 2010 20:58:50 -0800, keakon wrote:

> I've found a strange performance issue when using a default value; the test
> code is listed below:
> 
> from timeit import Timer
> 
> def f(x):
>   y = x
>   y.append(1)
>   return y
> 
> def g(x=[]):
>   y = []
>   y.append(1)
>   return y
> 
> def h(x=[]):
>   y = x
>   y.append(1)
>   return y
> 
> def f2(x):
>   y = x
>   y.append(1)
>   return y + []
> 
> def g2(x=[]):
>   y = []
>   y.append(1)
>   return y + []
> 
> def h2(x=[]):
>   y = x
>   y.append(1)
>   return y + []
> 
> TIMES = 1
> print Timer('f([])','from __main__ import f, g, h').timeit(TIMES) print
> Timer('g()','from __main__ import f, g, h').timeit(TIMES) print
> Timer('h([])','from __main__ import f, g, h').timeit(TIMES) print
> Timer('h()','from __main__ import f, g, h').timeit(TIMES) print
> Timer('f2([])','from __main__ import f2, g2, h2').timeit(TIMES) print
> Timer('g2()','from __main__ import f2, g2, h2').timeit(TIMES) print
> Timer('h2([])','from __main__ import f2, g2, h2').timeit(TIMES) print
> Timer('h2()','from __main__ import f2, g2, h2').timeit(TIMES)
> 
> 
> I tested it with Python 2.5.4, 2.6.4 and 3.1.1 on Windows XP, and get
> almost the same result:
> 0.00449247041174
> 0.00439608944712
> 0.00455867994396
> 0.00327471787615
> 0.00791581052899
> 0.00684919452053
> 0.00734311204357
> 0.30974942346


You've only run those results once, haven't you?

That is not trustworthy. In a multi-tasking operating system, including 
Windows, some other process might have run for some milliseconds, 
throwing the results off.

More importantly, if you had run it multiple times, you'd see this:


>>> for i in range(7):
... Timer('h2()','from __main__ import f2, g2, h2').timeit(TIMES)
...
0.0190200805664
0.315117120743
0.941411972046
1.56618499756
2.18553495407
2.79832291603
3.44865894318


h2 (and probably h, but I haven't tested it) get slower and slower and 
slower every time you run it.

Why? I'm pretty sure this is in the FAQs, but default values are created 
once, not every time the function is called. So you create a function 
with the default value of []. Call the function, and that list gets 1 
appended to it, so the default value is now [1]. Call the function again, 
and it becomes [1, 1]. And so on, and so on, and so on.

The other functions create a new empty list each time the function is 
called, so you're comparing the time it takes to grow an ever-bigger list 
with the time it takes to create a tiny list.




-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A performance issue when using default value

2010-01-31 Thread alex23
keakon  wrote:
> The default value is mutable and is reused by every call.
> So each call appends 1 to the default value; that's very
> different from C++.

Being different from C++ is one of the many reasons some of us choose
Python ;)

This tends to bite most newcomers, so it's mentioned in the FAQ:
http://www.python.org/doc/faq/general/#id52
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A performance issue when using default value

2010-01-31 Thread keakon
On Feb 1, 1:20 pm, alex23 wrote:
> alex23  wrote:
> > keakon  wrote:
> > > def h2(x=[]):
> > >   y = x
> > >   y.append(1)
> > >   return y + []
>
> > Are you aware that 'y = x' _doesn't_ make a copy of [], that it
> > actually points to the same list as x?
>
> Sorry, I meant to suggest trying the following instead:
>
> def h2(x=None):
>   if x is None: x = []
>   x.append(1)
>   return x + []
>
> It's a common idiom to use None as a sentinel for this situation, esp.
> where you _don't_ want a default mutable object to be reused.

Thank you, I got it.

The default value is mutable and is reused by every call.
So each call appends 1 to the default value; that's very
different from C++.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A performance issue when using default value

2010-01-31 Thread alex23
alex23  wrote:
> keakon  wrote:
> > def h2(x=[]):
> >   y = x
> >   y.append(1)
> >   return y + []
>
> Are you aware that 'y = x' _doesn't_ make a copy of [], that it
> actually points to the same list as x?

Sorry, I meant to suggest trying the following instead:

def h2(x=None):
  if x is None: x = []
  x.append(1)
  return x + []

It's a common idiom to use None as a sentinel for this situation, esp.
where you _don't_ want a default mutable object to be reused.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A performance issue when using default value

2010-01-31 Thread Chris Rebert
On Sun, Jan 31, 2010 at 8:58 PM, keakon  wrote:
> I've found a strange performance issue when using a default value; the
> test code is listed below:
>
> from timeit import Timer
>
> def f(x):
>  y = x
>  y.append(1)
>  return y
>
> def g(x=[]):
>  y = []
>  y.append(1)
>  return y
>
> def h(x=[]):
>  y = x
>  y.append(1)
>  return y
>
> def f2(x):
>  y = x
>  y.append(1)
>  return y + []
>
> def g2(x=[]):
>  y = []
>  y.append(1)
>  return y + []
>
> def h2(x=[]):
>  y = x
>  y.append(1)
>  return y + []
>
> TIMES = 1
> print Timer('f([])','from __main__ import f, g, h').timeit(TIMES)
> print Timer('g()','from __main__ import f, g, h').timeit(TIMES)
> print Timer('h([])','from __main__ import f, g, h').timeit(TIMES)
> print Timer('h()','from __main__ import f, g, h').timeit(TIMES)
> print Timer('f2([])','from __main__ import f2, g2, h2').timeit(TIMES)
> print Timer('g2()','from __main__ import f2, g2, h2').timeit(TIMES)
> print Timer('h2([])','from __main__ import f2, g2, h2').timeit(TIMES)
> print Timer('h2()','from __main__ import f2, g2, h2').timeit(TIMES)
>
>
> I tested it with Python 2.5.4, 2.6.4 and 3.1.1 on Windows XP, and get
> almost the same result:
> 0.00449247041174
> 0.00439608944712
> 0.00455867994396
> 0.00327471787615
> 0.00791581052899
> 0.00684919452053
> 0.00734311204357
> 0.30974942346
>
> h2() is about 42 times slower than h2([]), but h() is a little faster
> than h([]).
>
> If I change TIMES to 2, the other results take 2 times as long as before,
> but h2() takes 4 times as long (about 1.2 sec).
>
> Is there any trick to it?

Are you aware of the following pitfall?:

>>> def foo(a=[]):
... a.append(7)
... return a
>>>
>>> print foo()
[7]
>>> print foo()
[7, 7]
>>> print foo()
[7, 7, 7]

i.e. the default argument is only evaluated once (when the function is
defined) and is then reused every call when the caller doesn't provide
a value.
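
The once-only evaluation is visible on the function object itself: in CPython the defaults live in `func_defaults` (Python 2) or `__defaults__` (Python 3), and mutating the list mutates the stored default:

```python
def foo(a=[]):
    a.append(7)
    return a

print(foo.__defaults__)  # ([],) -- the default, stored on the function
foo()
foo()
print(foo.__defaults__)  # ([7, 7],) -- the stored default itself has mutated
```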

Cheers,
Chris
--
http://blog.rebertia.com


Re: A performance issue when using default value

2010-01-31 Thread alex23
keakon  wrote:
> def h2(x=[]):
>   y = x
>   y.append(1)
>   return y + []

> h2() is about 42 times slower than h2([]), but h() is a little faster
> than h([]).

Are you aware that 'y = x' _doesn't_ make a copy of [], that it
actually points to the same list as x?

My guess is that the slowdown you're seeing is due to the increasing
size of x _per timing iteration_. h2([]) will pass a new list each
time, while h2() will append to the same list _every_ time. The
difference between h & h2 is due to the concatenation of a new list to
the built one: the longer the default list grows, the longer this will
take, as extending a list takes O(k) time, with k being the number of
elements.
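
A small timing sketch (numbers are machine-dependent) shows the asymmetry directly: with the shared default, `y + []` copies a list that grows by one element per call, so total time grows quadratically with the call count, while passing a fresh list stays flat:

```python
import timeit

def h2(x=[]):
    y = x
    y.append(1)
    return y + []   # O(len(y)) copy -- this is the part that grows

N = 10000
fresh = timeit.timeit('h2([])', globals=globals(), number=N)
shared = timeit.timeit('h2()', globals=globals(), number=N)
print('h2([]):', fresh)
print('h2():  ', shared)  # far slower: the default list now holds N extra items
```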


A performance issue when using default value

2010-01-31 Thread keakon
I've found a strange performance issue when using a default value; the
test code is listed below:

from timeit import Timer

def f(x):
  y = x
  y.append(1)
  return y

def g(x=[]):
  y = []
  y.append(1)
  return y

def h(x=[]):
  y = x
  y.append(1)
  return y

def f2(x):
  y = x
  y.append(1)
  return y + []

def g2(x=[]):
  y = []
  y.append(1)
  return y + []

def h2(x=[]):
  y = x
  y.append(1)
  return y + []

TIMES = 1
print Timer('f([])','from __main__ import f, g, h').timeit(TIMES)
print Timer('g()','from __main__ import f, g, h').timeit(TIMES)
print Timer('h([])','from __main__ import f, g, h').timeit(TIMES)
print Timer('h()','from __main__ import f, g, h').timeit(TIMES)
print Timer('f2([])','from __main__ import f2, g2, h2').timeit(TIMES)
print Timer('g2()','from __main__ import f2, g2, h2').timeit(TIMES)
print Timer('h2([])','from __main__ import f2, g2, h2').timeit(TIMES)
print Timer('h2()','from __main__ import f2, g2, h2').timeit(TIMES)


I tested it with Python 2.5.4, 2.6.4 and 3.1.1 on Windows XP, and got
almost the same results:
0.00449247041174
0.00439608944712
0.00455867994396
0.00327471787615
0.00791581052899
0.00684919452053
0.00734311204357
0.30974942346

h2() is about 42 times slower than h2([]), but h() is a little faster
than h([]).

If I change TIMES to 2, the other results take 2 times as long as before,
but h2() takes 4 times as long (about 1.2 sec).

Is there any trick to it?


Re: Strange performance issue

2009-10-06 Thread Dan Stromberg

Ulrich Eckhardt wrote:

> Dan Stromberg wrote:
> > My new version formats an SD card and preallocates some file space in
> > about 3 minutes with "Optimize Performance" selected, and in about 30
> > minutes with "Optimize for Quick Removal" selected.  Needless to say, I
> > don't like the 27 minute penalty much.
>
> For performance, the driver will probably use delayed write operations,
> buffering etc. For quick removal, it will probably make all operations
> atomic, i.e. perform a write operation and not return until the data has
> really reached the SD card.
The driver may delay writes, but I'd think it more likely that the 
filesystem or buffer cache would be doing so.

> > But the old version of the program formats and preallocates in 3 minutes
> > irrespective of whether the drive is in "optimize performance" or
> > "optimize for quick removal".
>
> Somehow, I guess the new version does many small writes while the old one
> doesn't. When optimizing for quick removal, those operations add up,
> otherwise their overhead is negligible.

Nope, same block size.

> > one_meg = 'a'*2**20
>
> That's one mib, not one meg. ;)

You're aware that a lot of people are ignoring the new units?

> > file = open('remove-me', 'w')
>
> Try adding the 'b' flag, too. I wouldn't expect it to affect the IO speed,
> but it is one factor that adds up.

This appears to have wholly accounted for the problem!  Thank you very much.

Adding a "b" to my open sped up the writes by a factor of about 15.

> Otherwise, I looked over your program
> and couldn't find anything that would explain a problem. Just wondering,
> what transfer speed do you get with the two versions? What is the
> theoretical maximum for the SD card?
I don't know what the theoretical max write speeds of the USB bus, card 
reader and card are, but I was getting over 10x faster using the same 
card and card reader on an ancient Thinkpad running Ubuntu.  With the 
"b" specified, the XP system is now faster than the Thinkpad.


Interesting that the "b" would make such a performance difference.  
You'd think it'd just be, in C, looking for newlines and adding a 
carriage return after them into a buffer of potentially double the size, 
then writing as usual.  The data I've been writing so far -has- no newlines.
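
The fix is one character in the open() call. A minimal sketch (the file name and sizes here are illustrative) of the binary-mode version:

```python
import os

one_meg = b'a' * 2**20           # 1 MiB of payload, as bytes

# 'wb' bypasses the text-mode layer; on Windows, text mode would scan
# every buffer for newline translation even when the data has none.
with open('remove-me', 'wb') as f:
    for _ in range(4):
        f.write(one_meg)

size = os.path.getsize('remove-me')
print(size)                      # 4194304
os.unlink('remove-me')
```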


Thanks!



Re: Strange performance issue

2009-10-06 Thread Dan Stromberg

Steven D'Aprano wrote:

> On Mon, 05 Oct 2009 22:31:05 -0700, Dan Stromberg wrote:
>
> > I'm rewriting 3 programs as one program - from Python with Tkinter to
> > Python with pygtk, both on Windows XP.
> >
> > My new version formats an SD card and preallocates some file space in
> > about 3 minutes with "Optimize Performance" selected, and in about 30
> > minutes with "Optimize for Quick Removal" selected.  Needless to say, I
> > don't like the 27 minute penalty much.



> I don't understand what that means. How are you formatting the SD card?

The FormatEx function via ctypes.
> I'm guessing that Optimize for Quick Removal means that every write is
> immediately synced to disk. That will probably be slow, no matter what.
Yes, presumably Optimize for Quick Removal is a write-through cache
(synchronous) and "Optimize for Performance" is a write-back cache
(asynchronous).

> > But the old version of the program formats and preallocates in 3 minutes
> > irrespective of whether the drive is in "optimize performance" or
> > "optimize for quick removal".
>
> I suspect that if there was no performance penalty in the old version, it
> was because you inadvertently were always using "Optimize Performance" no
> matter what.
I'm pretty confident that, unless Windows was lying to me, I did some
tests with Optimize for Performance and some tests without.
> BTW, if you want to pre-allocate some space, and you don't care what is
> in the file, you *may* be able to use file.truncate() to do so quickly.
> Despite the name, it doesn't just truncate files, it can also be used to
> extend them. (At least on POSIX systems.)
On a POSIX system, I'd normally just seek and write a single null to get 
a file with holes.  But this is not only Windows, but FAT32 - and the 
data will later be read without going through the filesystem.  It seems 
prudent to actually write the blocks this time.
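
The two approaches Dan contrasts can be sketched side by side (a minimal sketch; file names and the 8 MiB size are illustrative):

```python
import os

SIZE = 8 * 2**20                      # 8 MiB target

# POSIX-style: seek past the end and write one byte; filesystems with
# sparse-file support may leave holes instead of allocating real blocks.
with open('sparse.bin', 'wb') as f:
    f.seek(SIZE - 1)
    f.write(b'\0')
sparse_size = os.path.getsize('sparse.bin')

# FAT32-style: actually write every block, since FAT32 has no holes and
# the data will later be read without going through the filesystem.
block = b'\0' * 2**20
with open('full.bin', 'wb') as f:
    for _ in range(SIZE // len(block)):
        f.write(block)
full_size = os.path.getsize('full.bin')

print(sparse_size, full_size)         # both report the same logical size
os.unlink('sparse.bin')
os.unlink('full.bin')
```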



Thanks, and see my other message (to be sent momentarily) for an 
apparent solution.

> >>> f = open('empty', 'w')
> >>> f.seek(1000)
> >>> f.truncate()
> >>> f.close()
> >>> f = open('empty', 'r')
> >>> len(f.read())
> 1000
> >>> f.close()





  




Re: Strange performance issue

2009-10-06 Thread Ulrich Eckhardt
Dan Stromberg wrote:
> My new version formats an SD card and preallocates some file space in
> about 3 minutes with "Optimize Performance" selected, and in about 30
> minutes with "Optimize for Quick Removal" selected.  Needless to say, I
> don't like the 27 minute penalty much.

For performance, the driver will probably use delayed write operations,
buffering etc. For quick removal, it will probably make all operations
atomic, i.e. perform a write operation and not return until the data has
really reached the SD card.

> But the old version of the program formats and preallocates in 3 minutes
> irrespective of whether the drive is in "optimize performance" or
> "optimize for quick removal".

Somehow, I guess the new version does many small writes while the old one
doesn't. When optimizing for quick removal, those operations add up,
otherwise their overhead is negligible.

> one_meg = 'a'*2**20

That's one mib, not one meg. ;)

> file = open('remove-me', 'w')

Try adding the 'b' flag, too. I wouldn't expect it to affect the IO speed,
but it is one factor that adds up. Otherwise, I looked over your program
and couldn't find anything that would explain a problem. Just wondering,
what transfer speed do you get with the two versions? What is the
theoretical maximum for the SD card?

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



Re: Strange performance issue

2009-10-06 Thread Steven D'Aprano
On Mon, 05 Oct 2009 22:31:05 -0700, Dan Stromberg wrote:

> I'm rewriting 3 programs as one program - from Python with Tkinter to
> Python with pygtk, both on Windows XP.
> 
> My new version formats an SD card and preallocates some file space in
> about 3 minutes with "Optimize Performance" selected, and in about 30
> minutes with "Optimize for Quick Removal" selected.  Needless to say, I
> don't like the 27 minute penalty much.

I don't understand what that means. How are you formatting the SD card?

I'm guessing that Optimize for Quick Removal means that every write is 
immediately synced to disk. That will probably be slow, no matter what.


> But the old version of the program formats and preallocates in 3 minutes
> irrespective of whether the drive is in "optimize performance" or
> "optimize for quick removal".

I suspect that if there was no performance penalty in the old version, it 
was because you inadvertently were always using "Optimize Performance" no 
matter what.

BTW, if you want to pre-allocate some space, and you don't care what is 
in the file, you *may* be able to use file.truncate() to do so quickly. 
Despite the name, it doesn't just truncate files, it can also be used to 
extend them. (At least on POSIX systems.)


>>> f = open('empty', 'w')
>>> f.seek(1000)
>>> f.truncate()
>>> f.close()
>>> f = open('empty', 'r')
>>> len(f.read())
1000
>>> f.close()



-- 
Steven


Strange performance issue

2009-10-05 Thread Dan Stromberg


I'm rewriting 3 programs as one program - from Python with Tkinter to 
Python with pygtk, both on Windows XP.


My new version formats an SD card and preallocates some file space in 
about 3 minutes with "Optimize Performance" selected, and in about 30 
minutes with "Optimize for Quick Removal" selected.  Needless to say, I 
don't like the 27 minute penalty much.


But the old version of the program formats and preallocates in 3 minutes 
irrespective of whether the drive is in "optimize performance" or 
"optimize for quick removal".


I'm baffled.  I've gone over the old code, and found nothing relevant 
looking, and I've never seen anything this puzzling in my years of work 
on *ix.


Can someone please elucidate?

(I've also written a pair of minimal programs in Python and C that 
exhibit the same behavior on this XP system - so it's not just my GUI app)


Thanks!

PS: Here's the minimal Python test program I mentioned.  It's dog slow
unless I set the drive to Optimize for Performance:


#!/cygdrive/c/Python26/python

import os
import time

one_meg = 'a'*2**20

t0 = time.time()

i = 0

file = open('remove-me', 'w')
while True:
  i += 1
  file.write(one_meg)
  t1 = time.time()
  bits_written = i*2**20*8
  duration = t1-t0
	print i, str(duration)+'s', bits_written/(duration*1024*1024), 'Mbps'

  if duration > 120:
  break
file.close()

os.unlink('remove-me')



Re: Levenshtein word comparison -performance issue

2009-02-19 Thread S.Selvam Siva
On Sat, Feb 14, 2009 at 3:01 PM, Peter Otten <__pete...@web.de> wrote:

> Gabriel Genellina wrote:
>
> > En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva <
> s.selvams...@gmail.com>
> > escribió:
> >
> >> I need some help.
> >> I tried to find top n(eg. 5) similar words for a given word, from a
> >> dictionary of 50,000 words.
> >> I used python-levenshtein module,and sample code is as follow.
> >>
> >> def foo(searchword):
> >>     disdict={}
> >>     for word in self.dictionary-words:
> >>         distance=Levenshtein.ratio(searchword,word)
> >>         disdict[word]=distance
> >>     """
> >>     sort the disdict dictionary by values in descending order
> >>     """
> >>     similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)
> >>
> >>     return similarwords[:5]
> >
> > You may replace the last steps (sort + slice top 5) by heapq.nlargest -
> at
> > least you won't waste time sorting 49995 irrelevant words...
> > Anyway you should measure the time taken by the first part (Levenshtein),
> > it may be the most demanding. I think there is a C extension for this,
> > should be much faster than pure Python calculations.
> >
>
> [I didn't see the original post]
>
> You can use the distance instead of the ratio and put the words into bins
> of
> the same length. Then if you find enough words with a distance <= 1 in the
> bin with the same length as the search word you can stop looking.
>
> You might be able to generalize this approach to other properties that are
> fast to calculate and guarantee a minimum distance, e. g. set(word).
>
> Peter
>


Thank you all for your response,

[sorry,I was away for a while.]
I used the functools and heapq modules, but that helped only a little;
then I categorized the words by length and compared against a smaller
set (each set 5/4 = 12,500), so now it takes a quarter of the time
compared to the older method.

Further, can I use Thread to achieve parallel comparison? I have little
knowledge of Python threading.
Will the following code achieve parallelism?

thread1 = threading.Thread(target=self.findsimilar,
                           args=("1", searchword, dic_word_set1))
thread2 = threading.Thread(target=self.findsimilar,
                           args=("2", searchword, dic_word_set1))
thread3 = threading.Thread(target=self.findsimilar,
                           args=("3", searchword, dic_word_set1))
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()

I would like to hear suggestion.
Note: The issue is that I am developing a spell checker for my local
language; I may use more than 2.5 lakh words, so I need the best way to
find an alternative wordlist.
-- 
Yours,
S.Selvam


Re: Levenshtein word comparison -performance issue

2009-02-14 Thread Peter Otten
Gabriel Genellina wrote:

> En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva 
> escribió:
> 
>> I need some help.
>> I tried to find top n(eg. 5) similar words for a given word, from a
>> dictionary of 50,000 words.
>> I used python-levenshtein module,and sample code is as follow.
>>
>> def foo(searchword):
>>     disdict={}
>>     for word in self.dictionary-words:
>>         distance=Levenshtein.ratio(searchword,word)
>>         disdict[word]=distance
>>     """
>>     sort the disdict dictionary by values in descending order
>>     """
>>     similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)
>>
>>     return similarwords[:5]
> 
> You may replace the last steps (sort + slice top 5) by heapq.nlargest - at
> least you won't waste time sorting 49995 irrelevant words...
> Anyway you should measure the time taken by the first part (Levenshtein),
> it may be the most demanding. I think there is a C extension for this,
> should be much faster than pure Python calculations.
> 

[I didn't see the original post]

You can use the distance instead of the ratio and put the words into bins of
the same length. Then if you find enough words with a distance <= 1 in the
bin with the same length as the search word you can stop looking.

You might be able to generalize this approach to other properties that are
fast to calculate and guarantee a minimum distance, e. g. set(word).
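
A pure-Python sketch of the binning idea (the DP edit distance here is a slow stand-in for the C Levenshtein module; word lists and thresholds are illustrative):

```python
from collections import defaultdict

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def close_words(searchword, words, max_dist=1, want=5):
    # Bin by length: |len(a) - len(b)| is a lower bound on the distance,
    # so only bins within max_dist of the search word's length can match.
    bins = defaultdict(list)
    for w in words:
        bins[len(w)].append(w)
    hits = []
    n = len(searchword)
    for length in range(n - max_dist, n + max_dist + 1):
        for w in bins.get(length, ()):
            if levenshtein(searchword, w) <= max_dist:
                hits.append(w)
                if len(hits) >= want:
                    return hits
    return hits

print(close_words('cat', ['cot', 'cart', 'dog', 'cat', 'catalog']))
# -> ['cot', 'cat', 'cart']
```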

Peter


Re: Levenshtein word comparison -performance issue

2009-02-13 Thread Basilisk96
On Feb 13, 5:42 am, "Gabriel Genellina" 
wrote:
> You may replace the last steps (sort + slice top 5) by heapq.nlargest - at  
> least you won't waste time sorting 49995 irrelevant words...
> Anyway you should measure the time taken by the first part (Levenshtein),  
> it may be the most demanding. I think there is a C extension for this,  
> should be much faster than pure Python calculations.

subdist: http://pypi.python.org/pypi/subdist/0.2.1
It uses a modified "fuzzy" version of the Levenshtein algorithm, which
I found more useful than the strict version. The only quirk to it is
that it accepts nothing but unicode. Other than that, it's a keeper.
It is extremely fast.

Cheers,
-Basilisk96


Re: Levenshtein word comparison -performance issue

2009-02-13 Thread Terry Reedy

Gabriel Genellina wrote:
> En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva
>  escribió:
>
>> I need some help.
>> I tried to find top n(eg. 5) similar words for a given word, from a
>> dictionary of 50,000 words.
>> I used python-levenshtein module,and sample code is as follow.
>>
>> def foo(searchword):
>>     disdict={}
>>     for word in self.dictionary-words:
>>         distance=Levenshtein.ratio(searchword,word)
>>         disdict[word]=distance
>>     """
>>     sort the disdict dictionary by values in descending order
>>     """
>>     similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)
>>
>>     return similarwords[:5]
>
> You may replace the last steps (sort + slice top 5) by heapq.nlargest -
> at least you won't waste time sorting 49995 irrelevant words...


There is also no need to build the 5 entry word-distance dictionary.

import heapq, functools

def foo(searchword, n):
  distance = functools.partial(Levenshtein.ratio, searchword)
  return heapq.nlargest(n, words, distance)

If the distances are wanted along with the similar words, I strongly 
suspect that it would be faster to recalculate a small number than to 
generate the dict of 5 pairs.
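
Terry's version runs as written once `words` and a ratio function exist. A self-contained sketch (difflib's ratio standing in for Levenshtein.ratio, which should be faster; the word list is illustrative):

```python
import difflib
import functools
import heapq

def ratio(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def top_similar(searchword, words, n=5):
    similarity = functools.partial(ratio, searchword)
    # nlargest with a key avoids sorting the entire word list
    return heapq.nlargest(n, words, key=similarity)

words = ['hello', 'help', 'shell', 'hollow', 'world', 'yellow']
print(top_similar('hell', words, n=3))
```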


Anyway you should measure the time taken by the first part 
(Levenshtein), it may be the most demanding. I think there is a C 
extension for this, should be much faster than pure Python calculations.


And such could be dropped into the code above.

Terry Jan Reedy




Re: Levenshtein word comparison -performance issue

2009-02-13 Thread Gabriel Genellina
En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva   
escribió:



> I need some help.
> I tried to find top n(eg. 5) similar words for a given word, from a
> dictionary of 50,000 words.
> I used python-levenshtein module,and sample code is as follow.
>
> def foo(searchword):
>     disdict={}
>     for word in self.dictionary-words:
>         distance=Levenshtein.ratio(searchword,word)
>         disdict[word]=distance
>     """
>     sort the disdict dictionary by values in descending order
>     """
>     similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)
>
>     return similarwords[:5]


You may replace the last steps (sort + slice top 5) by heapq.nlargest - at  
least you won't waste time sorting 49995 irrelevant words...
Anyway you should measure the time taken by the first part (Levenshtein),  
it may be the most demanding. I think there is a C extension for this,  
should be much faster than pure Python calculations.


--
Gabriel Genellina



Levenshtein word comparison -performance issue

2009-02-13 Thread S.Selvam Siva
Hi all,

I need some help.
I tried to find the top n (e.g. 5) similar words for a given word, from a
dictionary of 50,000 words.
I used the python-levenshtein module; sample code is as follows.

def foo(searchword):
    disdict={}
    for word in self.dictionary-words:
        distance=Levenshtein.ratio(searchword,word)
        disdict[word]=distance
    """
    sort the disdict dictionary by values in descending order
    """
    similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)

    return similarwords[:5]

foo() takes a search word, compares it against the dictionary of 50,000
words, and assigns each word a value (between 0 and 1).
Then, after sorting in descending order, it returns the top 5 similar words.

The problem is that it *takes a long time* to process (as I need to pass
more search words within a loop); I guess the code could be improved to
work more efficiently. Your suggestions are welcome...
-- 
Yours,
S.Selvam


Re: cgitb performance issue

2008-05-15 Thread Ethan Furman


Gabriel Genellina wrote:

En Wed, 14 May 2008 13:51:40 -0300, Ethan Furman 
<[EMAIL PROTECTED]>  escribió:



Gabriel Genellina wrote:

En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman  
<[EMAIL PROTECTED]> escribió:



I tried adding a form to our website for uploading large files.
Personally, I dislike the forms that tell you you did something wrong
and make you re-enter *all* your data again, so this one cycles and
remembers your answers, and only prompts for the file once the rest of
the entered data is good.  However, the first time the form loads 
it  can

take up to 30 seconds... any ideas why?



Hard to tell without looking at the code... And what has cgitb to 
do  with this?



H... excellent question, and the answer is -- Nothing.  My
apologies.  I'm importing cgi (cgitb was while I was debugging it), as
well as a bunch of others.

Here's the source, with a bunch of the actual html stripped out.  The -u
as well as the last two lines were an attempt to eliminate the 30-second
pause while it loads, as it seems to get all data transferred, then just
waits for a while.  Any ideas appreciated.  My apologies for the 
ugly  code.



*When* do you see the 30-secs delay? After uploading the files? I see  
they're transferred using SMTP, but I think you're not talking about 
the  time it takes to send the mail.
I'd use the old-fashioned "print" statement to see *what* gets 
executed  and how long it takes each step.


The initial load of the page in question is a static html page (done as 
a crutch so there's *something* on screen while the rest of the thing 
loads).  After five seconds the static html page redirects to the python 
cgi script -- and that's where the delay occurs.  I just double checked, 
and it also occurs everytime it has to reload the script.  I'll try the 
print statements and see if I can narrow down where the bottleneck is 
located.  The sending of the file via smtp happens on our server after 
the file has been uploaded, the delays are happening long before that.  
Thanks for your help!

--
Ethan



Re: cgitb performance issue

2008-05-15 Thread Gabriel Genellina
En Wed, 14 May 2008 13:51:40 -0300, Ethan Furman <[EMAIL PROTECTED]>  
escribió:

Gabriel Genellina wrote:
En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman  
<[EMAIL PROTECTED]> escribió:



I tried adding a form to our website for uploading large files.
Personally, I dislike the forms that tell you you did something wrong
and make you re-enter *all* your data again, so this one cycles and
remembers your answers, and only prompts for the file once the rest of
the entered data is good.  However, the first time the form loads it  
can

take up to 30 seconds... any ideas why?


Hard to tell without looking at the code... And what has cgitb to do  
with this?



H... excellent question, and the answer is -- Nothing.  My
apologies.  I'm importing cgi (cgitb was while I was debugging it), as
well as a bunch of others.

Here's the source, with a bunch of the actual html stripped out.  The -u
as well as the last two lines were an attempt to eliminate the 30-second
pause while it loads, as it seems to get all data transferred, then just
waits for a while.  Any ideas appreciated.  My apologies for the ugly  
code.


*When* do you see the 30-secs delay? After uploading the files? I see  
they're transferred using SMTP, but I think you're not talking about the  
time it takes to send the mail.
I'd use the old-fashioned "print" statement to see *what* gets executed  
and how long it takes each step.


--
Gabriel Genellina



Re: cgitb performance issue

2008-05-14 Thread Ethan Furman


Gabriel Genellina wrote:


En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió:

 


I tried adding a form to our website for uploading large files.
Personally, I dislike the forms that tell you you did something wrong
and make you re-enter *all* your data again, so this one cycles and
remembers your answers, and only prompts for the file once the rest of
the entered data is good.  However, the first time the form loads it can
take up to 30 seconds... any ideas why?
   



Hard to tell without looking at the code... And what has cgitb to do with this?
 

H... excellent question, and the answer is -- Nothing.  My 
apologies.  I'm importing cgi (cgitb was while I was debugging it), as 
well as a bunch of others.


Here's the source, with a bunch of the actual html stripped out.  The -u 
as well as the last two lines were an attempt to eliminate the 30-second 
pause while it loads, as it seems to get all data transferred, then just 
waits for a while.  Any ideas appreciated.  My apologies for the ugly code.

--
Ethan


#!/usr/local/bin/python -u

import cgi
#import cgitb; cgitb.enable()
import smtplib
import mimetypes
import os
import sys
from email.Encoders import encode_base64
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.MIMEMultipart import MIMEMultipart

HEADER = """Content-type: text/html\n
  
  """
SELECTFILES = """
  
"""
FILESRECEIVED = """
  
"""
VERIFIED = """
  
"""
FOOTER = """
  
  
"""

REQUIRED = '*required'
form = cgi.FieldStorage()

clientname = clientemail = clientcsr = subject = body = ""
csrs = {"frank":{'html':'selected="yes"', 'name':'', 
'email':''}, \
  "kathy":{'html':'', 'name':'', 'email':'removed>'}, \
  "chris":{'html':'', 'name':'', 'email':'removed>'}, \
  "tracy":{'html':'', 'name':'', 'email':'removed>'}, \
  "susi": {'html':'', 'name':'', 'email':'removed>'}, \
  "bill": {'html':'', 'name':'', 'email':'removed>'}}

incomplete = False
files = []
filelist = []


if form:
  if form.has_key("name"):
  clientname = form["name"].value
  if form.has_key("email"):
  clientemail = form["email"].value
  if form.has_key("CSR"):
  clientcsr = form["CSR"].value
	csrs["frank"]['html'] = ''  # clear default

  csrs[clientcsr]['html'] = 'selected="yes"'
  if form.has_key("subject"):
  subject = form["subject"].value
  if form.has_key("body"):
  body = form["body"].value
  incomplete = not (clientname and clientemail and clientcsr and subject)
if form.has_key("file1"):
  file1 = form["file1"]
  if file1.filename:
  files.append(file1.filename + "")
  filelist.append(file1)
  if form.has_key("file2"):
  file2 = form["file2"]
  if file2.filename:
  files.append(file2.filename + "")
  filelist.append(file2)
  if form.has_key("file3"):
  file3 = form["file3"]
  if file3.filename:
  files.append(file3.filename + "")
  filelist.append(file3)
  if form.has_key("file4"):
  file4 = form["file4"]
  if file4.filename:
  files.append(file4.filename + "")
  filelist.append(file4)
  if form.has_key("file5"):
  file5 = form["file5"]
  if file5.filename:
  files.append(file5.filename + "")
  filelist.append(file5)
def csrform():
  cn_present = ''
  ce_present = ''
  sb_present = ''

  if incomplete:
  if not clientname:
  cn_present = REQUIRED
  if not clientemail:
  ce_present = REQUIRED
  if not subject:
  sb_present = REQUIRED
print """
 
  method="POST" enctype="multipart/form-data">

  
  Please fill in your name, e-mail address, 
subject, and select your CSR.  You will then be able to upload files
  on the next screen.  Any additional notes may go 
in the Message field.

  
  
"""
  print """
  Your namesize="40" maxlength="50" name="name" value="%s">%s""" \

  % (clientname, cn_present)
  print """
  Your e-mailsize="40" maxlength="100" name="email" value="%s">%s""" \

  % (clientemail, ce_present)
  print """
  Subject linesize="40" maxlength="100" name="subject" value="%s">%s""" \

  % (subject, sb_present)
  print """
  Your CSR
  Frank
  Kathy
  Chris
  Tracy
  Susi
  Bill
  """ \
  % (csrs["frank"]['html'], 
csrs["kathy"]['html'], csrs["chris"]['html'], csrs["tracy"]['html'], 
csrs["susi"]['html'], csrs["bill"]['html'])

  print """
  """ % (body)
def uploadform():
  try:
  msg = MIMEMultipart()
  msg['To'] = 

Re: cgitb performance issue

2008-05-11 Thread Gabriel Genellina
En Mon, 05 May 2008 15:56:26 -0300, Ethan Furman <[EMAIL PROTECTED]> escribió:

> I tried adding a form to our website for uploading large files.
> Personally, I dislike the forms that tell you you did something wrong
> and make you re-enter *all* your data again, so this one cycles and
> remembers your answers, and only prompts for the file once the rest of
> the entered data is good.  However, the first time the form loads it can
> take up to 30 seconds... any ideas why?

Hard to tell without looking at the code... And what has cgitb to do with this?

-- 
Gabriel Genellina



cgitb performance issue

2008-05-05 Thread Ethan Furman

Greetings!

I tried adding a form to our website for uploading large files.  
Personally, I dislike the forms that tell you you did something wrong 
and make you re-enter *all* your data again, so this one cycles and 
remembers your answers, and only prompts for the file once the rest of 
the entered data is good.  However, the first time the form loads it can 
take up to 30 seconds... any ideas why?


My server is using Apache 2.0.53 and Python 2.4.1 on a FreeBSD 5.4 system.

Thanks!
--
Ethan

Re: Random Drawing Simulation -- performance issue

2006-09-13 Thread David J. Braden
David J. Braden wrote:
> Travis E. Oliphant wrote:
>> Brendon Towle wrote:
>>> I need to simulate scenarios like the following: "You have a deck of  
>>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,  
>>> replace it, and repeat N times."
>>>
>>
>> Thinking about the problem as drawing samples from a discrete
>> distribution defined by the population might help.
>>
>> For example, in SciPy you can define your own discrete random variable:
>>
>> var = scipy.stats.rv_discrete(name='sample', 
>> values=([0,1,2],[3/10.,5/10.,2/10.]))
>>
>> Then
>>
>> var.rvs(size=1)
>>
>> would quickly return an array of "draws"
>>
>> If you didn't want to install SciPy, but were willing to install 
>> NumPy, then the crux of the algorithm is to generate an entire array 
>> of uniform random variables:  numpy.random.rand(count) and then map 
>> them through the inverse cumulative distribution function to generate 
>> your samples.  The inverse cumulative distribution function is just a 
>> series of steps whose width depends on the probability distribution.
>>
>> Thus, the population defines the draw boundaries.  To make it easy, 
>> just multiply the uniform random number by the total number of cards 
>> and then the boundaries are on the integers of the number of each kind 
>> of card.
>>
>> Here is an implementation.  I find this version to be 2x - 5x faster 
>> depending on how many draws are used.
>>
>>
>> import numpy as N
>>
>> def randomDrawing_N(count, population):
>>     probs = [x[0] for x in population]
>>     res = [[0, item[1]] for item in population]
>>     nums = N.random.rand(count)
>>     cprobs = N.cumsum([0]+probs)
>>     nums *= cprobs[-1]
>>     for k, val in enumerate(probs):
>>         res[k][0] += ((nums > cprobs[k]) & (nums < cprobs[k+1])).sum()
>>     return res
>>
>>
>> -Travis Oliphant
>>
> 
> In response to what Travis and Simon wrote -
> (1) Where the heck can I find a description of scipy's stat functions? 
> Documentation on these seems sparse.
> 
> (2) How does one set up a good timer for Python as implemented for 
> Windows? (i.e., how can I make API calls to Windows from Python?)
> 
> Please bear with me, or not; I am /just/ starting off with Python.
> 
> Thanks,
> Dave Braden

Oops, or rather, duh.
Rereading Robert's post clarified for me where multinomial is, and that 
to get help I need to specify more than simply help(multinomial). 
Sheesh, I'm dim today.

My question re timing code stands.

TIA,
DaveB
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-13 Thread David J. Braden
Travis E. Oliphant wrote:
> Brendon Towle wrote:
>> I need to simulate scenarios like the following: "You have a deck of  
>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,  
>> replace it, and repeat N times."
>>
> 
> Thinking about the problem as drawing samples from a discrete 
> distribution defined by the population might help.
> 
> For example, in SciPy you can define your own discrete random variable:
> 
> var = scipy.stats.rv_discrete(name='sample', 
> values=([0,1,2],[3/10.,5/10.,2/10.]))
> 
> Then
> 
> var.rvs(size=1)
> 
> would quickly return an array of "draws"
> 
> If you didn't want to install SciPy, but were willing to install NumPy, 
> then the crux of the algorithm is to generate an entire array of uniform 
> random variables:  numpy.random.rand(count) and then map them through 
> the inverse cumulative distribution function to generate your samples. 
>  The inverse cumulative distribution function is just a series of steps 
> whose width depends on the probability distribution.
> 
> Thus, the population defines the draw boundaries.  To make it easy, just 
> multiply the uniform random number by the total number of cards and then 
> the boundaries are on the integers of the number of each kind of card.
> 
> Here is an implementation.  I find this version to be 2x - 5x faster 
> depending on how many draws are used.
> 
> 
> import numpy as N
> 
> def randomDrawing_N(count, population):
>     probs = [x[0] for x in population]
>     res = [[0, item[1]] for item in population]
>     nums = N.random.rand(count)
>     cprobs = N.cumsum([0]+probs)
>     nums *= cprobs[-1]
>     for k, val in enumerate(probs):
>         res[k][0] += ((nums > cprobs[k]) & (nums < cprobs[k+1])).sum()
>     return res
> 
> 
> -Travis Oliphant
> 

In response to what Travis and Simon wrote -
(1) Where the heck can I find a description of scipy's stat functions? 
Documentation on these seems sparse.

(2) How does one set up a good timer for Python as implemented for 
Windows? (i.e., how can I make API calls to Windows from Python?)

Please bear with me, or not; I am /just/ starting off with Python.

Thanks,
Dave Braden
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-13 Thread Brendon Towle
On 13 Sep 2006, at 1:01 AM, [EMAIL PROTECTED] wrote:

> Date: 12 Sep 2006 20:17:47 -0700
> From: Paul Rubin <http://[EMAIL PROTECTED]>
> Subject: Re: Random Drawing Simulation -- performance issue
> To: python-list@python.org
>
> "Travis E. Oliphant" <[EMAIL PROTECTED]> writes:
>>> I need to simulate scenarios like the following: "You have a deck of
>>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
>>> replace it, and repeat N times."
>>>
>> Thinking about the problem as drawing samples from a discrete
>> distribution defined by the population might help.
>
> Is there some important reason you want to do this as a simulation?
> And is the real problem more complicated?  If you draw from the
> distribution 100,000 times with replacement and sum the results, per
> the Central Limit Theorem you'll get something very close to a normal
> distribution whose parameters you can determine analytically.  There
> is probably also some statistics formula to find the precise error.
> So you can replace the 100,000 draws with a single draw.

The real problem is not substantially more complicated. (The real  
code is, because it's embedded in a bunch of other stuff, but that's  
not the point.)

I guess the essential reason that I want to do it as a simulation,  
and not as a statistics formula, is that I'd like the code to be  
readable (and modifiable) by a programmer who doesn't have a  
statistics background. I could dredge up enough of my college stats  
to do as you suggest (although I might not enjoy it), but I don't  
think I want to make that a requirement.

On the other hand (quote somewhat snipped):

> Date: Tue, 12 Sep 2006 22:46:04 -0500
> From: Robert Kern <[EMAIL PROTECTED]>
> Subject: Re: Random Drawing Simulation -- performance issue
> To: python-list@python.org
>
> Along the lines of what you're trying to get at, the problem that  
> the OP is
> describing is one of sampling from a multinomial distribution.
>
> numpy has a function that will do the sampling for you:
>
> In [4]: numpy.random.multinomial?
> Docstring:
>  Multinomial distribution.
>
>  multinomial(n, pvals, size=None) -> random values
>
>  pvals is a sequence of probabilities that should sum to 1  
> (however, the
>  last element is always assumed to account for the remaining  
> probability
>  as long as sum(pvals[:-1]) <= 1).

Here, I'm torn. I do want the code to be accessible to non-stats  
people, but this just might do the trick. Must ponder.

Thanks, everyone, for your helpful suggestions!

B.

-- 
Brendon Towle, PhD
Cognitive Scientist
+1-412-690-2442x127
Carnegie Learning, Inc.
The Cognitive Tutor Company ®
Helping over 375,000 students in 1000 school districts succeed in math.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-13 Thread Brendon Towle

On 12 Sep 2006, at 6:33 PM, [EMAIL PROTECTED] wrote:

> Date: 12 Sep 2006 15:23:51 -0700
> From: "Simon Forman" <[EMAIL PROTECTED]>
> Subject: Re: Random Drawing Simulation -- performance issue
>
> Brendon Towle wrote:
>> I need to simulate scenarios like the following: "You have a deck of
>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
>> replace it, and repeat N times."
>>  [my original code snipped]
>
> I got nearly a 2x speed up with this variant:
>
> def randomDrawing3(count, population):
>     res = [[0, item[1]] for item in population]
>     mapping = []
>     for i in xrange(len(population)):
>         mapping.extend([i]*population[i][0])
>
>     n = len(mapping)
>     for i in xrange(count):
>         index = int(n * random.random())
>         res[mapping[index]][0] += 1
>
>     return res

Excellent! For some reason, the speedup I get is only ~1.5x, but  
that's still non-trivial.

Thanks much for the pointer-

B.


-- 
Brendon Towle, PhD
Cognitive Scientist
+1-412-690-2442x127
Carnegie Learning, Inc.
The Cognitive Tutor Company ®
Helping over 375,000 students in 1000 school districts succeed in math.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-12 Thread Robert Kern
Paul Rubin wrote:
> "Travis E. Oliphant" <[EMAIL PROTECTED]> writes:
>>> I need to simulate scenarios like the following: "You have a deck of
>>> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
>>> replace it, and repeat N times."
>>>
>> Thinking about the problem as drawing samples from a discrete
>> distribution defined by the population might help.
> 
> Is there some important reason you want to do this as a simulation?
> And is the real problem more complicated?  If you draw from the
> distribution 100,000 times with replacement and sum the results, per
> the Central Limit Theorem you'll get something very close to a normal
> distribution whose parameters you can determine analytically.  There
> is probably also some statistics formula to find the precise error.
> So you can replace the 100,000 draws with a single draw.

Along the lines of what you're trying to get at, the problem that the OP is 
describing is one of sampling from a multinomial distribution.

   http://en.wikipedia.org/wiki/Multinomial_distribution

numpy has a function that will do the sampling for you:


In [4]: numpy.random.multinomial?
Type:   builtin_function_or_method
Base Class: 
String Form:
Namespace:  Interactive
Docstring:
 Multinomial distribution.

 multinomial(n, pvals, size=None) -> random values

 pvals is a sequence of probabilities that should sum to 1 (however, the
 last element is always assumed to account for the remaining probability
 as long as sum(pvals[:-1]) <= 1).


Sampling from the multinomial distribution is quite simply implemented given a 
binomial sampler. Unfortunately, the standard library's random module does not 
have one. If the number of samples is high enough, then one might be able to 
approximate the binomial distribution with a normal one, but you'd be better 
off just installing numpy.
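
For what it's worth, the one-call version Robert describes looks roughly like 
this (a minimal sketch; it assumes numpy is installed, and the individual 
tallies will of course vary from run to run):

```python
import numpy as np

# One multinomial draw replaces 100,000 individual card draws.
# Deck: 3 orange, 5 yellow, 2 blue -> probabilities 0.3, 0.5, 0.2.
counts = np.random.multinomial(100000, [0.3, 0.5, 0.2])

# counts holds one tally per colour; the tallies always sum to 100,000.
print(dict(zip(['orange', 'yellow', 'blue'], counts)))
```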

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-12 Thread Paul Rubin
"Travis E. Oliphant" <[EMAIL PROTECTED]> writes:
> > I need to simulate scenarios like the following: "You have a deck of
> > 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
> > replace it, and repeat N times."
> >
> Thinking about the problem as drawing samples from a discrete
> distribution defined by the population might help.

Is there some important reason you want to do this as a simulation?
And is the real problem more complicated?  If you draw from the
distribution 100,000 times with replacement and sum the results, per
the Central Limit Theorem you'll get something very close to a normal
distribution whose parameters you can determine analytically.  There
is probably also some statistics formula to find the precise error.
So you can replace the 100,000 draws with a single draw.
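
To make the analytic route concrete: in the card example, each colour's count 
after n draws is Binomial(n, p), which for large n is approximately normal. A 
back-of-the-envelope sketch (not a full error analysis):

```python
# Each colour's count after n draws with replacement is Binomial(n, p),
# approximately Normal(mean=n*p, variance=n*p*(1-p)) for large n.
n = 100000
for p, colour in [(0.3, 'orange'), (0.5, 'yellow'), (0.2, 'blue')]:
    mean = n * p
    sd = (n * p * (1 - p)) ** 0.5
    print('%s: mean %.0f, sd %.1f' % (colour, mean, sd))
```

So a single analytic computation stands in for the whole simulation, at the 
cost of assuming the normal approximation is good enough.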
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-12 Thread Travis E. Oliphant
Brendon Towle wrote:
> I need to simulate scenarios like the following: "You have a deck of  
> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,  
> replace it, and repeat N times."
> 

Thinking about the problem as drawing samples from a discrete 
distribution defined by the population might help.

For example, in SciPy you can define your own discrete random variable:

var = scipy.stats.rv_discrete(name='sample', 
values=([0,1,2],[3/10.,5/10.,2/10.]))

Then

var.rvs(size=1)

would quickly return an array of "draws"

If you didn't want to install SciPy, but were willing to install NumPy, 
then the crux of the algorithm is to generate an entire array of uniform 
random variables:  numpy.random.rand(count) and then map them through 
the inverse cumulative distribution function to generate your samples. 
  The inverse cumulative distribution function is just a series of steps 
whose width depends on the probability distribution.

Thus, the population defines the draw boundaries.  To make it easy, just 
multiply the uniform random number by the total number of cards and then 
the boundaries are on the integers of the number of each kind of card.

Here is an implementation.  I find this version to be 2x - 5x faster 
depending on how many draws are used.


import numpy as N

def randomDrawing_N(count, population):
    probs = [x[0] for x in population]
    res = [[0, item[1]] for item in population]
    nums = N.random.rand(count)
    cprobs = N.cumsum([0]+probs)
    nums *= cprobs[-1]
    for k, val in enumerate(probs):
        res[k][0] += ((nums > cprobs[k]) & (nums < cprobs[k+1])).sum()
    return res


-Travis Oliphant

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-12 Thread Simon Forman
Brendon Towle wrote:
> I need to simulate scenarios like the following: "You have a deck of
> 3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,
> replace it, and repeat N times."
>
> So, I wrote the following code, which works, but it seems quite slow
> to me. Can anyone point out some obvious thing that I'm doing
> inefficiently? Or, is this more or less as good as it gets?
>
> For reference, my typical numbers look like this:
>
>2 <= len(population) <= 7
>4 <= len(mapping) <= 50
>10 <= count <= 100
>
> B.
>
> 
> #!/usr/bin/env python
>
> import random
>
> def randomDrawing(count, population):
>     """Simulates drawing <count> items from <population>, with replacement.
>     population is a list of lists: [[count1, type1], [count2, type2], ...]
>
>     Typical examples:
>     >>> randomDrawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
>     [[28, 'orange'], [57, 'yellow'], [15, 'blue']]
>
>     >>> randomDrawing(100000, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
>     [[29923, 'orange'], [50208, 'yellow'], [19869, 'blue']]
>
>     """
>     res = [[0, item[1]] for item in population]
>     mapping = []
>     for i in xrange(len(population)):
>         mapping.extend([i]*population[i][0])
>     for i in xrange(count):
>         index = random.choice(mapping)
>         res[index][0] += 1
>     return res
>
> 
>
> --
> Brendon Towle, PhD
> Cognitive Scientist
> +1-412-690-2442x127
> Carnegie Learning, Inc.
> The Cognitive Tutor Company ®
> Helping over 375,000 students in 1000 school districts succeed in math.

I got nearly a 2x speed up with this variant:

def randomDrawing3(count, population):
    res = [[0, item[1]] for item in population]
    mapping = []
    for i in xrange(len(population)):
        mapping.extend([i]*population[i][0])

    n = len(mapping)
    for i in xrange(count):
        index = int(n * random.random())
        res[mapping[index]][0] += 1

    return res


Apparently "int(n * random.random())"  is faster than random.choice()
or random.randint()

[EMAIL PROTECTED]:~ $ python -mtimeit -s'import random; n=10' 'int(n *
random.random())'
100000 loops, best of 3: 3.22 usec per loop

[EMAIL PROTECTED]:~ $ python -mtimeit -s'import random; n=10'
'random.randint(1, n)'
100000 loops, best of 3: 13 usec per loop

[EMAIL PROTECTED]:~ $ python -mtimeit -s'import random; n=range(10)'
'random.choice(n)'
100000 loops, best of 3: 6.07 usec per loop

(Note that the first and last examples produce values 0..9 while the
middle one produces 1..10)


I don't know for sure, but I think the random, uh, spread or whatever
will be the same for random() as for choice().  If it's important, you
should verify that.  ;-)
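
One quick way to do that verification: tally a large number of draws of
int(n * random.random()) and check that each of the n values comes up with
roughly equal frequency (a rough sanity check, not a statistical test):

```python
import random
from collections import Counter

random.seed(12345)          # fixed seed so the check is repeatable
n, draws = 5, 100000
counts = Counter(int(n * random.random()) for _ in range(draws))

# every value 0..n-1 appears, each close to draws/n = 20,000
assert sorted(counts) == list(range(n))
assert all(abs(c - draws // n) < 1000 for c in counts.values())
```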

Peace,
~Simon

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Random Drawing Simulation -- performance issue

2006-09-12 Thread David J. Braden
Brendon Towle wrote:
> I need to simulate scenarios like the following: "You have a deck of 3 
> orange cards, 5 yellow cards, and 2 blue cards. You draw a card, replace 
> it, and repeat N times."
> 
> So, I wrote the following code, which works, but it seems quite slow to 
> me. Can anyone point out some obvious thing that I'm doing 
> inefficiently? Or, is this more or less as good as it gets?
> 
> For reference, my typical numbers look like this:
> 
>   2 <= len(population) <= 7
>   4 <= len(mapping) <= 50
>   10 <= count <= 100
> 
> B.
> 
> 
> #!/usr/bin/env python
> 
> import random
> 
> def randomDrawing(count, population):
>     """Simulates drawing <count> items from <population>, with replacement.
>     population is a list of lists: [[count1, type1], [count2, type2], ...]
> 
>     Typical examples:
>     >>> randomDrawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
>     [[28, 'orange'], [57, 'yellow'], [15, 'blue']]
> 
>     >>> randomDrawing(100000, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
>     [[29923, 'orange'], [50208, 'yellow'], [19869, 'blue']]
> 
>     """
>     res = [[0, item[1]] for item in population]
>     mapping = []
>     for i in xrange(len(population)):
>         mapping.extend([i]*population[i][0])
>     for i in xrange(count):
>         index = random.choice(mapping)
>         res[index][0] += 1
>     return res
> 
> 
> 
> --Brendon Towle, PhD
> Cognitive Scientist
> +1-412-690-2442x127
> Carnegie Learning, Inc.
> The Cognitive Tutor Company ®
> Helping over 375,000 students in 1000 school districts succeed in math.
> 
> 

Given the data structure, might it not be faster to generate binomial 
variates for n-1 types, and set nth count = #draws - sum(other 
outcomes)? And for a "large" count, could you get by with a normal 
approximation?  If you *do* feel the need for an exact binomial, there are 
short, somewhat sluggish routines in Devroye (1986, available on the web, pg 
525), and Random_binomial1, compiled by Alan Miller (AMrandom.zip, 
available, translated, at http://www.esbconsult.com/). Even though they are 
relatively slow, generating a few of these would seem to be a lot faster 
than your current approach.
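
A sketch of the decomposition Dave describes, in the OP's data format. The 
standard library has no binomial sampler, so this uses a naive Bernoulli-sum 
stand-in; in practice one would substitute a real generator such as 
numpy.random.binomial or the routines Dave cites:

```python
import random

def binomial(n, p):
    # naive O(n) Bernoulli-sum sampler; a placeholder for a proper
    # binomial generator such as numpy.random.binomial(n, p)
    return sum(random.random() < p for _ in range(n))

def random_drawing_binomial(count, population):
    """Multinomial draw via n-1 conditional binomials; the last
    bucket gets whatever remains, so counts always sum to count."""
    res = []
    remaining = count
    mass = float(sum(c for c, _ in population))
    for c, label in population[:-1]:
        k = binomial(remaining, c / mass) if remaining else 0
        res.append([k, label])
        remaining -= k
        mass -= c
    res.append([remaining, population[-1][1]])
    return res

print(random_drawing_binomial(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']]))
```

With a fast binomial generator this needs only len(population) - 1 variates 
per experiment, regardless of how large count is.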

Sorry I can't help more - I'm just starting to learn Python, have yet to 
even get IDLE going.

Regards,
Dave Braden
-- 
http://mail.python.org/mailman/listinfo/python-list


Random Drawing Simulation -- performance issue

2006-09-12 Thread Brendon Towle
I need to simulate scenarios like the following: "You have a deck of  
3 orange cards, 5 yellow cards, and 2 blue cards. You draw a card,  
replace it, and repeat N times."

So, I wrote the following code, which works, but it seems quite slow  
to me. Can anyone point out some obvious thing that I'm doing  
inefficiently? Or, is this more or less as good as it gets?

For reference, my typical numbers look like this:

   2 <= len(population) <= 7
   4 <= len(mapping) <= 50
   10 <= count <= 100

B.


#!/usr/bin/env python

import random

def randomDrawing(count, population):
    """Simulates drawing <count> items from <population>, with replacement.

    population is a list of lists: [[count1, type1], [count2, type2], ...]

    Typical examples:
    >>> randomDrawing(100, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
    [[28, 'orange'], [57, 'yellow'], [15, 'blue']]

    >>> randomDrawing(100000, [[3, 'orange'], [5, 'yellow'], [2, 'blue']])
    [[29923, 'orange'], [50208, 'yellow'], [19869, 'blue']]

    """
    res = [[0, item[1]] for item in population]
    mapping = []
    for i in xrange(len(population)):
        mapping.extend([i]*population[i][0])
    for i in xrange(count):
        index = random.choice(mapping)
        res[index][0] += 1
    return res



-- 
Brendon Towle, PhD
Cognitive Scientist
+1-412-690-2442x127
Carnegie Learning, Inc.
The Cognitive Tutor Company ®
Helping over 375,000 students in 1000 school districts succeed in math.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Performance issue

2005-04-02 Thread Steven Bethard
Marc 'BlackJack' Rintsch wrote:
def make_anagram_map(words):
    anagram_map = dict()
    for word in imap(lambda w: w.strip().lower(), words):
        sorted_word = ''.join(sorted(list(word)))
        anagram_map.setdefault(sorted_word, list()).append(word)

    return dict(ifilter(lambda x: len(x[1]) > 1, anagram_map.iteritems()))

Or if you're afraid of map and filter like me, you can try:

def make_anagram_map(words):
    anagram_map = {}
    for word in (w.strip().lower() for w in words):
        anagram_map.setdefault(''.join(sorted(word)), []).append(word)
    return dict(sortedword_wordlist
                for sortedword_wordlist in anagram_map.iteritems()
                if len(sortedword_wordlist[1]) > 1)

py> make_anagram_map(['owers', 'pest', 'rowse', 'pets', 'sower', 'step'])
{'epst': ['pest', 'pets', 'step'], 'eorsw': ['owers', 'rowse', 'sower']}
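
As an aside, a later-Python variant of the same idea using
collections.defaultdict, which drops the setdefault calls (a sketch, not in
the original post):

```python
from collections import defaultdict

def make_anagram_map(words):
    # group words by their sorted letters; keep only groups with anagrams
    anagram_map = defaultdict(list)
    for word in (w.strip().lower() for w in words):
        anagram_map[''.join(sorted(word))].append(word)
    return {k: v for k, v in anagram_map.items() if len(v) > 1}

print(make_anagram_map(['owers', 'pest', 'rowse', 'pets', 'sower', 'step']))
```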
STeVe
--
http://mail.python.org/mailman/listinfo/python-list


Re: Performance issue

2005-04-02 Thread Bengt Richter
On Sat, 02 Apr 2005 10:29:19 -0800, Shalabh Chaturvedi <[EMAIL PROTECTED]> 
wrote:

>Tom Carrick wrote:
>> Hi,
>> 
>> In my attempted learning of python, I've decided to recode an old
>> anagram solving program I made in C++. The C++ version runs in less
>> than a second, while the python takes 30 seconds. I'm not willing to
>> think it's just python being slow, so I was hoping someone could find
>> a faster way of doing this. Also, I was wondering if there was a more
>> builtin, or just nicer way of converting a string to a list (or using
>> the sort function on a list) than making a function for it.
>
>
>
>Others have already given a good commentary and alternate suggestions. 
>Here is some more (and some disagreements):
>
>* Know your data structures (and how long different operations take). 
>Like don't do del L[0] unless required. This generally comes from 
>experience (and asking on this list).
>
>* list(any_sequence_here) will build a list from the sequence. There are 
>usually easy ways of converting built-in types - the Python docs will 
>help you here.
>
>* Try to write code as close to an english description of the problem as 
>possible. For example 'for word in words:' rather than using counters 
>and []. This is usually faster, clearer and IMO an important ingredient 
>of being 'Pythonic'.
>
>Anyway here's my rendition of your program:
>
>###
>anagram = raw_input("Find anagrams of word: ")
>lanagram = list(anagram)
>lanagram.sort()
>sorted_anagram = ''.join(lanagram).lower()
>
>f = open('/usr/share/dict/words', 'r')
>
>found = []
>
>for word in f:
>    word = word.strip('\n')
>    if len(word) == len(sorted_anagram):
>        sorted_word = list(word)
>        sorted_word.sort()
>        sorted_word = ''.join(sorted_word)
>        if sorted_word == sorted_anagram:
>            found.append(word)
>
>print "Anagrams of %s:" % anagram
>
>for word in found:
> print word
>###
>
>Hopefully it is fast enough.
>
If doing more than one anagram, I think making a set of anagram
dictionaries (one for each word length) and pickling them in
separate files for later re-use might help.

E.g. (untested!!), to make a dict (by word length) of anagram dicts, 
something like

d = {}
for word in open('/usr/share/dict/words'):
    word = word.strip().lower()
    d.setdefault(len(word), {}).setdefault(''.join(sorted(word)),
                                           []).append(word)

Then
for wlen, adic in d.items():
    pickle_file_name = 'anagram_%s' % wlen
    # pickle adic and write it out to the file
    ...

Then the anagram utility can look for the appropriate pickle file per word 
length (i.e., 'anagram_%s' % len(word.strip())), just load it into anadict, 
and print anadict[''.join(sorted(word.strip().lower()))].

That's a sketch. Untested!! Gotta go ;-/
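
A self-contained round trip of the same sketch (word list and file locations 
are invented for the example; note that modern pickle wants binary file 
modes):

```python
import os
import pickle
import tempfile

# build per-length anagram dicts from a toy word list
words = ['pest', 'pets', 'step', 'owers', 'rowse', 'sower']
d = {}
for word in words:
    w = word.strip().lower()
    d.setdefault(len(w), {}).setdefault(''.join(sorted(w)), []).append(w)

# one pickle file per word length
tmpdir = tempfile.mkdtemp()
for wlen, adic in d.items():
    with open(os.path.join(tmpdir, 'anagram_%s' % wlen), 'wb') as f:
        pickle.dump(adic, f)

# later: load only the file matching the query word's length
query = 'step'
with open(os.path.join(tmpdir, 'anagram_%s' % len(query)), 'rb') as f:
    anadict = pickle.load(f)
print(anadict[''.join(sorted(query))])   # ['pest', 'pets', 'step']
```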

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Performance issue

2005-04-02 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Tom Carrick
wrote:

> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds. I'm not willing to
> think it's just python being slow, so I was hoping someone could find
> a faster way of doing this. Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.
> 
> The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Here's my attempt which builds an anagram dictionary ("sorted word" ->
list of anagrams) for fast lookup of anagrams::

#!/usr/bin/env python2.4
from itertools import imap, ifilter

WORDS = '/usr/share/dict/words'

def make_anagram_map(words):
anagram_map = dict()
for word in imap(lambda w: w.strip().lower(), words):
sorted_word = ''.join(sorted(list(word)))
anagram_map.setdefault(sorted_word, list()).append(word)

return dict(ifilter(lambda x: len(x[1]) > 1, anagram_map.iteritems()))


def main():
words_file = open(WORDS, 'r')
anagram_map = make_anagram_map(words_file)
words_file.close()

while True:
word = raw_input('Find anagrams of word (just enter to end): ')
if not word:
break
try:
print anagram_map[''.join(sorted(list(word.strip().lower(]
except KeyError:
print 'No anagrams found for %r' % word

# # Print all anagrams sorted by number of anagrams.
# print '\n'.join(map(str, sorted(anagram_map.values(), key=len)))
# print len(anagram_map)


if __name__ == '__main__':
main()


Ciao,
Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Performance issue

2005-04-02 Thread Shalabh Chaturvedi
Tom Carrick wrote:
Hi,
In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds. I'm not willing to
think it's just python being slow, so I was hoping someone could find
a faster way of doing this. Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

Others have already given a good commentary and alternate suggestions. 
Here is some more (and some disagreements):

* Know your data structures (and how long different operations take). 
Like don't do del L[0] unless required. This generally comes from 
experience (and asking on this list).

* list(any_sequence_here) will build a list from the sequence. There are 
usually easy ways of converting built-in types - the Python docs will 
help you here.

* Try to write code as close to an english description of the problem as 
possible. For example 'for word in words:' rather than using counters 
and []. This is usually faster, clearer and IMO an important ingredient 
of being 'Pythonic'.

Anyway here's my rendition of your program:
###
anagram = raw_input("Find anagrams of word: ")
lanagram = list(anagram)
lanagram.sort()
sorted_anagram = ''.join(lanagram).lower()

f = open('/usr/share/dict/words', 'r')

found = []

for word in f:
    word = word.strip('\n')
    if len(word) == len(sorted_anagram):
        sorted_word = list(word)
        sorted_word.sort()
        sorted_word = ''.join(sorted_word)
        if sorted_word == sorted_anagram:
            found.append(word)

print "Anagrams of %s:" % anagram

for word in found:
    print word
###
Hopefully it is fast enough.
Shalabh
--
http://mail.python.org/mailman/listinfo/python-list


Re: Performance issue

2005-04-02 Thread Thomas Rast
Tom Carrick <[EMAIL PROTECTED]> writes:

> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds.

Indeed, your program can be improved to run about ten times as fast,
which (on my system, with 96274 entries in /usr/share/dict/words) is
below a second.

In general you should try to move the loops into C code, i.e. use
built-in functions instead of long 'for' blocks.

Some comments on your version:

> import string
>
> # Need a function to convert a string to a list to be
> # able to use the sort() function
> def string2list(s): [snipped]

list() achieves the same thing a lot faster.

> words = []

You do not need to initialize 'words' here, as you're overwriting it a
few lines afterwards.

> found = []
>
> anagram = raw_input("Find anagrams of word: ")
>
> f = open('words.txt', 'r')
> file = f.read()
> f.close()
>
> words = file.splitlines()

Try to avoid assigning to the names of built-in functions if you can.
Names like 'file', 'list', 'dict', 'map' etc. are often an obvious
choice, but overwriting them means that you don't "just" know what a
later use refers to.

> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

Unless you *really* have to, don't use comparison functions with
sort(), as they slow the operation considerably.  In this (as in most)
cases, a plain sorted_anagram.sort() does the trick, and in version
2.4 you can achieve custom sort orders with the optional 'key'
argument.  The sorted() built-in also comes in handy here.
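
A small illustration of the difference (2.4+; the word list here is 
invented):

```python
words = ['pear', 'Banana', 'apple']

# a comparison function forces a Python-level call per comparison;
# a key function is computed once per element and is much faster
words.sort(key=str.lower)        # case-insensitive, no cmp needed
print(words)                     # ['apple', 'Banana', 'pear']

# sorted() returns a new sorted list and accepts any iterable
print(sorted('step'))            # ['e', 'p', 's', 't']
```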

> while words:
> if len(words[0]) == len(sorted_anagram):
> wordlist = string2list(words[0])
> wordlist.sort(lambda x, y: cmp(x, y))
> sorted_wordlist = wordlist
> if sorted_anagram == sorted_wordlist:
> found.append(words[0])
> del words[0]

Avoid this style of looping at all times!  Removing the first element
of a list is O(n), so looping through the whole list as above is
O(n**2).  In most cases you should use a for loop:

for word in words:
  # do something

which is O(n) of course.  If you do have to loop destructively, pop()
from the end (which is the default) like so:

while words:
  word = words.pop()
  # do something

This is also O(n), because removing the *last* element of a list is
O(1) (amortized; I suppose the implementation will occasionally shrink
the underlying array at linear cost).

> print "Anagrams of " + anagram + ": "
> while found:
> print found[0] + " "
> del found[0]

I assume you meant not to print a newline between the words, which
'print' does by default.  The best solution in that case is "
".join(found).

A better version (2.4+ only):

-- 8< -- 8< --
anagram = raw_input("Find anagrams of word: ")

words = open('words.txt', 'r')

sorted_anagram = sorted(anagram.lower())

found = []

for word in words.read().splitlines():
    if len(word) == len(anagram) and sorted(word) == sorted_anagram:
        found.append(word)

print "Anagrams of %s: %s" % (anagram, ' '.join(found))
-- >8 -- >8 --

Interestingly, the length comparison makes quite a difference!  I
removed it at first, thinking it was unnecessary.  Here are some
timings:

* Your original version (for comparison):

  $ time echo stop | python2.4 anagram_slow.py
  [...]
  real0m9.090s
  user0m8.790s
  sys 0m0.013s

* Your version, but with the O(n**2) loop replaced by an O(n) 'for':

  $ time echo stop | python2.4 anagram_forloop.py
  [...]
  real0m0.221s
  user0m0.134s
  sys 0m0.014s

* My version but with the length comparison removed:

  $ time echo stop | python2.4 anagram_no_lencmp.py
  [...]
  real0m0.408s
  user0m0.353s
  sys 0m0.010s

* My version as above:

  $ time echo stop | python2.4 anagram_fast.py
  [...]
  real0m0.144s
  user0m0.099s
  sys 0m0.008s

Hope that helps :-)

- Thomas

-- 
If you want to reply by mail, substitute my first and last name for
'foo' and 'bar', respectively, and remove '.invalid'.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Performance issue

2005-04-02 Thread vincent wehren
"Tom Carrick" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
| Hi,
|
| In my attempted learning of python, I've decided to recode an old
| anagram solving program I made in C++. The C++ version runs in less
| than a second, while the python takes 30 seconds. I'm not willing to
| think it's just python being slow, so I was hoping someone could find
| a faster way of doing this. Also, I was wondering if there was a more
| builtin, or just nicer way of converting a string to a list (or using
| the sort function on a list) than making a function for it.
|
| The words.txt here is just a copy of FreeBSD's /usr/share/dict/words
|
| Anyway, the code:
|
| import string

You're importing string, but never use it, so you can omit that line.

|
| # Need a function to convert a string to a list to be
| # able to use the sort() function
| def string2list(s):
|     l = []
|     for i in range(0, len(s)):
|         l.append(s[i])
|     return l

No need to write your own function. list(s) already does the trick.
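For example (modern Python 3 print shown; list() behaves the same in
2.x):

```python
s = "hello"

# list() splits any iterable, including a string, into its elements.
print(list(s))    # ['h', 'e', 'l', 'l', 'o']

# sorted() accepts the string directly and returns a sorted list.
print(sorted(s))  # ['e', 'h', 'l', 'l', 'o']
```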

|
| words = []
| found = []
|
| anagram = raw_input("Find anagrams of word: ")
|
| f = open('words.txt', 'r')
| file = f.read()
| f.close()


I don't have a copy of words.txt, but here's what I would try
(untested):

anagram = raw_input("Find anagrams of word: ")
sorted_anagram = sorted(anagram.lower())
# If your Python is pre-2.4, use:
# sorted_anagram = list(anagram.lower())
# sorted_anagram.sort()  # sort the list in place


found = []
# assuming "words.txt" contains a word per line
# iterate over the lines of the file

for line in open("/path/to/words.txt"):
    word = line[:-1]  # get rid of trailing newline
    sorted_word = sorted(word.lower())
    if sorted_word == sorted_anagram:
        found.append(word)

if found:
    print "Anagrams of %s:" % anagram
    for w in found:
        print w
else:
    print "No anagrams for %s" % anagram


--

Vincent Wehren






Re: Performance issue

2005-04-02 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Irmen de Jong wrote:

>> words = file.splitlines()
> 
> You can obtain this list without reading the file in its entirety,
> by using the readlines method of file objects:
> 
> words=open("words.txt").readlines()

This leaves the newline characters at the end of each line while
`str.splitlines()` removes them.
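The difference is easy to see with an in-memory file (io.StringIO here
stands in for the real words.txt; Python 3 syntax):

```python
import io

text = "alpha\nbeta\n"  # stand-in for the file contents

# splitlines() strips the line-ending characters...
print(text.splitlines())              # ['alpha', 'beta']

# ...while readlines() keeps them on every line.
print(io.StringIO(text).readlines())  # ['alpha\n', 'beta\n']
```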

Ciao,
Marc 'BlackJack' Rintsch


Re: Performance issue

2005-04-02 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Tom Carrick
wrote:

> [...] Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.

Use the `list()` builtin on the string and *just* the `sort()` method::

  In [2]: characters = list('hello')

  In [3]: characters
  Out[3]: ['h', 'e', 'l', 'l', 'o']

  In [4]: characters.sort()

  In [5]: characters
  Out[5]: ['e', 'h', 'l', 'l', 'o']

> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

sorted_anagram = list(anagram.lower())
sorted_anagram.sort()

> while words:
>     if len(words[0]) == len(sorted_anagram):
>         wordlist = string2list(words[0])
>         wordlist.sort(lambda x, y: cmp(x, y))
>         sorted_wordlist = wordlist
>         if sorted_anagram == sorted_wordlist:
>             found.append(words[0])
>     del words[0]

And here's the performance issue.  Deleting the first element of a list
results in moving all remaining elements one index down.  Better iterate
over the words in a for loop::

  for word in words:
# use `word` instead of `word[0]` in the loop body.
...

Ciao,
Marc 'BlackJack' Rintsch



Re: Performance issue

2005-04-02 Thread Irmen de Jong
Tom Carrick wrote:
> Hi,
> 
> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds. I'm not willing to
> think it's just python being slow, so I was hoping someone could find
> a faster way of doing this.

I like your attitude, not thinking that it's just Python that is slow :-)

> Also, I was wondering if there was a more
> builtin, or just nicer way of converting a string to a list (or using
> the sort function on a list) than making a function for it.

String to list:   list("irmen")  # --> ['i','r','m','e','n']
Sorted list:  sorted("irmen")   # --> ['e', 'i', 'm', 'n', 'r']
(the latter works in Python 2.4+)

> 
> The words.txt here is just a copy of FreeBSD's /usr/share/dict/words
> 
> Anyway, the code:
> 
> import string
> 
> # Need a function to convert a string to a list to be
> # able to use the sort() function
> def string2list(s):
>     l = []
>     for i in range(0, len(s)):
>         l.append(s[i])
>     return l

... see above... just replace string2list(s) with sorted(s)

> 
> words = []
> found = []
> 
> anagram = raw_input("Find anagrams of word: ")
> 
> f = open('words.txt', 'r')
> file = f.read()
> f.close()

Style: don't use 'file' as a variable name, you're hiding the
builtin 'file' function

> 
> words = file.splitlines()

You can obtain this list without reading the file in its entirety,
by using the readlines method of file objects:

words=open("words.txt").readlines()

> 
> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

The lambda is optional and only slows it down :-)
But to get a sorted list of letters, just use sorted(s)
if you're on Python 2.4+

> while words:
>     if len(words[0]) == len(sorted_anagram):
>         wordlist = string2list(words[0])
>         wordlist.sort(lambda x, y: cmp(x, y))
>         sorted_wordlist = wordlist

(same here.. replacing this by sorted(words[0]) probably
will speed it up rather significantly, partly because
it avoids the creation of those temporary lists)

>         if sorted_anagram == sorted_wordlist:
>             found.append(words[0])
>     del words[0]
> 
> print "Anagrams of " + anagram + ": "
> while found:
>     print found[0] + " "
>     del found[0]

print " ".join(found)


Cheers

--Irmen de Jong


Performance issue

2005-04-02 Thread Tom Carrick
Hi,

In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds. I'm not willing to
think it's just python being slow, so I was hoping someone could find
a faster way of doing this. Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Anyway, the code:

import string

# Need a function to convert a string to a list to be
# able to use the sort() function
def string2list(s):
    l = []
    for i in range(0, len(s)):
        l.append(s[i])
    return l

words = []
found = []

anagram = raw_input("Find anagrams of word: ")

f = open('words.txt', 'r')
file = f.read()
f.close()

words = file.splitlines()

sorted_anagram = anagram.lower()
sorted_anagram = string2list(anagram)
sorted_anagram.sort(lambda x, y: cmp(x, y))


while words:
    if len(words[0]) == len(sorted_anagram):
        wordlist = string2list(words[0])
        wordlist.sort(lambda x, y: cmp(x, y))
        sorted_wordlist = wordlist
        if sorted_anagram == sorted_wordlist:
            found.append(words[0])
    del words[0]

print "Anagrams of " + anagram + ": "
while found:
    print found[0] + " "
    del found[0]