Re: Strange Execution Times

2005-05-27 Thread John Machin
Elliot Temple wrote:
> 
> On May 26, 2005, at 3:22 PM, John Machin wrote:
> 
>>
>> Then post your summarised results back to the newsgroup for the  
>> benefit of all -- there's this vague hope that folk actually read  
>> other peoples' posts before firing off questions :-)
> 
> 
> Here is my new version.  It runs in about .65 seconds.  The trick?   
> Reading lines one at a time.  Please let me know if there's any bad  
> coding practices in it!
> 
> 

>     for line in f:
>         start, end = line.find(p1) + adjust, line.find(p2)
>         if end != -1:
>             digest = md5.new(line[start:end]).hexdigest()
>             out.write(line[:start] + digest + line[end:])
>         else:
>             out.write(line)
> 

Hmmm ... simple, elegant *and* runs fast!

Only two minor points:

1. Your code assumes that there can be no more than one password per 
line and that there are no "syntax errors". I'd add at least a comment 
to that effect.

2. The scan for p2 is wasted if the scan for p1 finds nothing. If p1 is 
found, you scan for p2 from the beginning of the line. Depending on the 
average length of a line, etc, that could make a difference. Try this:

for line in f:
    start = line.find(p1)
    if start == -1:
        out.write(line)
    else:
        start += adjust
        end = line.find(p2, start)
        if end == -1:
            raise CannotHappenError
        digest = md5.new(line[start:end]).hexdigest()
        out.write(line[:start] + digest + line[end:])
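
For readers on a current interpreter, the loop above can be sketched in Python 3. This is an illustration under assumptions, not code from the thread: the md5 module has since been replaced by hashlib, the function name hash_passwords is made up, and the <password> / </password> markers are hypothetical stand-ins because the real values of p1 and p2 did not survive in the archive.

```python
import hashlib
import io


def hash_passwords(lines, out, p1="<password>", p2="</password>"):
    """Copy lines to out, replacing the text between p1 and p2 with its MD5 hex digest.

    Assumes at most one p1 ... p2 pair per line, as the loop above does.
    """
    adjust = len(p1)
    for line in lines:
        start = line.find(p1)
        if start == -1:
            out.write(line)          # no marker on this line: copy it unchanged
            continue
        start += adjust              # first character after the opening marker
        end = line.find(p2, start)   # look for the closing marker only from there on
        if end == -1:
            raise ValueError("opening marker without closing marker: %r" % line)
        digest = hashlib.md5(line[start:end].encode("utf-8")).hexdigest()
        out.write(line[:start] + digest + line[end:])


# Hypothetical two-line input held in memory instead of a real file.
sample = io.StringIO("<user>bob</user>\n<password>secret</password>\n")
result = io.StringIO()
hash_passwords(sample, result)
```

Raising an exception for an unmatched marker mirrors the "cannot happen" branch; whether to raise or to copy the line through unchanged is a design choice that depends on how much the input is trusted.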

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Strange Execution Times

2005-05-26 Thread Elliot Temple

On May 26, 2005, at 3:22 PM, John Machin wrote:
>
> Then post your summarised results back to the newsgroup for the  
> benefit of all -- there's this vague hope that folk actually read  
> other peoples' posts before firing off questions :-)

Here is my new version.  It runs in about .65 seconds.  The trick?   
Reading lines one at a time.  Please let me know if there's any bad  
coding practices in it!


def main():

    import md5
    import time

    f = open("data.xml", "rU")
    out = open("out.xml", "w")
    p1 = ""
    p2 = ""
    adjust = len(p1)

    t1 = time.clock()
    for line in f:
        start, end = line.find(p1) + adjust, line.find(p2)
        if end != -1:
            digest = md5.new(line[start:end]).hexdigest()
            out.write(line[:start] + digest + line[end:])
        else:
            out.write(line)

    t2 = time.clock()
    print round(t2-t1, 5)

    f.close()
    out.close()

if __name__ == '__main__': main()


-- Elliot Temple
http://www.curi.us/




Re: Strange Execution Times

2005-05-26 Thread Peter Hansen
John Machin wrote:
> Then post your summarised results back to the newsgroup for the benefit 
> of all -- there's this vague hope that folk actually read other peoples' 
> posts before firing off questions :-)

+1 QOTW

:-)


Re: Strange Execution Times

2005-05-26 Thread John Machin
Elliot Temple wrote:
[copying Elliot's e-mail reply back to the list because it's educational 
  and scarcely private]

> 
> 
> 
> On 5/26/05, John Machin <[EMAIL PROTECTED]> wrote:
>  > [EMAIL PROTECTED] wrote:
>  >
>  > > I am running two functions in a row that do the same thing.
>  >
>  > 1. I see no functions here.
>  >
>  > You should set out a script like this:
>  >
>  > def main():
>  >     your_code_goes_here()
>  >
>  > if __name__ == '__main__':
>  >     main()
>  >
>  > for two reasons (a) your code will be referring to locals instead of
>  > globals; this is faster, which might appeal to you (b) if somebody
>  > accidentally imports the script, nothing happens.
> 
> Oops, I meant code blocks not functions.  Good advice, thanks.
> 
>  > 2. The two loops to which you refer do *not* do the same thing; see
>  > later.
> 
> 
> 
>  > General questions: what platform? what version of Python? how  large is
>  > the file? how much free memory do you have? how many passwords are
>  > there? what is the average length of a password?
> 
> OS X 10.4.1, Python 2.3.5 (I wonder why they bundled an old
> version..).  The file is 4 megs, about 8000 passwords.  I have 375 megs
> of RAM free.  The passwords are mostly about 5-6 chars long.

Huh-uh -- evidently (from what you said later) a *GUESS* on the password 
size; measurement on the actual file that you were using would have 
given the answer "Oops, mean = 32, standard dev = 0".
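
A measurement of that kind takes only a few lines. The following is a rough Python 3 sketch, not something posted in the thread: the function name password_lengths and the <password> markers are hypothetical, and the sample text is made up to imitate a file whose passwords have already been hashed.

```python
import statistics


def password_lengths(text, p1="<password>", p2="</password>"):
    """Return the length of every span found between p1 and p2 in text."""
    lengths = []
    pos = 0
    while True:
        start = text.find(p1, pos)
        if start == -1:
            break
        start += len(p1)
        end = text.find(p2, start)
        if end == -1:
            break                     # unterminated marker: stop scanning
        lengths.append(end - start)
        pos = end + len(p2)
    return lengths


# Hypothetical data: two already-hashed passwords, 32 hex characters each.
text = ("<password>" + "a" * 32 + "</password>\n"
        "<password>" + "b" * 32 + "</password>\n")
lengths = password_lengths(text)
mean = statistics.mean(lengths)       # average password length
spread = statistics.pstdev(lengths)   # population standard deviation
```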

> 
> 
>  > Ignoring the superficial-but-meaningless differences (i vs j, md5
>  > [aarrgghh!!] vs m, jo vs join), these two loops differ in the
>  > following respects:
> 
> Sorry, I wrote a nicer version of the program with things named well,  
> but it was only getting the fast time, so I copied it into the old  
> version of the program and then I had to write join=jo etc to avoid  
> changing it.

Avoid changing what? And did you get the message that doing (in effect)

import md5
m = md5.new
md5 = m

is a horrifyingly dangerous disgusting and ugly stunt?
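
The hazard is easier to see in isolation. Below is a small Python 3 sketch in which hashlib plays the part that the md5 module played in 2005; the function names shadowed and aliased are made up for illustration.

```python
import hashlib


def shadowed():
    """Rebind the module's name to one of its functions, as the script did."""
    import hashlib
    m = hashlib.md5
    hashlib = m                       # 'hashlib' now names a function, not the module
    try:
        hashlib.sha1(b"x")            # any later use of the module breaks
    except AttributeError:
        return "module is gone"
    return "still works"


def aliased():
    """Keep the module name intact and give the shortcut its own name."""
    md5_new = hashlib.md5
    return md5_new(b"x").hexdigest()
```

Giving the shortcut a name of its own keeps the speed benefit of a local alias without making the module unreachable for the rest of the script.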

> 
>  >
>  > (1) 'data' is a copy of 'a'
>  > (2) the first loop's body is effectively: digest = RHS; LHS = digest
>  > whereas the 2nd loop's body is: LHS = RHS
>  > (3) the first loop uses starts[j]+1 whereas the second loop uses
>  > starts[j]
> 
> oops, 3 is because the nicer version created a slightly different  index 
> list.  Hey, turns out that matters (see end)
> 
>  > Item (1) may affect the timing if file is large compared with  available
>  > memory -- could be 'a' has to be swapped out, and 'data' swapped in.
>  >
>  > Item (2) should make the 2nd loop very slightly faster, so we'll  ignore
>  > that :-)
> 
> yeah
> 
>  > Item (3) means you are not comparing like with like. It means that  the
>  > 1st loop has less work to do. So this could make an observable
>  > difference for very short passwords -- but still nothing like 0.14
>  > compared with 56.
>  >
>  > So, some more questions:
>  >
>  > The 56.56 is suspiciously precise -- you ran it a few times and it
>  > printed exactly 56.56 each time?
> 
> No, it got 55 or 56 something.
> 
>  >
>  > Did you try putting the 2nd loop first [refer to Item (1) above]?
> 
> Yes, that didn't change which was fast.
> 
>  > Did you try putting in a switch so that your script runs either 1st
>  > loop or 2nd loop but not both?
> 
> No, good idea.   OK tried it, and it didn't change how fast  
> each loop ran.  I also changed it so they both work on the same list  in 
> the version with a switch, and that didn't matter.
> 
>  > Note that each loop is making its target list
>  > expand in situ; this may after a while (like inside loop 2) cause the
>  > memory arena to become so fragmented that swapping will occur.  This of
>  > course can vary wildly depending on the platform; Win95 used to be  the
>  > most usual suspect but you're obviously not running on that.
> 
> Nod
> 
>  > Some observations:
>  >
>  > (1) 's' is already a string, so ''.join(s[x:y]) is a slow way of  doing
>  > s[x:y]
> 
> Oops!  That happened because it used to be ''.join(the_list[x:y]) but  
> then i realised i could just grab sections of the original string but  
> didn't fully change it.
> 
>  > (2) 'a' ends up as a list of one-byte strings, via a very circuitous
>  > process: a = array.array('c', s).tolist()
>  >
>  > A shorter route would be: a = list(s)
> 
> Oh cool.  I looked for a string-to-list function a little, but didn't  
> find that.  I thought I tried that exact one too, but I guess not.

Be aware of list comprehensions; when list(s) escaped your scan of the 
manuals, you could have done this: a = [x for x in s]

NOTE: a string is an iterable! (see later)

> 
>  > However what's wrong with what you presumably tried out first i.e.  a =
>  > array.array('c', s) ?? It doesn't need the final ''.join() before
>  > writing to disk, and it takes up less memory.
> 
> The problem was I couldn't put the new passwords in as a single  
> element.

Indeed. It's annoying enough th

Re: Strange Execution Times

2005-05-26 Thread Elliot Temple
hey FYI i found the problem:  i accidentally copied an output file for
my test data.  so all the passwords were exactly 32 chars long.  so
when replacing them with new 32 char passwords, it went much much
faster, I guess because the list kept the same number of chars in it
and didn't have to copy lots of data around.
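
The effect described here follows from how list slice assignment behaves: replacing a slice with the same number of items overwrites them in place, while a replacement of a different length forces every item after the slice to be shifted. A small Python 3 sketch with made-up sample data shows the two cases. It demonstrates the behaviour only and makes no timing claim.

```python
chars = list("<p>abc</p><p>abc</p>")

# Same-length replacement: three items out, three items in; nothing moves.
same = chars[:]
same[3:6] = "xyz"

# Longer replacement: three items out, eight in; the tail is shifted to make room.
longer = chars[:]
longer[3:6] = "01234567"
```

With a list of several million one-character strings and thousands of replacements, each shift touches on the order of the whole tail of the list, which is consistent with one loop taking seconds and the other taking a fraction of a second.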



Re: Strange Execution Times

2005-05-26 Thread John Machin
[EMAIL PROTECTED] wrote:
> I am running two functions in a row that do the same thing.

1. I see no functions here.

You should set out a script like this:

def main():
    your_code_goes_here()

if __name__ == '__main__':
    main()

for two reasons (a) your code will be referring to locals instead of 
globals; this is faster, which might appeal to you (b) if somebody 
accidentally imports the script, nothing happens.

2. The two loops to which you refer do *not* do the same thing; see later.

> One runs
> in .14 seconds, the other 56.  I'm confused.  I wrote another version
> of the program and couldn't get the slow behavior again, only the fast.
>  I'm not sure what is causing it.  Can anyone figure it out?
> 
> Here is my code (sorry it's a bit of a mess, but my cleaned up version
> isn't slow!).  Just skim to the bottom where the timing is.  The first
> time printed out is .14, the second is 56.56.
> 
> 

[snip]


[following has extraneous blank lines and comments removed]
> t1 = time.clock()
> for j in r:
>     digest = m(jo(s[starts[j]+1:ends[j]])).hexdigest()
>     a[starts[j]+1:ends[j]] = digest
> t2 = time.clock()
> print "time is", round(t2-t1, 5)
>
> t1 = time.clock()
> for i in r:
>     data[starts[i]:ends[i]] = \
>         md5(join(s[starts[i]:ends[i]])).hexdigest()
> t2 = time.clock()
> print "second time is", round(t2-t1, 5)

General questions: what platform? what version of Python? how large is 
the file? how much free memory do you have? how many passwords are 
there? what is the average length of a password?

Ignoring the superficial-but-meaningless differences (i vs j, md5
[aarrgghh!!] vs m, jo vs join), these two loops differ in the following
respects:

(1) 'data' is a copy of 'a'
(2) the first loop's body is effectively: digest = RHS; LHS = digest 
whereas the 2nd loop's body is: LHS = RHS
(3) the first loop uses starts[j]+1 whereas the second loop uses starts[j]

Item (1) may affect the timing if file is large compared with available 
memory -- could be 'a' has to be swapped out, and 'data' swapped in.

Item (2) should make the 2nd loop very slightly faster, so we'll ignore 
that :-)

Item (3) means you are not comparing like with like. It means that the 
1st loop has less work to do. So this could make an observable 
difference for very short passwords -- but still nothing like 0.14 
compared with 56.

So, some more questions:

The 56.56 is suspiciously precise -- you ran it a few times and it 
printed exactly 56.56 each time?

Did you try putting the 2nd loop first [refer to Item (1) above]?
Did you try putting in a switch so that your script runs either 1st loop 
or 2nd loop but not both? Note that each loop is making its target list 
expand in situ; this may after a while (like inside loop 2) cause the 
memory arena to become so fragmented that swapping will occur. This of 
course can vary wildly depending on the platform; Win95 used to be the 
most usual suspect but you're obviously not running on that.

Some observations:

(1) 's' is already a string, so ''.join(s[x:y]) is a slow way of doing 
s[x:y]

(2) 'a' ends up as a list of one-byte strings, via a very circuitous 
process: a = array.array('c', s).tolist()

A shorter route would be: a = list(s)
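
Observations (1) and (2) can be checked directly. A minimal Python 3 sketch with a made-up sample string:

```python
s = "<p>secret</p>"

a = list(s)                    # one single-character string per element
b = [ch for ch in s]           # the same list, built with a list comprehension

piece = s[3:9]                 # slicing a string already yields a string
joined = "".join(s[3:9])       # same result, plus a needless pass over the characters
```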

However what's wrong with what you presumably tried out first i.e. a = 
array.array('c', s) ?? It doesn't need the final ''.join() before 
writing to disk, and it takes up less memory. NOTE: the array variety 
takes up 1 byte per character. The list variety takes up at least 4 
bytes per character (on a machine where sizeof(PyObject *) == 4); to the 
extent that the file contains characters that are not interned (i.e. not
[A-Za-z_] AFAIK), much more memory is required as a separate object
will be created for each such character. Was it consistently slower?
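
The size difference between the two containers can be estimated with sys.getsizeof. This Python 3 sketch rests on assumptions: the 'c' typecode of array no longer exists in Python 3, so a bytearray stands in for the one-byte-per-character array, and the 100000-byte sample stands in for the file. getsizeof reports only the container itself, so the figure for the list counts its references and not the objects they point to.

```python
import sys

data = b"x" * 100000           # stand-in for the file contents

# One byte per character, stored contiguously.
as_bytearray = bytearray(data)

# One object reference per character, each pointing at a one-byte bytes object.
as_list = [data[i:i + 1] for i in range(len(data))]

compact = sys.getsizeof(as_bytearray)
spread_out = sys.getsizeof(as_list)
```

The exact numbers depend on the interpreter build and pointer size, so only the ratio between the two is meaningful.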

(3) If memory is your problem, you could rewrite the whole thing to 
simply do one write per password; that way you only need 1.x copy of the
file contents in memory, not 2.x.
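
Suggestion (3) might look roughly like the following in Python 3. It is a sketch under assumptions, not code from the thread: the function name rewrite and the <password> markers are hypothetical, and hashlib replaces the old md5 module. Only the original text is held in memory; each piece goes straight to the output, using a couple of small writes per password.

```python
import hashlib
import io


def rewrite(text, out, p1="<password>", p2="</password>"):
    """Write text to out piece by piece, hashing each span between p1 and p2."""
    pos = 0
    while True:
        start = text.find(p1, pos)
        if start == -1:
            break
        start += len(p1)
        end = text.find(p2, start)
        if end == -1:
            break                                  # unterminated marker: copy the rest as-is
        out.write(text[pos:start])                 # everything up to and including p1
        out.write(hashlib.md5(text[start:end].encode("utf-8")).hexdigest())
        pos = end                                  # resume at the closing marker
    out.write(text[pos:])                          # whatever follows the last password


# Hypothetical input with two passwords on one line.
buf = io.StringIO()
rewrite("<password>one</password><password>two</password>\n", buf)
```

Unlike the line-at-a-time loop, this version copes with several passwords on one line, at the cost of reading the whole file into a single string first.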

Hoping some of this helps,
John


Re: Strange Execution Times

2005-05-26 Thread Fredrik Lundh
<[EMAIL PROTECTED]> wrote:

>I am running two functions in a row that do the same thing.  One runs
> in .14 seconds, the other 56.  I'm confused.  I wrote another version
> of the program and couldn't get the slow behavior again, only the fast.
> I'm not sure what is causing it.  Can anyone figure it out?

it would be a lot easier to help if you posted a self-contained example.

 





Strange Execution Times

2005-05-25 Thread curi42
I am running two functions in a row that do the same thing.  One runs
in .14 seconds, the other 56.  I'm confused.  I wrote another version
of the program and couldn't get the slow behavior again, only the fast.
 I'm not sure what is causing it.  Can anyone figure it out?

Here is my code (sorry it's a bit of a mess, but my cleaned up version
isn't slow!).  Just skim to the bottom where the timing is.  The first
time printed out is .14, the second is 56.56.


f = open("/Users/curi/data.xml")


o = open("/Users/curi/out2.xml", "w")


import md5
import array


p1 = ""
p2 = ""

cnt = 0

m = md5.new
jo = "".join


adjust = len(p1) - 1

i = 1
s = f.read()
a = array.array('c', s).tolist()
spot = 0
k = 0
find = s.find

starts = []
ends = []

while k != -1:

    #print len(s)
    k = find(p2, spot)
    if k != -1:
        starts.append(find(p1, spot) + adjust)
        ends.append(k)
    spot = k + 1

#s = "".join([s[:j+1], md5.new(s[j+1:k-1]).hexdigest(), s[k:]])

#if k != -1: a[j+1:k-1] = m(jo(a[j+1:k-1])).hexdigest()



r = range(len(starts))
#r = range(20)
r.reverse()
import time


data = a[:]

md5 = m
join = jo





t1 = time.clock()
for j in r:
    #print jo(s[starts[j]+1:ends[j]])
    digest = m(jo(s[starts[j]+1:ends[j]])).hexdigest()

    a[starts[j]+1:ends[j]] = digest
    #cnt += 1
    #if cnt % 100 == 0: print cnt


t2 = time.clock()
print "time is", round(t2-t1, 5)



t1 = time.clock()
for i in r:
    data[starts[i]:ends[i]] = \
        md5(join(s[starts[i]:ends[i]])).hexdigest()
t2 = time.clock()
print "second time is", round(t2-t1, 5)


o.write(jo(a))
