Re: Strange Execution Times
Elliot Temple wrote:
> On May 26, 2005, at 3:22 PM, John Machin wrote:
>
>> Then post your summarised results back to the newsgroup for the
>> benefit of all -- there's this vague hope that folk actually read
>> other peoples' posts before firing off questions :-)
>
> Here is my new version. It runs in about .65 seconds. The trick?
> Reading lines one at a time. Please let me know if there's any bad
> coding practices in it!
>
>     for line in f:
>         start, end = line.find(p1) + adjust, line.find(p2)
>         if end != -1:
>             digest = md5.new(line[start:end]).hexdigest()
>             out.write(line[:start] + digest + line[end:])
>         else:
>             out.write(line)

Hmmm ... simple, elegant *and* runs fast!

Only two minor points:

1. Your code assumes that there can be no more than one password per
line and that there are no "syntax errors". I'd add at least a comment
to that effect.

2. The scan for p2 is wasted if the scan for p1 finds nothing. If p1 is
found, you scan for p2 from the beginning of the line. Depending on the
average length of a line, etc, that could make a difference.

Try this:

    for line in f:
        start = line.find(p1)
        if start == -1:
            out.write(line)
        else:
            start += adjust
            end = line.find(p2, start)
            if end == -1:
                raise CannotHappenError
            digest = md5.new(line[start:end]).hexdigest()
            out.write(line[:start] + digest + line[end:])

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
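For readers on Python 3, where the old md5 module no longer exists, the loop suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: hashlib.md5 stands in for md5.new, the delimiters "<password>" and "</password>" are made-up placeholders (the thread does not show the real p1 and p2), the helper name hash_password_in_line is hypothetical, and ValueError replaces the CannotHappenError that the post leaves undefined.

```python
import hashlib

# Hypothetical delimiters: the thread does not show the real p1 and p2.
P1 = "<password>"
P2 = "</password>"


def hash_password_in_line(line, p1=P1, p2=P2):
    """Return line with the text between p1 and p2 replaced by its MD5 hex digest.

    Assumes at most one password per line, as the posted code does.
    """
    start = line.find(p1)
    if start == -1:
        # No opening marker, so there is no need to scan for p2 at all.
        return line
    start += len(p1)
    # Resume the search where the password begins, not at the start of the line.
    end = line.find(p2, start)
    if end == -1:
        raise ValueError("opening marker without closing marker: %r" % line)
    digest = hashlib.md5(line[start:end].encode("utf-8")).hexdigest()
    return line[:start] + digest + line[end:]
```

In a script this would be called once per line, for example out.write(hash_password_in_line(line)) inside the "for line in f:" loop.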
Re: Strange Execution Times
On May 26, 2005, at 3:22 PM, John Machin wrote:

> Then post your summarised results back to the newsgroup for the
> benefit of all -- there's this vague hope that folk actually read
> other peoples' posts before firing off questions :-)

Here is my new version. It runs in about .65 seconds. The trick?
Reading lines one at a time. Please let me know if there's any bad
coding practices in it!

def main():
    import md5
    import time
    f = open("data.xml", "rU")
    out = open("out.xml", "w")
    p1 = ""
    p2 = ""
    adjust = len(p1)
    t1 = time.clock()
    for line in f:
        start, end = line.find(p1) + adjust, line.find(p2)
        if end != -1:
            digest = md5.new(line[start:end]).hexdigest()
            out.write(line[:start] + digest + line[end:])
        else:
            out.write(line)
    t2 = time.clock()
    print round(t2-t1, 5)
    f.close()
    out.close()

if __name__ == '__main__':
    main()

--
Elliot Temple
http://www.curi.us/
Re: Strange Execution Times
John Machin wrote:
> Then post your summarised results back to the newsgroup for the benefit
> of all -- there's this vague hope that folk actually read other peoples'
> posts before firing off questions :-)

+1 QOTW :-)
Re: Strange Execution Times
Elliot Temple wrote:

[copying Elliot's e-mail reply back to the list because it's educational
and scarcely private]

> On 5/26/05, John Machin <[EMAIL PROTECTED]> wrote:
> > [EMAIL PROTECTED] wrote:
> >
> > > I am running two functions in a row that do the same thing.
> >
> > 1. I see no functions here.
> >
> > You should set out a script like this:
> >
> > def main():
> >     your_code_goes_here()
> >
> > if __name__ == '__main__':
> >     main()
> >
> > for two reasons (a) your code will be referring to locals instead of
> > globals; this is faster, which might appeal to you (b) if somebody
> > accidentally imports the script, nothing happens.
>
> Oops, I meant code blocks not functions. Good advice, thanks.
>
> > 2. The two loops to which you refer do *not* do the same thing; see
> > later.
> >
> > General questions: what platform? what version of Python? how large is
> > the file? how much free memory do you have? how many passwords are
> > there? what is the average length of a password?
>
> OS X 10.4.1, Python 2.3.5 (I wonder why they bundled an old
> version..) The file is 4 megs, about 8000 passwords. I have 375 megs
> of RAM free. the passwords are mostly about 5-6 chars long.

Huh-uh -- evidently (from what you said later) a *GUESS* on the password
size; measurement on the actual file that you were using would have given
the answer "Oops, mean = 32, standard dev = 0".

> > Ignoring the superficial-but-meaningless differences (i vs j, md5
> > [aarrgghh!!] vs m, jo vs join), these two loops differ in the
> > following respects:
>
> Sorry, I wrote a nicer version of the program with things named well,
> but it was only getting the fast time, so I copied it into the old
> version of the program and then I had to write join=jo etc to avoid
> changing it.

Avoid changing what? And did you get the message that doing (in effect)

    import md5
    m = md5.new
    md5 = m

is a horrifyingly dangerous disgusting and ugly stunt?

> > (1) 'data' is a copy of 'a'
> > (2) the first loop's body is effectively: digest = RHS; LHS = digest
> > whereas the 2nd loop's body is: LHS = RHS
> > (3) the first loop uses starts[j]+1 whereas the second loop uses
> > starts[j]
>
> oops, 3 is because the nicer version created a slightly different index
> list.

Hey, turns out that matters (see end)

> > Item (1) may affect the timing if file is large compared with available
> > memory -- could be 'a' has to be swapped out, and 'data' swapped in.
> >
> > Item (2) should make the 2nd loop very slightly faster, so we'll ignore
> > that :-)
>
> yeah
>
> > Item (3) means you are not comparing like with like. It means that the
> > 1st loop has less work to do. So this could make an observable
> > difference for very short passwords -- but still nothing like 0.14
> > compared with 56.
> >
> > So, some more questions:
> >
> > The 56.56 is suspiciously precise -- you ran it a few times and it
> > printed exactly 56.56 each time?
>
> No, it got 55 or 56 something.
>
> > Did you try putting the 2nd loop first [refer to Item (1) above]?
>
> Yes, that didn't change which was fast.
>
> > Did you try putting in a switch so that your script runs either 1st
> > loop or 2nd loop but not both?
>
> No, good idea. OK tried it, and it didn't change how fast
> each loop ran. I also changed it so they both work on the same list in
> the version with a switch, and that didn't matter.
>
> > Note that each loop is making its target list
> > expand in situ; this may after a while (like inside loop 2) cause the
> > memory arena to become so fragmented that swapping will occur. This of
> > course can vary wildly depending on the platform; Win95 used to be the
> > most usual suspect but you're obviously not running on that.
>
> Nod
>
> > Some observations:
> >
> > (1) 's' is already a string, so ''.join(s[x:y]) is a slow way of doing
> > s[x:y]
>
> Oops! That happened because it used to be ''.join(the_list[x:y]) but
> then i realised i could just grab sections of the original string but
> didn't fully change it.
>
> > (2) 'a' ends up as a list of one-byte strings, via a very circuitous
> > process: a = array.array('c', s).tolist()
> >
> > A shorter route would be: a = list(s)
>
> Oh cool. I looked for a string-to-list function a little, but didn't
> find that. I thought I tried that exact one too, but I guess not.

Be aware of list comprehensions; when list(s) escaped your scan of the
manuals, you could have done this:

    a = [x for x in s]

NOTE: a string is an iterable! (see later)

> > However what's wrong with what you presumably tried out first i.e. a =
> > array.array('c', s) ?? It doesn't need the final ''.join() before
> > writing to disk, and it takes up less memory.
>
> The problem was I couldn't put the new passwords in as a single
> element.

Indeed. It's annoying enough th
Re: Strange Execution Times
hey FYI i found the problem: i accidentally copied an output file for
my test data. so all the passwords were exactly 32 chars long. so when
replacing them with new 32 char passwords, it went much much faster, I
guess because the list kept the same number of chars in it and didn't
have to copy lots of data around.
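The effect described here can be reproduced in isolation. The sketch below is not the original script: the list contents, sizes and span positions are made up, and the helper names replace_spans and timed are hypothetical. It simply times slice assignment when the replacement has the same length as the slice and when it is one element longer. As far as I know, CPython overwrites the elements in place in the first case, while in the second case everything after the slice has to be moved on every assignment.

```python
import time


def replace_spans(chars, spans, replacement):
    """Overwrite each (start, end) span of the list chars with replacement.

    Spans are processed from last to first so earlier offsets stay valid,
    which mirrors the reversed index range used in the thread.
    """
    for start, end in reversed(spans):
        chars[start:end] = replacement
    return chars


def timed(chars, spans, replacement):
    """Return the seconds taken by replace_spans on the given input."""
    t0 = time.perf_counter()
    replace_spans(chars, spans, replacement)
    return time.perf_counter() - t0


if __name__ == "__main__":
    # Made-up sizes for illustration, not the poster's data.
    size, width = 500000, 32
    spans = [(i, i + width) for i in range(0, size - width, 500)]
    same = timed(list("x" * size), spans, "y" * width)
    longer = timed(list("x" * size), spans, "y" * (width + 1))
    print("same length:      %.3f s" % same)
    print("different length: %.3f s" % longer)
```

The absolute numbers depend on the machine; the point is only the relative gap between the two runs.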
Re: Strange Execution Times
[EMAIL PROTECTED] wrote:
> I am running two functions in a row that do the same thing.

1. I see no functions here.

You should set out a script like this:

    def main():
        your_code_goes_here()

    if __name__ == '__main__':
        main()

for two reasons (a) your code will be referring to locals instead of
globals; this is faster, which might appeal to you (b) if somebody
accidentally imports the script, nothing happens.

2. The two loops to which you refer do *not* do the same thing; see
later.

> One runs
> in .14 seconds, the other 56. I'm confused. I wrote another version
> of the program and couldn't get the slow behavior again, only the fast.
> I'm not sure what is causing it. Can anyone figure it out?
>
> Here is my code (sorry it's a bit of a mess, but my cleaned up version
> isn't slow!). Just skim to the bottom where the timing is. The first
> time printed out is .14, the second is 56.56.
>
> [snip]

[following has extraneous blank lines and comments removed]

> t1 = time.clock()
> for j in r:
>     digest = m(jo(s[starts[j]+1:ends[j]])).hexdigest()
>     a[starts[j]+1:ends[j]] = digest
> t2 = time.clock()
> print "time is", round(t2-t1, 5)
>
> t1 = time.clock()
> for i in r:
>     data[starts[i]:ends[i]] = \
>         md5(join(s[starts[i]:ends[i]])).hexdigest()
> t2 = time.clock()
> print "second time is", round(t2-t1, 5)

General questions: what platform? what version of Python? how large is
the file? how much free memory do you have? how many passwords are
there? what is the average length of a password?

Ignoring the superficial-but-meaningless differences (i vs j, md5
[aarrgghh!!] vs m, jo vs join), these two loops differ in the following
respects:

(1) 'data' is a copy of 'a'
(2) the first loop's body is effectively: digest = RHS; LHS = digest
whereas the 2nd loop's body is: LHS = RHS
(3) the first loop uses starts[j]+1 whereas the second loop uses
starts[j]

Item (1) may affect the timing if file is large compared with available
memory -- could be 'a' has to be swapped out, and 'data' swapped in.

Item (2) should make the 2nd loop very slightly faster, so we'll ignore
that :-)

Item (3) means you are not comparing like with like. It means that the
1st loop has less work to do. So this could make an observable
difference for very short passwords -- but still nothing like 0.14
compared with 56.

So, some more questions:

The 56.56 is suspiciously precise -- you ran it a few times and it
printed exactly 56.56 each time?

Did you try putting the 2nd loop first [refer to Item (1) above]?

Did you try putting in a switch so that your script runs either 1st loop
or 2nd loop but not both?

Note that each loop is making its target list expand in situ; this may
after a while (like inside loop 2) cause the memory arena to become so
fragmented that swapping will occur. This of course can vary wildly
depending on the platform; Win95 used to be the most usual suspect but
you're obviously not running on that.

Some observations:

(1) 's' is already a string, so ''.join(s[x:y]) is a slow way of doing
s[x:y]

(2) 'a' ends up as a list of one-byte strings, via a very circuitous
process: a = array.array('c', s).tolist()

A shorter route would be: a = list(s)

However what's wrong with what you presumably tried out first i.e.
a = array.array('c', s) ?? It doesn't need the final ''.join() before
writing to disk, and it takes up less memory. NOTE: the array variety
takes up 1 byte per character. The list variety takes up at least 4
bytes per character (on a machine where sizeof(PyObject *) == 4); to the
extent that the file contains characters that are not interned (i.e. not
[A-Za-z_] AFAIK), much more memory is required as a separate object will
be created for each such character. Was it consistently slower?

(3) If memory is your problem, you could rewrite the whole thing to
simply do one write per password; that way you only need 1.x copy of the
file contents in memory, not 2.x.

Hoping some of this helps,
John
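Observation (3) above, one write per password, might look something like the sketch below in Python 3. It is an illustration only: the function name write_hashed is hypothetical, hashlib.md5 stands in for the old md5 module, and the caller has to supply the markers because the thread does not show the real p1 and p2. The input string is never copied into a list; each call to out.write emits the unchanged text up to a password followed by that password's digest.

```python
import hashlib
import io


def write_hashed(s, out, p1, p2):
    """Write s to out, replacing each payload between p1 and p2 with its MD5 hex digest."""
    pos = 0
    while True:
        start = s.find(p1, pos)
        if start == -1:
            break
        start += len(p1)
        end = s.find(p2, start)
        if end == -1:
            # Unterminated marker: stop and copy the remainder through unchanged.
            break
        digest = hashlib.md5(s[start:end].encode("utf-8")).hexdigest()
        # One write per password: unchanged text up to the payload, then the digest.
        out.write(s[pos:start] + digest)
        # The closing marker is emitted as part of the next chunk.
        pos = end
    out.write(s[pos:])


if __name__ == "__main__":
    buf = io.StringIO()
    # "<p>" and "</p>" are placeholder markers chosen for this demo only.
    write_hashed("<p>secret</p> and <p>other</p>\n", buf, "<p>", "</p>")
    print(buf.getvalue(), end="")
```

Because the output goes straight to the file object, only the original string plus one small chunk at a time needs to be in memory.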
Re: Strange Execution Times
<[EMAIL PROTECTED]> wrote:

> I am running two functions in a row that do the same thing. One runs
> in .14 seconds, the other 56. I'm confused. I wrote another version
> of the program and couldn't get the slow behavior again, only the fast.
> I'm not sure what is causing it. Can anyone figure it out?

it would be a lot easier to help if you posted a self-contained example.
Strange Execution Times
I am running two functions in a row that do the same thing. One runs
in .14 seconds, the other 56. I'm confused. I wrote another version
of the program and couldn't get the slow behavior again, only the fast.
I'm not sure what is causing it. Can anyone figure it out?

Here is my code (sorry it's a bit of a mess, but my cleaned up version
isn't slow!). Just skim to the bottom where the timing is. The first
time printed out is .14, the second is 56.56.

f = open("/Users/curi/data.xml")
o = open("/Users/curi/out2.xml", "w")
import md5
import array
p1 = ""
p2 = ""
cnt = 0
m = md5.new
jo = "".join
adjust = len(p1) - 1
i = 1
s = f.read()
a = array.array('c', s).tolist()
spot = 0
k = 0
find = s.find
starts = []
ends = []
while k != -1:
    #print len(s)
    k = find(p2, spot)
    if k != -1:
        starts.append(find(p1, spot) + adjust)
        ends.append(k)
        spot = k + 1
    #s = "".join([s[:j+1], md5.new(s[j+1:k-1]).hexdigest(), s[k:]])
    #if k != -1: a[j+1:k-1] = m(jo(a[j+1:k-1])).hexdigest()
r = range(len(starts))
#r = range(20)
r.reverse()
import time
data = a[:]
md5 = m
join = jo

t1 = time.clock()
for j in r:
    #print jo(s[starts[j]+1:ends[j]])
    digest = m(jo(s[starts[j]+1:ends[j]])).hexdigest()
    a[starts[j]+1:ends[j]] = digest
    #cnt += 1
    #if cnt % 100 == 0: print cnt
t2 = time.clock()
print "time is", round(t2-t1, 5)

t1 = time.clock()
for i in r:
    data[starts[i]:ends[i]] = md5(join(s[starts[i]:ends[i]])).hexdigest()
t2 = time.clock()
print "second time is", round(t2-t1, 5)
o.write(jo(a))