[issue26118] String performance issue using single quotes
New submission from poostenr:

There appears to be a significant performance issue between the following two statements. I am unable to explain the performance impact.

s = "{0},".format(columnvalue)    # fast
s = "'{0}',".format(columnvalue)  # ~30x slower

So far, I have had no luck finding other statements that improve performance, such as:

s = "\'{0}\',".format(columnvalue)
s = "'" + "%s" %(columnvalue) + "'"+","
s = "{0}{1}{2},".format("'",columnvalue,"'")

--
components: Windows
messages: 258243
nosy: paul.moore, poostenr, steve.dower, tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: String performance issue using single quotes
type: performance
versions: Python 3.5

___
Python tracker <http://bugs.python.org/issue26118>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
poostenr added the comment:

My initial observations with my Python script, using:

s = "{0},".format(columnvalue)    # fast

Processed ~360MB of data from 2:16PM to 2:51PM (35 minutes, ~10MB/min).
One particular 6.5MB file took ~1 minute.

When I changed this line of code to:

s = "'{0}',".format(columnvalue)  # ~30x slower (1 min. vs 30 min.)

The same 6.5MB file took ~30 minutes (228KB/min).

My Python environment is:

C:\Data>python -V
Python 3.5.1 :: Anaconda 2.4.1 (32-bit)

I did some more testing with a very simplified piece of code, but it is not conclusive. There is, however, a significant jump when I introduce the single quotes: from 0m2.410s to 0m3.875s.

$ python -V
Python 3.5.1

//
// s='test'
// for x in range(1000):
//     y = "{0}".format(s)

$ time python test.py
real 0m2.410s
user 0m2.356s
sys  0m0.048s

// s='test'
// for x in range(1000):
//     y = "'%s'" % (s)

$ time python test2.py
real 0m2.510s
user 0m2.453s
sys  0m0.051s

// s='test'
// for x in range(1000):
//     y = "'{0}'".format(s)

$ time python test3.py
real 0m3.875s
user 0m3.819s
sys  0m0.048s
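The `time python test.py` figures above include interpreter startup, which can easily dominate a 1000-iteration loop. A minimal sketch, assuming the standard-library timeit module is acceptable here, times only the three statements in-process. The names STATEMENTS and measure are illustrative and not part of the original test scripts.

```python
import timeit

# The three statement variants from the simplified test scripts above.
STATEMENTS = {
    "format, no quotes": '"{0}".format(s)',
    "percent, quotes": '"\'%s\'" % (s)',
    "format, quotes": '"\'{0}\'".format(s)',
}


def measure(number=100000, repeat=5):
    """Return the best per-call time in microseconds for each variant."""
    results = {}
    for label, stmt in STATEMENTS.items():
        # Taking the minimum of several repeats damps scheduler noise.
        timings = timeit.repeat(stmt, setup="s = 'test'",
                                number=number, repeat=repeat)
        results[label] = min(timings) / number * 1e6
    return results


if __name__ == "__main__":
    for label, usec in measure().items():
        print("{0:20s} {1:.3f} usec per call".format(label, usec))
```

Because the setup and the loop run inside one interpreter, any difference that remains is attributable to the statements themselves rather than to process startup.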
poostenr added the comment:

Eric, Steven, thank you for your feedback so far.

I am using Windows 7 on an Intel i7. That one particular 6.5MB file took ~1 minute on my machine. When I ran the same test on Linux with Python 3.5.1, it took about 3 seconds. I was amazed to see a 20x difference.

Steven suggested that this phenomenon might be specific to Windows, and I agree, that is what it looks like. Or is Python doing something in the background?

The Python script is straightforward: a loop reads a line from a CSV file, splits the column values, and saves each value as a single-quoted string to another file, basically building an SQL statement. I had no issues until I added the enclosing single quotes around the value.

Because I can reproduce this performance difference at will by alternating which line I comment out, I believe it cannot be the HDD, AV, or something else outside the Python script interfering.

I repeated the simplified test that I ran earlier on a Linux system, this time on my Windows system. I don't see anything spectacular. I am just puzzled that using one statement or the other somehow causes such a huge performance impact.

I will try some more tests and copy your examples.
import time

loopcount = 1000

# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End test 1: 1452828397957
# Diff test 1: 3434 ms

s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End test 2: 1452828401233
# Diff test 2: 3276 ms

s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End test 3: 1452828406320
# Diff test 3: 5087 ms

# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End test 4: 1452828411378
# Diff test 4: 5058 ms

s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End test 5: 1452828415264
# Diff test 5: 3886 ms

s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End test 6: 1452828421292
# Diff test 6: 6028 ms
poostenr added the comment:

Eric, I just tried your examples. The loop count is 100x more, but the results are about a factor of 10 off.

Test1:

My results:
C:\Data>python -m timeit -s 'x=4' '",{0}".format(x)'
1 loops, best of 3: 0.0116 usec per loop

Eric's results:
$ python -m timeit -s 'x=4' '",{0}".format(x)'
100 loops, best of 3: 0.182 usec per loop

Test2:

My results:
C:\Data>python -m timeit -s 'x=4' '"\'{0}\',".format(x)'
1 loops, best of 3: 0.0122 usec per loop

Eric's results:
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
100 loops, best of 3: 0.205 usec per loop
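The command lines above are quoted for a POSIX shell. cmd.exe does not treat single quotes as quoting characters, so the Windows runs may have been timing something other than the intended statement. That is one possible explanation for the gap, not something this message confirms. A sketch that sidesteps shell quoting by handing the arguments to timeit as a list (the helper name run_timeit and its default values are hypothetical):

```python
import subprocess
import sys


def run_timeit(stmt, setup="x=4", number=100000, repeat=3):
    """Run `python -m timeit` without a shell, so no shell quoting applies."""
    cmd = [
        sys.executable, "-m", "timeit",
        "-n", str(number),
        "-r", str(repeat),
        "-s", setup,
        stmt,
    ]
    # With an argument list and no shell, the statement is not
    # reinterpreted by cmd.exe or by a POSIX shell.
    completed = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, check=True,
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    print(run_timeit('",{0}".format(x)'))
    print(run_timeit('"\'{0}\',".format(x)'))
```

Run this way, the same two statements are measured identically on Windows and Linux, which makes the per-loop figures directly comparable.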
poostenr added the comment:

Eric, Steven,

During further testing I was not able to find any real evidence that the statement I was focused on had a real performance issue.

As I did more testing, I noticed that appending data to the file slowed down. The file initially grew in ~30-50KB increments; around 500KB it had slowed to ~3-5KB/s, and around 1MB the file grew at ~1KB/s. I found this odd, and because Steven had mentioned other processes, I started looking at some other statements.

After quite a lot of trial and error, I was able to use single quotes and bring performance to acceptable levels. Example 3 below is how I resolved it.

Can you explain to me why there was a performance penalty in example 2? Python did something under the hood that I am overlooking. Did conv.escape_string() change something about columnvalue, so that adding a single quote before and after it introduced some odd behavior when writing to the file? I am not an expert on Python, and I remember reading something about dynamic typing.

Example 1: Fast performance; variable s is not enclosed in single quotes.
6.5MB parsed in ~1 minute.

for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    # Append s to file.

Example 2: Slow performance; variable s is prefixed and suffixed with single quotes.
6.5MB parsed in 35 minutes.

for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "'{0}',".format(columnvalue)  # Added single quotes
    else:
        s = "'{0}',".format(columnvalue)  # Added single quotes
    count -= 1
    # Append s to file.

Example 3: Fast performance; variable columnvalue is prefixed and suffixed with single quotes.
6.5MB parsed in ~45 seconds.

for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string("'" + columnvalue + "'")  # Moved single quotes to this statement.
    if (count > 1):
        s = "{0},".format(columnvalue)
    else:
        s = "{0},".format(columnvalue)
    count -= 1
    # Append s to file.
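The message above observes that the slowdown tracked the growth of the output file rather than the format call alone. One way to localize that, sketched here under assumptions, is to time the formatting and the file write separately. The function name time_phases, the sample values, and the temporary output file are illustrative. conv.escape_string() is left out so the sketch runs with the standard library only, and the open() arguments mirror the call the author reports elsewhere in this thread.

```python
import os
import tempfile
import time


def time_phases(values, path, quoted):
    """Return (format_seconds, write_seconds) for one pass over values."""
    template = "'{0}'," if quoted else "{0},"
    fmt_time = 0.0
    write_time = 0.0
    # Append mode, line-buffered (buffering=1), iso-8859-1 encoding.
    with open(path, "a", 1, "iso-8859-1") as logger:
        for value in values:
            t0 = time.perf_counter()
            s = template.format(value)
            t1 = time.perf_counter()
            logger.write(s + "\n")
            t2 = time.perf_counter()
            fmt_time += t1 - t0
            write_time += t2 - t1
    return fmt_time, write_time


if __name__ == "__main__":
    sample = ["value {0}".format(i) for i in range(10000)]
    for quoted in (False, True):
        fd, path = tempfile.mkstemp(suffix=".sql")
        os.close(fd)
        try:
            fmt_s, write_s = time_phases(sample, path, quoted)
            print("quoted={0}: format {1:.4f}s, write {2:.4f}s".format(
                quoted, fmt_s, write_s))
        finally:
            os.remove(path)
```

If the write column grows while the format column stays flat, the bottleneck is in file output (or something intercepting it) rather than in str.format().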
poostenr added the comment:

Thank you for your feedback, Victor and Steven.

I just copied my scripts and 360MB of CSV files over to Linux. The entire process finished in exactly 4 minutes, using the original Python scripts. So there is something different between my environments.

If it were a fragmentation issue, I would expect consistently slow performance on the Windows system. But I can influence the performance by alternating between the two original statements:

s = "{0},".format(columnvalue)    # fast
s = "'{0}',".format(columnvalue)  # ~30x slower

I apologize for not being able to provide the entire code. There is too much code to post at this time.

I am opening a file like this:

#logger = open(filename, rw, buffering, encoding)
logger = open('output.sql', 'a', 1, 'iso-8859-1')

I write to the file:

logger.write(text+'\n')

I'm using a library to escape the string before saving it to the file.

import pymysql.converters as conv
<...>
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    logger.write(s+'\n')

I appreciate the feedback and ideas so far. Trying the profiler is on my list, to see if it provides more insight. I am not using Anaconda3 on Linux. Perhaps that has an impact somehow?

I never suspected that inserting the two single quotes would cause such a performance problem. I noticed it when I parsed ~40GB of data and it took almost a week to complete instead of my expected 6-7 hrs. Just the other day I decided to remove the single quotes because it was the only thing left that I had changed. For the past two weeks I had ruled out that change because I thought it couldn't be causing the performance problem. Today, I wasn't expecting such a big difference between running my script on Linux and on Windows.

If I discover anything else, I will post an update. When I get the chance, I can remove redundant code and post the source.
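Since the profiler is mentioned as a next step, here is a minimal sketch of how cProfile could be applied. The function build_rows is a hypothetical stand-in for the real parsing loop, which is not shown in full in this thread, and profile_call is an illustrative helper name.

```python
import cProfile
import io
import pstats


def build_rows(records):
    """Hypothetical stand-in for the real loop: quote and join each value."""
    out = []
    for record in records:
        out.append(",".join("'{0}'".format(v) for v in record))
    return out


def profile_call(func, *args):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    rows, report = profile_call(build_rows, [("a", "b"), ("c", "d")] * 1000)
    print(report)
```

Wrapping the real per-file routine the same way on both Windows and Linux would show whether the extra time lands in string formatting, in escape_string(), or in the file writes.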