[issue26118] String performance issue using single quotes

2016-01-14 Thread poostenr

New submission from poostenr:

There appears to be a significant performance difference between the following 
two statements, and I am unable to explain it.

s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

So far, no luck trying to find other statements to improve performance, such as:
s = "\'{0}\',".format(columnvalue)
s = "'" + "%s" %(columnvalue) + "'"+","
s = "{0}{1}{2},".format("'",columnvalue,"'")
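A minimal sketch for timing these variants in isolation with the standard timeit module follows. The sample value, the dictionary labels and the iteration count are assumptions made for illustration; none of them come from the original script.

```python
import timeit

# Hypothetical sample value; the real script reads values from CSV files.
columnvalue = "test"

# The statements from the report, kept as source strings so timeit can run them.
variants = {
    "no quotes": '"{0},".format(columnvalue)',
    "quotes in format": '''"'{0}',".format(columnvalue)''',
    "percent": '''"'" + "%s" % (columnvalue) + "'" + ","''',
    "quotes as args": '''"{0}{1}{2},".format("'", columnvalue, "'")''',
}


def time_variants(number=100000):
    """Return a dict mapping each variant label to total seconds for `number` runs."""
    results = {}
    for name, stmt in variants.items():
        results[name] = timeit.timeit(
            stmt, globals={"columnvalue": columnvalue}, number=number
        )
    return results


if __name__ == "__main__":
    for name, seconds in time_variants().items():
        print("{0:18s} {1:.4f} s".format(name, seconds))
```

If the quoted variants are not dramatically slower here, the slowdown probably comes from somewhere other than the format call itself.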

--
components: Windows
messages: 258243
nosy: paul.moore, poostenr, steve.dower, tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: String performance issue using single quotes
type: performance
versions: Python 3.5

___
Python tracker 
<http://bugs.python.org/issue26118>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com




2016-01-14 Thread poostenr

poostenr added the comment:

My initial observations with my Python script using:

s = "{0},".format(columnvalue)   # fast
Processed ~360MB of data from 2:16PM - 2:51PM (35 minutes, ~10MB/min)
One particular file 6.5MB took ~1 minute.

When I changed this line of code to:
s = "'{0}',".format(columnvalue) # ~30x slower (1 min. vs 30 min.)
Same particular file of 6.5MB took ~30 minutes (228KB/min).

My Python environment is:
C:\Data>python -V
Python 3.5.1 :: Anaconda 2.4.1 (32-bit)

I did some more testing with a very simplified piece of code, but the results 
are not conclusive. There is, however, a noticeable jump when I introduce the 
single quotes: from 0m2.410s to 0m3.875s.

$ python -V
Python 3.5.1

// s='test'
// for x in range(1000):
//     y = "{0}".format(s)

$ time python test.py

real    0m2.410s
user    0m2.356s
sys     0m0.048s

// s='test'
// for x in range(1000):
//     y = "'%s'" % (s)

$ time python test2.py 

real    0m2.510s
user    0m2.453s
sys     0m0.051s

// s='test'
// for x in range(1000):
//     y = "'{0}'".format(s)

$ time python test3.py 

real    0m3.875s
user    0m3.819s
sys     0m0.048s





2016-01-14 Thread poostenr

poostenr added the comment:

Eric, Steven, thank you for your feedback so far.

I am using Windows7, Intel i7.
That one particular file of 6.5MB took ~1 minute on my machine.
When I ran that same test on Linux with Python 3.5.1, it took about 3 seconds. 
I was amazed to see a 20x difference.

Steven suggested the idea that this phenomenon might be specific to Windows. 
And I agree, that is what it is looking like. Or is Python doing something in 
the background?

The Python script is straightforward: a loop reads a line from a CSV file, 
splits the column values and saves each value, wrapped in single quotes, to 
another file. Basically it is building an SQL statement.
I have had no issues until I added the encapsulating single quotes around the 
value.

Because I can reproduce this performance difference at will by alternating 
which line I comment out, I believe it cannot be the HDD, AV or something else 
outside the Python script interfering.

I repeated the simplified test, that I ran earlier on a Linux system, but this 
time on my Windows system.
I don't see anything spectacular.
I am just puzzled that using one statement or the other causes such a huge 
performance impact somehow.

I will try some more tests and copy your examples.

import time
loopcount = 1000

# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End   test 1: 1452828397957
# Diff  test 1: 3434 ms


s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End   test 2: 1452828401233
# Diff  test 2: 3276 ms


s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End   test 3: 1452828406320
# Diff  test 3: 5087 ms

# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End   test 4: 1452828411378
# Diff  test 4: 5058 ms


s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End   test 5: 1452828415264
# Diff  test 5: 3886 ms

s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End   test 6: 1452828421292
# Diff  test 6: 6028 ms





2016-01-14 Thread poostenr

poostenr added the comment:

Eric,

I just tried your examples. 
The loop count is 100x higher, but the results differ by about a factor of 10.

Test1:
My results:
C:\Data>python -m timeit -s 'x=4' '",{0}".format(x)'
1 loops, best of 3: 0.0116 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '",{0}".format(x)'
100 loops, best of 3: 0.182 usec per loop

Test2:
My results:
C:\Data>python -m timeit -s 'x=4' '"\'{0}\',".format(x)'
1 loops, best of 3: 0.0122 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
100 loops, best of 3: 0.205 usec per loop
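One caveat when comparing these command lines: cmd.exe does not treat single quotes as quoting characters, so the Windows invocations above may not have timed the same statement as the Linux ones. Whether that accounts for the gap is an assumption and has not been verified here. A minimal sketch that sidesteps shell quoting by calling timeit from a script is shown below; the helper name best_usec and the default counts are hypothetical.

```python
import timeit

setup = "x = 4"
plain = '",{0}".format(x)'
quoted = '''"'{0}',".format(x)'''


def best_usec(stmt, number=1000000, repeat=3):
    """Return the best time per loop, in microseconds, over `repeat` runs."""
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e6


if __name__ == "__main__":
    print("plain : {0:.4f} usec per loop".format(best_usec(plain)))
    print("quoted: {0:.4f} usec per loop".format(best_usec(quoted)))
```

Running the same script on both systems gives numbers that can be compared directly, because the statements no longer pass through a shell.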





2016-01-14 Thread poostenr

poostenr added the comment:

Eric, Steven,

During further testing I was not able to find any real evidence that the 
statement I was focused on had a real performance issue.

As I did more testing I noticed that appending data to the file slowed down. 
The file grew initially with ~30-50KB increments and around 500KB it had slowed 
down to ~3-5KB/s, until around 1MB the file grew at ~1KB/s. I found this to be 
odd and because Steven had mentioned other processes, I started looking at some 
other statements.

After quite a lot of trial and error, I was able to use single quotes and 
increase my performance to acceptable levels.
Example 3 below is how I resolved it.

Can you explain to me why there was a performance penalty in example 2 ?
Python did something under the hood that I am overlooking.

Did conv.escape_string() change something about columnvalue, so that adding a 
single quote before and after it introduced some odd behavior when writing to 
the file? I am not an expert on Python and remember reading something about 
dynamic typing. 

Example 1: Fast performance, variable s is not encapsulated with single quotes
6.5MB parsed in ~1 minute.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    # Append s to file.

Example 2: Slow performance, pre- and post-fixed variable s with single quotes
6.5MB parsed in 35 minutes.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "'{0}',".format(columnvalue) # Added single quotes
    else:
        s = "'{0}',".format(columnvalue) # Added single quotes
    count -= 1
    # Append s to file.

Example 3: Fast performance, variable columnvalue is pre- and post-fixed with 
single quotes
6.5MB parsed in ~45 seconds.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    # Moved single quotes to this statement.
    columnvalue = conv.escape_string("'" + columnvalue + "'")
    if (count > 1):
        s = "{0},".format(columnvalue)
    else:
        s = "{0},".format(columnvalue)
    count -= 1
    # Append s to file.
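One thing worth checking is whether Examples 2 and 3 write the same text. If conv.escape_string() escapes single quotes, as SQL escaping helpers commonly do, then Example 3 also escapes the quotes that were just added. The sketch below uses a simplified stand-in, escape_string_standin, which is hypothetical and only approximates that behavior; the real pymysql function may differ.

```python
def escape_string_standin(value):
    """Simplified stand-in: backslash-escape backslashes and quote characters.

    This is NOT pymysql's implementation, only an assumption about its effect.
    """
    return (
        value.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace('"', '\\"')
    )


def example2(columnvalue):
    # Escape first, then add the quotes in the format string.
    return "'{0}',".format(escape_string_standin(columnvalue))


def example3(columnvalue):
    # Add the quotes first, then escape: the added quotes are escaped as well.
    return "{0},".format(escape_string_standin("'" + columnvalue + "'"))


if __name__ == "__main__":
    print(example2("abc"))
    print(example3("abc"))
```

Under this assumption the two examples produce different output, so comparing the generated files would show whether Example 3 is really equivalent to Example 2.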





2016-01-15 Thread poostenr

poostenr added the comment:

Thank you for your feedback Victor and Steven.

I just copied my scripts and 360MB of CSV files over to Linux.
The entire process finished in 4 minutes exactly, using the original python 
scripts.
So there is something different between my environments.
If it were a fragmentation issue, then I would expect to always see slow 
performance on the Windows system. But I can influence the performance by 
alternating between the two original statements:
s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

I apologize for not being able to provide the entire code.
There is too much code to post at this time.

I am opening a file like this:
#logger = open(filename, rw, buffering, encoding)
logger = open('output.sql', 'a', 1, 'iso-8859-1')

I write to file:
logger.write(text+'\n')
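The 1 passed as the third argument to open() selects line buffering in text mode, so each write that ends in a newline is flushed to the file. Whether this contributes to the slowdown is not known. A minimal sketch for measuring it is below; the function names, the generated sample lines and the line count are assumptions for illustration.

```python
import os
import tempfile
import time


def write_lines(path, lines, buffering):
    """Append lines to path using the given buffering; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "a", buffering, "iso-8859-1") as logger:
        for text in lines:
            logger.write(text + "\n")
    return time.perf_counter() - start


def compare_buffering(line_count=20000):
    """Time line-buffered (1) against default (-1) buffering on the same data."""
    lines = ["'value{0}',".format(i) for i in range(line_count)]
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, buffering in (("line-buffered", 1), ("default", -1)):
            path = os.path.join(tmp, label + ".sql")
            seconds = write_lines(path, lines, buffering)
            results[label] = (seconds, os.path.getsize(path))
    return results


if __name__ == "__main__":
    for label, (seconds, size) in compare_buffering().items():
        print("{0:14s} {1:.4f} s, {2} bytes".format(label, seconds, size))
```

Both runs write identical data, so any difference in elapsed time comes from how often the data is flushed.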

I'm using a library to escape the string before saving to file.
import pymysql.converters as conv
<...>
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    logger.write(s+'\n')

I appreciate the feedback and ideas so far.
Trying the profiler is on my list to see if it provides more insight.
I am not using Anaconda3 on Linux. Perhaps that has an impact somehow?
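A minimal sketch of what a profiler run could look like, using the standard cProfile and pstats modules, is below. The build_statement function is a hypothetical stand-in for the per-record loop; the real script's functions would be passed to profile_call instead.

```python
import cProfile
import io
import pstats


def build_statement(values):
    """Hypothetical stand-in for the per-record loop in the real script."""
    parts = []
    for columnvalue in values:
        parts.append("'{0}',".format(columnvalue))
    return "".join(parts)


def profile_call(func, *args):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(build_statement, ["test"] * 100000)
    print(report)
```

Sorting by cumulative time shows whether the time goes to string formatting, to escaping, or to the file writes.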

I never suspected that inserting the two single quotes would cause such a 
performance problem. I noticed it when I parsed ~40GB of data and it took 
almost a week to complete instead of my expected 6-7 hrs.
Just the other day I decided to remove the single quotes because it was the 
only thing left that I'd changed. For the past two weeks I had dismissed that 
change because I assumed it couldn't be causing the performance problem.

Today, I wasn't expecting such a big difference between running my script on 
Linux or Windows.

If I discover anything else, I will post an update.
When I get the chance I can remove redundant code and post the source.
