[issue35195] [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit()

2018-11-16 Thread Dragoljub


Dragoljub  added the comment:

Do we know if it's possible to prevent LC_CTYPE from being initialized at 
startup? Is there some combination of environment variables or command-line 
arguments that avoids this slowdown on Windows?

What is the next step to get the issue assigned?

--

___
Python tracker 
<https://bugs.python.org/issue35195>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35195] [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit()

2018-11-13 Thread Dragoljub


Dragoljub  added the comment:

This is the default LC_CTYPE locale I see on Windows 10 with Python 3.7.1 and 
with 3.5.2. It is the same in both: 'C'


'3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)]'

import _locale
_locale.setlocale(_locale.LC_CTYPE, None)

'C'


'3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]'

import _locale
_locale.setlocale(_locale.LC_CTYPE, None)

'C'
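
The same query can also be made through the public locale module. This is a 
minimal sketch, not part of the original report; the helper name 
current_ctype is only illustrative. Passing None as the second argument reads 
the current setting without changing it.

```python
import locale
import sys


def current_ctype():
    """Return the LC_CTYPE locale currently in effect for this process."""
    # Passing None queries the setting; it does not modify it.
    return locale.setlocale(locale.LC_CTYPE, None)


print(sys.version)
print(current_ctype())
```

Running this under each interpreter gives a side-by-side comparison like the 
one above.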




[issue35195] [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit()

2018-11-13 Thread Dragoljub


Dragoljub  added the comment:

On Python 3.7.1 and Windows 10:

I attempted locale.setlocale(locale.LC_ALL, "POSIX") --> Errors Out
---
Error Traceback (most recent call last)
 in 
> 1 locale.setlocale(locale.LC_ALL, "POSIX")

D:\Python37\lib\locale.py in setlocale(category, locale)
602 # convert to string
603 locale = normalize(_build_localename(locale))
--> 604 return _setlocale(category, locale)
605 
606 def resetlocale(category=LC_ALL):

Error: unsupported locale setting


I was able to set the locale to "C", but that does not improve the parsing 
performance.

locale.setlocale(locale.LC_ALL, "C") --> returns 'C'
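
A hedged sketch of how the failing call could be guarded: try several locale 
names in order and keep the first one the C runtime accepts. The helper name 
first_supported_locale and the choice of LC_CTYPE as the default category are 
assumptions made for illustration. Note that a successful call changes the 
locale of the whole process.

```python
import locale


def first_supported_locale(candidates, category=locale.LC_CTYPE):
    """Try each locale name in order and return the first one accepted."""
    for name in candidates:
        try:
            # On success this changes the process-wide locale for `category`.
            locale.setlocale(category, name)
        except locale.Error:
            # Raised ("unsupported locale setting") when the name is rejected.
            continue
        return name
    return None


chosen = first_supported_locale(["POSIX", "C"])
print(chosen)
```

Which name is accepted depends on the platform's C runtime; as reported above, 
"POSIX" is rejected on Windows while "C" is accepted.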




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Dragoljub


Dragoljub  added the comment:

Here is a simple pure python example:

digits = ''.join([str(i) for i in range(10)]*1000)
%timeit digits.isdigit() # --> 2X+ slower on python 3.7.1

Basically, the Pandas C parser calls the isdigit() function for each character 
of a number being parsed, so 12345.6789 calls isdigit() 9 times to determine 
whether each character is a digit that can be converted to a float. The problem 
is that in the latest version of Python, with the locale updates, isdigit() 
takes a locale argument that seems to be passed over and over, slowing down 
this check. Is it possible to disable any locale passing from Python down to 
lower-level C code, or simply set the default locale to 'C' to keep it from 
thrashing?
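
For anyone without IPython, the timing above can be reproduced with the 
standard timeit module. This is only a sketch: the helper name time_isdigit 
and the repeat/number values are arbitrary choices, and it measures Python's 
str.isdigit(), not the C isdigit() that the Pandas parser calls.

```python
import timeit

# Same test string as above: the ten ASCII digits repeated 1000 times.
digits = "".join(str(i) for i in range(10)) * 1000


def time_isdigit(repeat=5, number=1000):
    """Return the best per-call time in seconds for digits.isdigit()."""
    timer = timeit.Timer("digits.isdigit()", globals={"digits": digits})
    return min(timer.repeat(repeat=repeat, number=number)) / number


print(f"{time_isdigit() * 1e6:.2f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes 
when comparing interpreter versions.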




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-12 Thread Dragoljub


Dragoljub  added the comment:

@Vstinner,

Is there any way you can help test a config setting to avoid the locale changes 
in Python 3.7.0a4+? They currently cause the low-level isdigit() function to 
call the locale-specific function on Windows and update the locale on each 
call, slowing down CSV parsing on Windows by 3.5X.

How can we configure Python so that its locale behavior is no different from 
3.6.7?




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-10 Thread Dragoljub


Dragoljub  added the comment:

@cgohlke compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and 
3.7.0a4 in the Visual Studio profiler. The culprit is the isdigit function 
called in the parsers extension module. On 3.7.0a3 the function is fast, at ~8% 
of samples. On 3.7.0a4 the function is slow, at ~64% of samples, because it 
calls the _isdigit_l function, which seems to update and restore the locale in 
the current thread every time...

3.7.0a3:
Function Name                                       Incl. Samples  Excl. Samples  Incl. Samples %  Excl. Samples %  Module Name
+ [parsers.cp37-win_amd64.pyd]                                705            347           28.52%           14.04%  parsers.cp37-win_amd64.pyd
  isdigit                                                     207            207            8.37%            8.37%  ucrtbase.dll
- _errno                                                      105             39            4.25%            1.58%  ucrtbase.dll
  toupper                                                      24             24            0.97%            0.97%  ucrtbase.dll
  isspace                                                      21             21            0.85%            0.85%  ucrtbase.dll
  [python37.dll]                                                1              1            0.04%            0.04%  python37.dll

3.7.0a4:
Function Name                                       Incl. Samples  Excl. Samples  Incl. Samples %  Excl. Samples %  Module Name
+ [parsers.cp37-win_amd64.pyd]                              8,613            478           83.04%            4.61%  parsers.cp37-win_amd64.pyd
+ isdigit                                                   6,642            208           64.04%            2.01%  ucrtbase.dll
+ _isdigit_l                                                6,434            245           62.03%            2.36%  ucrtbase.dll
+ _LocaleUpdate::_LocaleUpdate                              5,806            947           55.98%            9.13%  ucrtbase.dll
+ __acrt_getptd                                             2,121          1,031           20.45%            9.94%  ucrtbase.dll
  FlsGetValue                                                 647            647            6.24%            6.24%  KernelBase.dll
- RtlSetLastWin32Error                                        296            235            2.85%            2.27%  ntdll.dll
  _guard_dispatch_icall_nop                                   101            101            0.97%            0.97%  ucrtbase.dll
  GetLastError                                                 46             46            0.44%            0.44%  KernelBase.dll
+ __acrt_update_multibyte_info                              1,475            246           14.22%            2.37%  ucrtbase.dll
- __crt_state_management::get_current_state_index           1,229            513           11.85%            4.95%  ucrtbase.dll
+ __acrt_update_locale_info                                 1,263            235           12.18%            2.27%  ucrtbase.dll
- __crt_state_management::get_current_state_index           1,028            429            9.91%            4.14%  ucrtbase.dll
  _ischartype_l                                               383            383            3.69%            3.69%  ucrtbase.dll




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

I tested this at runtime with sys._enablelegacywindowsfsencoding()

Also, this function was new in 3.6, and Python 3.6 does not have the slowdown 
issue.

New in version 3.6: See PEP 529 for more details.
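
A sketch of what such a runtime test might look like; the original message 
does not show the code that was run, so the helper name 
legacy_fs_encoding_report and its structure are assumptions. The function 
only exists on Windows builds, so the call is guarded.

```python
import sys


def legacy_fs_encoding_report():
    """Return the filesystem encoding before and after the legacy switch."""
    before = sys.getfilesystemencoding()
    switch = getattr(sys, "_enablelegacywindowsfsencoding", None)
    if sys.platform == "win32" and switch is not None:
        # Windows only: revert to the pre-PEP 529 filesystem encoding.
        switch()
    after = sys.getfilesystemencoding()
    return before, after


print(legacy_fs_encoding_report())
```

On other platforms the two values are identical because nothing is switched.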




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

I tried playing around with the UTF-8 mode settings but did not get a speed 
improvement.

After reading through the PEP it appears that on Windows:

"To allow for better cross-platform binary portability and to adjust 
automatically to future changes in locale availability, these checks will be 
implemented at runtime on all platforms other than Windows, rather than 
attempting to determine which locales to try at compile time."

So if I'm understanding this correctly, the locale coercion would not be 
controllable on Windows after Python is compiled?
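
To see which of these settings are active in a given interpreter, something 
like the following could be run. It is a sketch under assumptions: the helper 
name encoding_settings is illustrative, PYTHONUTF8 is the environment variable 
for UTF-8 mode and PYTHONCOERCECLOCALE the one for locale coercion, and the 
function only reports values. It does not by itself answer whether coercion 
can be controlled on Windows.

```python
import os
import sys


def encoding_settings():
    """Collect the startup settings related to UTF-8 mode and locale coercion."""
    return {
        # 1 when UTF-8 mode is on; the flag exists on Python 3.7+.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        "PYTHONUTF8": os.environ.get("PYTHONUTF8"),
        "PYTHONCOERCECLOCALE": os.environ.get("PYTHONCOERCECLOCALE"),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


for key, value in encoding_settings().items():
    print(f"{key}: {value}")
```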




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

After some more digging it appears that the 3.5x slowdown first manifests in 
Python 3.7.0a4 and is not present in Python 3.7.0a3.

One guess is that 

https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4

bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540

may contribute to this slowdown on Windows. Is there a way to ensure we disable 
any native-to-UTF conversion that may be happening in Python 3.7.0a4?




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-09 Thread Dragoljub


Dragoljub  added the comment:

After some more benchmarks I'm seeing this built-in method called in Python 3.7 
but not in Python 3.5:

{built-in method _thread.allocate_lock}
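
Entries in that format are what cProfile prints for C-level functions. A 
minimal sketch of how such a comparison could be produced follows; the helper 
name profile_report and the stand-in workload are assumptions, and the real 
benchmark would call pd.read_csv instead.

```python
import _thread
import cProfile
import io
import pstats


def profile_report(func, pattern=None):
    """Run func under cProfile and return the stats table as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
    if pattern is None:
        stats.print_stats()
    else:
        # Only list entries whose name matches the regular expression.
        stats.print_stats(pattern)
    return buffer.getvalue()


def workload():
    # Stand-in for the real call being benchmarked.
    for _ in range(100):
        _thread.allocate_lock()


print(profile_report(workload, "allocate_lock"))
```

Running the same report under both interpreters and comparing the listed 
entries shows which calls appear in only one version.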




[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10

2018-11-08 Thread Dragoljub


New submission from Dragoljub :

xref: https://github.com/pandas-dev/pandas/issues/23516

Example:
import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 10),
                  columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv) #3.5X slower on Python 3.7.1

pd.read_csv() reads data at about 30 MB/sec on Python 3.7.1, compared with 
about 100 MB/sec on Python 3.6.7.

This issue seems to be present only on Windows 10, in both x86 and x64 builds.

Could some IO changes in Python 3.7 have contributed to this slowdown on 
Windows but not on Linux?
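
One way to obtain MB/sec figures like these is sketched below. The report 
does not say how throughput was measured, so the definition used here (size 
of the CSV text divided by the best wall-clock parse time) is an assumption, 
as are the helper names. The standard csv module stands in for pd.read_csv so 
the sketch runs without Pandas.

```python
import csv
import io
import time


def throughput_mb_per_s(parse, text, repeat=3):
    """Return the best observed parse throughput in MB/s for `text`."""
    size_mb = len(text.encode("utf-8")) / 1e6
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        parse(io.StringIO(text))
        best = min(best, time.perf_counter() - start)
    return size_mb / best


def parse_with_csv(buffer):
    # Stand-in parser from the standard library; pass pd.read_csv instead
    # to measure the case reported here.
    return list(csv.reader(buffer))


sample = "a,b,c\n" + "1.5,2.5,3.5\n" * 10000
print(f"{throughput_mb_per_s(parse_with_csv, sample):.1f} MB/s")
```

Running the same script under each interpreter gives directly comparable 
numbers.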

--
components: IO
messages: 329490
nosy: Dragoljub
priority: normal
severity: normal
status: open
title: Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 
On Windows 10
type: performance
versions: Python 3.7
