[issue35195] [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit()
Dragoljub added the comment:

Do we know if it's possible to prevent LC_CTYPE from being initialized at startup? Is there some combination of environment variables or command-line arguments that can avoid this slowdown on Windows?

What are the next steps to get the issue assigned?

--
Python tracker <https://bugs.python.org/issue35195>
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue35195] [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit()
Dragoljub added the comment:

This is the default LC_CTYPE locale I see on Windows 10 with Python 3.7.1 vs. 3.5.2. It is the same in both: 'C'

'3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)]'
import _locale
_locale.setlocale(_locale.LC_CTYPE, None)
'C'

'3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]'
import _locale
_locale.setlocale(_locale.LC_CTYPE, None)
'C'
[issue35195] [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit()
Dragoljub added the comment:

On Python 3.7.1 and Windows 10, I attempted locale.setlocale(locale.LC_ALL, "POSIX"), which errors out:

Error                                     Traceback (most recent call last)
----> 1 locale.setlocale(locale.LC_ALL, "POSIX")

D:\Python37\lib\locale.py in setlocale(category, locale)
    602         # convert to string
    603         locale = normalize(_build_localename(locale))
--> 604     return _setlocale(category, locale)
    605
    606 def resetlocale(category=LC_ALL):

Error: unsupported locale setting

I was able to set the locale to "C", but that does not improve the parsing performance:

locale.setlocale(locale.LC_ALL, "C")  --> returns 'C'
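The attempt above can be wrapped so that an unsupported locale name is reported instead of raising. This is a minimal sketch using only the standard locale module; the helper name try_setlocale is hypothetical and not part of the original report.

```python
import locale


def try_setlocale(name, category=locale.LC_ALL):
    """Try to switch to the named locale.

    Returns the locale string reported by setlocale() on success,
    or None if the platform does not support that locale name.
    Note that a successful call changes the process-wide locale.
    """
    try:
        return locale.setlocale(category, name)
    except locale.Error:
        # Raised with "unsupported locale setting" for unknown names.
        return None


if __name__ == "__main__":
    for candidate in ("POSIX", "C"):
        print(candidate, "->", try_setlocale(candidate))
```

Which names are accepted depends on the platform's C runtime, so the result for "POSIX" may differ between Windows and other systems.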
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

Here is a simple pure-Python example:

digits = ''.join([str(i) for i in range(10)]*1000)
%timeit digits.isdigit()  # --> 2X+ slower on Python 3.7.1

Basically, in the Pandas C parser we call the isdigit() function for each character of a number that is to be parsed, so 12345.6789 calls isdigit() 9 times to determine whether each character is a digit that can be converted to a float. The problem is that in the latest version of Python, with the locale updates, isdigit() takes a locale argument that seems to be passed over and over, slowing down this check.

Is it possible to disable any locale passing from Python down to lower-level C code, or simply set the default locale to 'C' to keep it from thrashing?
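Outside IPython, the %timeit measurement above can be approximated with the standard timeit module, which makes it easier to run the same script under two interpreters and compare. This is only a sketch: the helper name time_isdigit and the repeat/number values are illustrative assumptions, not taken from the report.

```python
import sys
import timeit

# Same 10,000-character digit string as in the comment above.
digits = "".join([str(i) for i in range(10)] * 1000)


def time_isdigit(repeat=5, number=1000):
    """Return the best observed time per digits.isdigit() call, in seconds."""
    timings = timeit.repeat(digits.isdigit, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    per_call = time_isdigit()
    print(sys.version)
    print("isdigit(): {:.2f} microseconds per call".format(per_call * 1e6))
```

Taking the minimum over several repeats reduces noise from other processes; the absolute numbers will vary by machine.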
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

@Vstinner, is there any way you can help test a config setting to avoid the locale changes on Python 3.7.0a4+? They currently cause the low-level isdigit() function to call the locale-specific function on Windows and update the locale on each call, slowing down CSV parsing on Windows by 3.5X.

How can we configure Python so that it does not differ from 3.6.7 when it comes to locale behavior?
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

@cgohlke compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast at ~8% of samples. On 3.7.0a4 the function is slow at ~64% of samples because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time...

3.7.0a3:

Function Name                   Incl. Samples  Excl. Samples  Incl. %  Excl. %  Module Name
+ [parsers.cp37-win_amd64.pyd]            705            347   28.52%   14.04%  parsers.cp37-win_amd64.pyd
isdigit                                   207            207    8.37%    8.37%  ucrtbase.dll
- _errno                                  105             39    4.25%    1.58%  ucrtbase.dll
toupper                                    24             24    0.97%    0.97%  ucrtbase.dll
isspace                                    21             21    0.85%    0.85%  ucrtbase.dll
[python37.dll]                              1              1    0.04%    0.04%  python37.dll

3.7.0a4:

Function Name                                      Incl. Samples  Excl. Samples  Incl. %  Excl. %  Module Name
+ [parsers.cp37-win_amd64.pyd]                             8,613            478   83.04%    4.61%  parsers.cp37-win_amd64.pyd
+ isdigit                                                  6,642            208   64.04%    2.01%  ucrtbase.dll
+ _isdigit_l                                               6,434            245   62.03%    2.36%  ucrtbase.dll
+ _LocaleUpdate::_LocaleUpdate                             5,806            947   55.98%    9.13%  ucrtbase.dll
+ __acrt_getptd                                            2,121          1,031   20.45%    9.94%  ucrtbase.dll
FlsGetValue                                                  647            647    6.24%    6.24%  KernelBase.dll
- RtlSetLastWin32Error                                       296            235    2.85%    2.27%  ntdll.dll
_guard_dispatch_icall_nop                                    101            101    0.97%    0.97%  ucrtbase.dll
GetLastError                                                  46             46    0.44%    0.44%  KernelBase.dll
+ __acrt_update_multibyte_info                             1,475            246   14.22%    2.37%  ucrtbase.dll
- __crt_state_management::get_current_state_index          1,229            513   11.85%    4.95%  ucrtbase.dll
+ __acrt_update_locale_info                                1,263            235   12.18%    2.27%  ucrtbase.dll
- __crt_state_management::get_current_state_index          1,028            429    9.91%    4.14%  ucrtbase.dll
_ischartype_l                                                383            383    3.69%    3.69%  ucrtbase.dll
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

I tested this at runtime with sys._enablelegacywindowsfsencoding(). That function was also new in 3.6, and Python 3.6 does not have the slowdown issue. From the documentation:

"New in version 3.6: See PEP 529 for more details."
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

I tried playing around with the UTF-8 mode settings but did not get a speed improvement. After reading through the PEP, this appears to be the relevant passage for Windows:

"To allow for better cross-platform binary portability and to adjust automatically to future changes in locale availability, these checks will be implemented at runtime on all platforms other than Windows, rather than attempting to determine which locales to try at compile time."

So if I'm understanding this correctly, the locale coercion would not be controllable on Windows after Python is compiled?
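When experimenting with the UTF-8 mode settings (the -X utf8 option or the PYTHONUTF8 environment variable from PEP 540), it can help to print what the interpreter actually ended up with. The following is a minimal sketch; the function name describe_encoding_state and the dictionary keys are hypothetical, and sys.flags.utf8_mode is only available on Python 3.7 and later, hence the getattr fallback.

```python
import locale
import sys


def describe_encoding_state():
    """Collect the locale and encoding settings of the running interpreter."""
    return {
        # 1 if UTF-8 mode is enabled, 0 if not, None before Python 3.7.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # False: do not call setlocale() as a side effect of the query.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Passing no locale argument only queries the current LC_CTYPE.
        "lc_ctype": locale.setlocale(locale.LC_CTYPE),
    }


if __name__ == "__main__":
    for key, value in describe_encoding_state().items():
        print("{}: {!r}".format(key, value))
```

Running the script once normally and once with python -X utf8 shows which of these values the UTF-8 mode changes on a given platform.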
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

After some more digging, it appears that the 3.5x slowdown manifests in Python 3.7.0a4 and is not present in Python 3.7.0a3. One guess is that the following change may contribute to this slowdown on Windows:

https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4
bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540

Is there a way to ensure we disable any native-to-UTF conversion that may be happening in Python 3.7.0a4?
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
Dragoljub added the comment:

After some more benchmarks, I'm seeing this function called in Python 3.7 but not in Python 3.5:

{built-in method _thread.allocate_lock}
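The entry above is in the format that cProfile uses for built-in functions. The comment does not say how the benchmark was run, so the following is only a sketch, assuming cProfile, of how a call could be profiled and the report text compared between interpreter versions. The helper name profile_calls and the placeholder workload are hypothetical.

```python
import cProfile
import io
import pstats


def profile_calls(func, *args, **kwargs):
    """Run func under cProfile and return the statistics report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats()
    return stream.getvalue()


def workload():
    # Placeholder; substitute the call being investigated.
    return sum(range(1000))


if __name__ == "__main__":
    report = profile_calls(workload)
    # Check whether the lock allocation shows up in this interpreter.
    print("_thread.allocate_lock" in report)
    print(report)
```

Saving the report from each interpreter and diffing the function names is one way to spot calls that appear in only one version.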
[issue35195] Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
New submission from Dragoljub:

xref: https://github.com/pandas-dev/pandas/issues/23516

Example:

import io
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)  # 3.5X slower on Python 3.7.1

pd.read_csv() reads data at 30MB/sec on Python 3.7.1 but at 100MB/sec on Python 3.6.7. This issue seems to be present only on Windows 10 builds, both x86 and x64.

Possibly some IO changes in Python 3.7 could have contributed to this slowdown on Windows but not on Linux?

--
components: IO
messages: 329490
nosy: Dragoljub
priority: normal
severity: normal
status: open
title: Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10
type: performance
versions: Python 3.7