[issue17183] Small enhancements to Lib/_markupbase.py
Ezio Melotti added the comment:

I did some macro-benchmarks and the proposed changes don't seem to affect the result (most likely because they are in _parse_doctype_element and _parse_doctype_attlist, which should be called only once per document). I did some profiling, and this is the result:

    4437196 function calls (4436748 primitive calls) in 36.582 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     92931    7.400    0.000   17.082    0.000 parser.py:320(parse_starttag)
       202    6.363    0.032   36.281    0.180 parser.py:171(goahead)
    673285    5.302    0.000    5.302    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
    369418    3.272    0.000    4.554    0.000 _markupbase.py:48(updatepos)
     83243    2.698    0.000    4.639    0.000 parser.py:421(parse_endtag)
    308882    2.006    0.000    2.006    0.000 {method 'group' of '_sre.SRE_Match' objects}
    270074    1.521    0.000    1.521    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
     92931    1.150    0.000    2.643    0.000 parser.py:378(check_for_whole_start_tag)
    291079    1.028    0.000    1.028    0.000 {method 'count' of 'str' objects}
    295892    0.883    0.000    0.883    0.000 {method 'startswith' of 'str' objects}
    387439    0.733    0.000    0.733    0.000 {method 'lower' of 'str' objects}
    403922    0.642    0.000    0.642    0.000 {method 'end' of '_sre.SRE_Match' objects}
    124512    0.406    0.000    1.156    0.000 parser.py:504(unescape)
    186775    0.326    0.000    0.326    0.000 {method 'start' of '_sre.SRE_Match' objects}
     96213    0.255    0.000    0.255    0.000 {method 'endswith' of 'str' objects}
     59522    0.253    0.000    0.253    0.000 {method 'rindex' of 'str' objects}
     83226    0.215    0.000    0.215    0.000 parser.py:164(clear_cdata_mode)
      6428    0.194    0.000    0.337    0.000 parser.py:507(replaceEntities)
    106487    0.183    0.000    0.183    0.000 parser.py:484(handle_data)

Excluding string and regex methods, the 3 slowest methods are parse_starttag, goahead, and updatepos. The attached patch adds a couple of simple optimizations to the first two -- I couldn't think of a way to optimize updatepos. The resulting speedup is however fairly small, so I'm not sure it's worth applying the patch.
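For readers who want to collect a comparable profile, the following is a minimal sketch using cProfile and pstats from the standard library. It is not the script used for the numbers above: the CountingParser class, the profile_parse helper, and the synthetic SAMPLE document are all assumptions made for illustration, and the absolute timings will differ by machine and Python version.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Hypothetical parser that only counts start tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1


def profile_parse(html, limit=10):
    """Parse `html` under cProfile; return (start tag count, report text)."""
    parser = CountingParser()
    profiler = cProfile.Profile()
    profiler.enable()
    parser.feed(html)
    parser.close()
    profiler.disable()

    # Sort by internal time (tottime), as in the listing above.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("tottime").print_stats(limit)
    return parser.start_tags, out.getvalue()


# Synthetic input (an assumption); a real benchmark would use real pages.
SAMPLE = "<html><body>" + '<p class="x">text &amp; more</p>' * 2000 + "</body></html>"

count, report = profile_parse(SAMPLE)
print(report)
```

Sorting by "tottime" makes it easy to separate time spent inside a function from time spent in its callees, which is how the three slowest methods were singled out.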
I might try doing other benchmarks in future (should I add them somewhere in Tools?). -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue17183 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue17183] Small enhancements to Lib/_markupbase.py
Changes by Ezio Melotti ezio.melo...@gmail.com: -- keywords: +patch Added file: http://bugs.python.org/file29158/issue17183.diff
[issue17183] Small enhancements to Lib/_markupbase.py
Guido Reina added the comment:

I am attaching a .tgz file with the tests I have performed. The .tgz file also contains a README.txt file with more detailed information.

I have done the following test: the script loads the HTML file 'search.html' into 'rawdata' and searches for '>' in a loop starting from position 'i', for i in range(len(rawdata)), with three variants: in + find (test1.py), find (test2.py), and index (test3.py).

Result:

    Script     First run  Second run  Third run
    test1.py   2.33       2.32        2.33
    test2.py   0.75       0.74        0.76
    test3.py   0.75       0.74        0.74

I don't know if the test is representative and whether it helps. If you think that the test could be improved/changed, just let me know, I will be happy to help.

--
Added file: http://bugs.python.org/file29084/test.tgz
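The attached scripts are not reproduced in this thread, so the following is only a rough sketch of what such a comparison could look like. The function names, the scan_all helper, and the synthetic RAWDATA string (a stand-in for 'search.html') are assumptions; timings are printed rather than predicted, since they depend on the machine and the Python version.

```python
import timeit

# Synthetic stand-in for the contents of 'search.html' (an assumption).
RAWDATA = '<div class="item"><a href="#">link</a> some text</div>\n' * 200


def in_then_find(rawdata, j):
    """Variant 1: membership test on a slice, then find()."""
    if ">" in rawdata[j:]:
        return rawdata.find(">", j) + 1
    return -1


def find_only(rawdata, j):
    """Variant 2: a single find() call."""
    pos = rawdata.find(">", j)
    if pos != -1:
        return pos + 1
    return -1


def index_only(rawdata, j):
    """Variant 3: index() with ValueError handling."""
    try:
        return rawdata.index(">", j) + 1
    except ValueError:
        return -1


def scan_all(func, rawdata=RAWDATA):
    """Run `func` from every start position, as the test description says."""
    return [func(rawdata, i) for i in range(len(rawdata))]


if __name__ == "__main__":
    for func in (in_then_find, find_only, index_only):
        elapsed = timeit.timeit(lambda: scan_all(func), number=5)
        print(f"{func.__name__:>14}: {elapsed:.3f} s")
```

All three variants return the same values; only the amount of work per call differs, because the first one copies and scans the slice before calling find().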
[issue17183] Small enhancements to Lib/_markupbase.py
Ezio Melotti added the comment: We should add some benchmarks to see if there is any difference between the two forms.
[issue17183] Small enhancements to Lib/_markupbase.py
Terry J. Reedy added the comment:

'Enhancement' issues are for visible behavior additions (or occasionally, changes). This is intended to be an invisible small speedup, hence it is a 'performance' issue, and gets a different title.

As explained in #17170, the change will not be a speedup if the substring being looked for is usually not there. The reason is the .find lookup and function call versus the direct syntax. Even if it is faster, I strongly doubt it would be noticeable in the context of this function, which itself is a small piece of parsing an entire document, and it is against our policy to make such micro-optimizations in working code.

The complete block in question, Lib/_markupbase.py, 254:7, is

    rawdata = self.rawdata
    if '>' in rawdata[j:]:
        return rawdata.find(">", j) + 1
    return -1

[Ugh. Localizing rawdata negates some of whatever advantage is gained from the double scan.]

If I were to rewrite it, I would replace it with

    try:
        return self.rawdata.index(">", j) + 1
    except ValueError:
        return -1

as better style, and a better example for readers, regardless of micro speed differences. But style-only changes in working code are also against our policy. So I would be closing this if Ezio had not grabbed it ;-).

--
nosy: +terry.reedy
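The point about #17170 (a miss costs only the `in` test in the original form, but a method lookup and call in the find() form) can be checked with a small timeit sketch. The helper names and both sample strings below are assumptions chosen for illustration; no particular outcome is claimed, since results vary with string length, hit rate, and interpreter version.

```python
import timeit


def has_gt_in_slice(rawdata, j):
    """First step of the original idiom: membership test on a slice."""
    return ">" in rawdata[j:]


def has_gt_find(rawdata, j):
    """Equivalent check through a str.find() method call."""
    return rawdata.find(">", j) != -1


# Hypothetical inputs: a short string without '>' (the #17170 situation)
# and a longer one where '>' is present.
SHORT_NO_MATCH = "DOCTYPE html PUBLIC"
LONG_WITH_MATCH = "x" * 10_000 + ">"

if __name__ == "__main__":
    cases = {"short, miss": SHORT_NO_MATCH, "long, hit": LONG_WITH_MATCH}
    for label, text in cases.items():
        for func in (has_gt_in_slice, has_gt_find):
            elapsed = timeit.timeit(lambda: func(text, 1), number=10_000)
            print(f"{label:>12} {func.__name__:>16}: {elapsed:.4f} s")
```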
[issue17183] Small enhancements to Lib/_markupbase.py
Ezio Melotti added the comment:

I would still do a benchmark, for these reasons:
1) IIRC rawdata might be the whole document (or at least everything that has not been parsed yet);
2) the '>' is very likely to be found.

This situation is fairly different from the one presented in #17170, where the strings are short and the character is not present in the majority of the strings.

Profiling and improving html.parser (and hence _markupbase) was already on my todo list (even if admittedly not anywhere near the top :), so writing a benchmark for it might be useful for further enhancements too.

(Note: HTMLParser is already fairly fast, parsing ~1.3MB/s according to http://www.crummy.com/2012/02/06/0, but I've never done anything to make it even faster, so there might still be room for improvements.)

--
type: enhancement -> performance
[issue17183] Small enhancements to Lib/_markupbase.py
New submission from Guido Reina:

In the file Lib/_markupbase.py, function _parse_doctype_element, there is:

    if '>' in rawdata[j:]:
        return rawdata.find(">", j) + 1

rawdata[j:] is being scanned twice. It would be better to do:

    pos = rawdata.find(">", j)
    if pos != -1:
        return pos + 1

Same thing in the function _parse_doctype_attlist:

    if ")" in rawdata[j:]:
        j = rawdata.find(")", j) + 1
    else:
        return -1

It would be better to do:

    pos = rawdata.find(")", j)
    if pos != -1:
        j = pos + 1
    else:
        return -1

--
messages: 181903
nosy: guido
priority: normal
severity: normal
status: open
title: Small enhancements to Lib/_markupbase.py
type: enhancement
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 3.4, Python 3.5
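As a quick check that the proposed form is a drop-in replacement, the sketch below isolates the ')' case from _parse_doctype_attlist into two standalone functions and compares them at every start position. The function names and the sample declaration are assumptions made for illustration; the real method does more work around this step.

```python
def skip_enum_before(rawdata, j):
    """Current form: scan a copy of rawdata[j:], then scan again with find()."""
    if ")" in rawdata[j:]:
        return rawdata.find(")", j) + 1
    return -1


def skip_enum_after(rawdata, j):
    """Proposed form: one find() call, no slice copy."""
    pos = rawdata.find(")", j)
    if pos != -1:
        return pos + 1
    return -1


# Hypothetical ATTLIST declaration with an enumerated type.
DECL = '<!ATTLIST img align (left|right|center) "left">'

start = DECL.index("(")
new_j = skip_enum_after(DECL, start)
print(DECL[start:new_j])  # the enumerated type, parentheses included

# Both forms agree for every start position, including past the last ')'.
for j in range(len(DECL) + 1):
    assert skip_enum_before(DECL, j) == skip_enum_after(DECL, j)
```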
[issue17183] Small enhancements to Lib/_markupbase.py
Changes by Ezio Melotti ezio.melo...@gmail.com:

--
assignee: -> ezio.melotti
components: +Library (Lib)
nosy: +ezio.melotti
stage: -> needs patch
versions: -Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 3.5
[issue17183] Small enhancements to Lib/_markupbase.py
Serhiy Storchaka added the comment:

    if '>' in rawdata[j:]:
        return rawdata.find(">", j) + 1

See issue17170 for this idiom.

--
nosy: +serhiy.storchaka