[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-21 Thread Ezio Melotti

Ezio Melotti added the comment:

I did some macro-benchmarks and the proposed changes don't seem to affect the 
result (most likely because they are in _parse_doctype_element and 
_parse_doctype_attlist which should be called only once per document).

I did some profiling, and this is the result:
         4437196 function calls (4436748 primitive calls) in 36.582 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    92931    7.400    0.000   17.082    0.000 parser.py:320(parse_starttag)
      202    6.363    0.032   36.281    0.180 parser.py:171(goahead)
   673285    5.302    0.000    5.302    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
   369418    3.272    0.000    4.554    0.000 _markupbase.py:48(updatepos)
    83243    2.698    0.000    4.639    0.000 parser.py:421(parse_endtag)
   308882    2.006    0.000    2.006    0.000 {method 'group' of '_sre.SRE_Match' objects}
   270074    1.521    0.000    1.521    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    92931    1.150    0.000    2.643    0.000 parser.py:378(check_for_whole_start_tag)
   291079    1.028    0.000    1.028    0.000 {method 'count' of 'str' objects}
   295892    0.883    0.000    0.883    0.000 {method 'startswith' of 'str' objects}
   387439    0.733    0.000    0.733    0.000 {method 'lower' of 'str' objects}
   403922    0.642    0.000    0.642    0.000 {method 'end' of '_sre.SRE_Match' objects}
   124512    0.406    0.000    1.156    0.000 parser.py:504(unescape)
   186775    0.326    0.000    0.326    0.000 {method 'start' of '_sre.SRE_Match' objects}
    96213    0.255    0.000    0.255    0.000 {method 'endswith' of 'str' objects}
    59522    0.253    0.000    0.253    0.000 {method 'rindex' of 'str' objects}
    83226    0.215    0.000    0.215    0.000 parser.py:164(clear_cdata_mode)
     6428    0.194    0.000    0.337    0.000 parser.py:507(replaceEntities)
   106487    0.183    0.000    0.183    0.000 parser.py:484(handle_data)

Excluding string and regex methods, the 3 slowest methods are parse_starttag, 
goahead, and updatepos.
The attached patch adds a couple of simple optimizations to the first two -- I 
couldn't think of a way to optimize updatepos.
The resulting speedup is however fairly small, so I'm not sure it's worth 
applying the patch.
I might try doing other benchmarks in future (should I add them somewhere in 
Tools?).
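
A profile like the one above can be gathered with the standard cProfile and
pstats modules. The following is only a sketch: the helper name
profile_parser and the synthetic input are assumptions made for illustration,
not the benchmark actually used for the numbers in this message.

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


def profile_parser(html, top=10):
    """Parse `html` under cProfile and return the stats report as text."""
    parser = HTMLParser()
    profiler = cProfile.Profile()
    profiler.enable()
    parser.feed(html)
    parser.close()
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" is the sort key reported as "Ordered by: internal time".
    stats.sort_stats("tottime").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical input: a small fragment repeated many times.
    sample = '<div class="c"><a href="/x?a=1&amp;b=2">link</a><p>text</p></div>\n' * 5000
    print(profile_parser(sample))
```

Sorting by tottime rather than cumtime keeps the focus on where time is spent
inside each function, which is how the three slowest methods were picked out.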

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue17183
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-21 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
keywords: +patch
Added file: http://bugs.python.org/file29158/issue17183.diff




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-16 Thread Guido Reina

Guido Reina added the comment:

I am attaching a .tgz file with the tests I have performed.

The .tgz file also contains a README.txt file with more detailed information.

I have done the following test:
The script loads the HTML file 'search.html' into 'rawdata' and searches for 
'>' in a loop, starting from position 'i' for each i in range(len(rawdata)).

I ran it with three variants: in + find (test1.py), find (test2.py), index 
(test3.py).

Result:
Script     First run   Second run  Third run
---------------------------------------------
test1.py   2.33        2.32        2.33
test2.py   0.75        0.74        0.76
test3.py   0.75        0.74        0.74


I don't know if the test is representative and whether it helps.
If you think that the test could be improved/changed, just let me know, I will 
be happy to help.
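
The three variants described above can be sketched roughly as follows. This
is not the content of test.tgz: the function names, the stand-in document and
the repeat count are assumptions, and the timings it prints will differ from
the table.

```python
import timeit


def with_in_and_find(rawdata, j):
    # Variant 1: membership test on a slice, then find (a copy plus two scans).
    if ">" in rawdata[j:]:
        return rawdata.find(">", j) + 1
    return -1


def with_find(rawdata, j):
    # Variant 2: a single find call.
    pos = rawdata.find(">", j)
    if pos != -1:
        return pos + 1
    return -1


def with_index(rawdata, j):
    # Variant 3: index, with the miss handled via ValueError.
    try:
        return rawdata.index(">", j) + 1
    except ValueError:
        return -1


def run_all(func, rawdata):
    # Search from every start position, as in the test described above.
    return [func(rawdata, i) for i in range(len(rawdata))]


if __name__ == "__main__":
    # Hypothetical stand-in for 'search.html'.
    rawdata = "<!DOCTYPE html><html><body><p>hello</p></body></html>" * 20
    for func in (with_in_and_find, with_find, with_index):
        seconds = timeit.timeit(lambda: run_all(func, rawdata), number=20)
        print(f"{func.__name__:18s} {seconds:.3f}s")
```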

--
Added file: http://bugs.python.org/file29084/test.tgz




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-15 Thread Ezio Melotti

Ezio Melotti added the comment:

We should add some benchmarks to see if there is any difference between the two 
forms.

--




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-15 Thread Terry J. Reedy

Terry J. Reedy added the comment:

'Enhancement' issues are for visible behavior additions (or occasionally, 
changes). This is intended to be an invisible small speedup, hence it is a 
'performance' issue, and gets a different title.

As explained in #17170, the change will not be a speedup if the substring being 
looked for is usually not there. The reason is the .find lookup and function 
call versus the direct syntax. Even if it is faster, I strongly doubt it would 
be hardly noticeable in the context of this function, which itself is a small 
piece of parsing an entire document, and it is against our policy to make such 
micro-optimizations in working code.
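
The miss case can be measured with a sketch like the one below. The input
string and iteration count are arbitrary assumptions, and no particular result
is implied; it only shows how the direct syntax and the method call could be
compared with timeit.

```python
import timeit

# Hypothetical short input that does not contain the character looked for.
short = "abcdefghij"

# Direct syntax: handled by the `in` operator, with no attribute lookup
# and no method call.
t_in = timeit.timeit("'>' in s", globals={"s": short}, number=1_000_000)

# Method form: attribute lookup, call, and comparison of the result.
t_find = timeit.timeit("s.find('>') != -1", globals={"s": short}, number=1_000_000)

print(f"in:   {t_in:.3f}s")
print(f"find: {t_find:.3f}s")
```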

The complete block in question (Lib/_markupbase.py, 254:7) is

  rawdata = self.rawdata
  if '>' in rawdata[j:]:
      return rawdata.find(">", j) + 1
  return -1

[Ugh. Localizing rawdata negates some of whatever advantage is gained from the 
double scan.]

If I were to rewrite it, I would replace it with

  try:
      return self.rawdata.index(">", j) + 1
  except ValueError:
      return -1

as better style, and a better example for readers, regardless of micro speed 
differences. But style-only changes in working code are also against our policy. 
So I would be closing this if Ezio had not grabbed it ;-).

--
nosy: +terry.reedy




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-15 Thread Ezio Melotti

Ezio Melotti added the comment:

I would still do a benchmark, for these reasons:
1) IIRC rawdata might be the whole document (or at least everything that has 
not been parsed yet);
2) the '>' is very likely to be found;

This situation is fairly different from the one presented in #17170, where the 
strings are short and the character is not present in the majority of the 
strings.

Profiling and improving html.parser (and hence _markupbase) was already on my 
todo list (even if admittedly not anywhere near the top :), so writing a 
benchmark for it might be useful for further enhancements too.

(Note: HTMLParser is already fairly fast, parsing ~1.3MB/s according to 
http://www.crummy.com/2012/02/06/0, but I've never done anything to make it 
even faster, so there might still be room for improvements.)
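
A rough way to measure parser throughput is sketched below. The function
name parse_throughput and the synthetic document are assumptions for
illustration; this is not the methodology behind the ~1.3MB/s figure, so the
numbers are not comparable.

```python
import time
from html.parser import HTMLParser


def parse_throughput(html):
    """Return parsing speed in MB/s for one pass over `html`."""
    parser = HTMLParser()
    start = time.perf_counter()
    parser.feed(html)
    parser.close()
    elapsed = time.perf_counter() - start
    return len(html.encode("utf-8")) / 1e6 / elapsed


if __name__ == "__main__":
    # Hypothetical input: a doctype followed by a repeated paragraph.
    doc = "<!DOCTYPE html>\n" + '<p class="x">some <b>bold</b> text &amp; more</p>\n' * 50000
    print(f"{parse_throughput(doc):.2f} MB/s")
```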

--
type: enhancement -> performance




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-11 Thread Guido Reina

New submission from Guido Reina:

In the file: Lib/_markupbase.py, function: _parse_doctype_element there is:

    if '>' in rawdata[j:]:
        return rawdata.find(">", j) + 1

rawdata[j:] is being scanned twice.

It would be better to do:
    pos = rawdata.find(">", j)
    if pos != -1:
        return pos + 1


Same thing in the function: _parse_doctype_attlist:

    if ")" in rawdata[j:]:
        j = rawdata.find(")", j) + 1
    else:
        return -1

It would be better to do:
    pos = rawdata.find(")", j)
    if pos != -1:
        j = pos + 1
    else:
        return -1
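
As a quick sanity check that the proposed form returns the same result as the
current one, the two can be compared over every start position of a few
inputs. The helper names and sample strings below are hypothetical, and the
functions return the new position instead of assigning j as the real method
does.

```python
def skip_paren_current(rawdata, j):
    # Current form: slice + membership test, then a second scan with find.
    if ")" in rawdata[j:]:
        return rawdata.find(")", j) + 1
    return -1


def skip_paren_proposed(rawdata, j):
    # Proposed form: one scan, no temporary slice.
    pos = rawdata.find(")", j)
    if pos != -1:
        return pos + 1
    return -1


# Made-up inputs, including misses and the empty string.
samples = ["(a|b|c) #IMPLIED>", "(a|b", "", ")", "x)y)z"]
for s in samples:
    for j in range(len(s) + 1):
        assert skip_paren_current(s, j) == skip_paren_proposed(s, j)
```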

--
messages: 181903
nosy: guido
priority: normal
severity: normal
status: open
title: Small enhancements to Lib/_markupbase.py
type: enhancement
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 
3.4, Python 3.5




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-11 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
assignee:  -> ezio.melotti
components: +Library (Lib)
nosy: +ezio.melotti
stage:  -> needs patch
versions:  -Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 
3.5




[issue17183] Small enhancements to Lib/_markupbase.py

2013-02-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

    if '>' in rawdata[j:]:
        return rawdata.find(">", j) + 1

See issue17170 for this idiom.

--
nosy: +serhiy.storchaka
