[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread kxroberto

kxroberto kxrobe...@users.sourceforge.net added the comment:

Well in many browsers for example there is a internal warning and error log 
(window). Which yet does not (need to) claim to be a official W3C checker. It 
has positive effect on web stabilization. 
For example just looking now I see the many HTML and CSS warnings and errors 
about the sourceforge site and this bug tracker in the Browsers log - not 
believing that the log covers the bugs 100% ;-)

The events of warnings are easily available here, and calling self.warning, as 
it was, costs quite nothing. I don't see a problem for non-users of this 
feature. And most code using HTMLParser also emits warnings on the next higher 
syntax level, so to not have a black box...
 
As I used a tolerant version of HTMLParser for about a decade, I can say the 
warnings are of the same value in many apps and use case, as to be able to have 
look into a Browsers syntax log. 
The style of stretching a argument to black-white is not reasonable here in 
the world of human edited HTML ;-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The strict/tolerant mode mainly works by using either a strict or a tolerant 
regex.  If the markup is invalid, the strict regex doesn't match and it gives 
an error.  The tolerant regex will match both valid and invalid markup at the 
same time, without distinctions, and that's why there's no way to emit a 
warning for these cases.  I think there are a couple of places where a warning 
could be emitted, but that would just cover a small percentage of the errors.  
Even if we find a way to emit a warning for everything allowed by the tolerant 
mode that fails on strict, it won't still cover all the possible errors, that's 
why I think tools like validators and conformance checkers (or even the 
warning/error logs) should be used instead.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread kxroberto

kxroberto kxrobe...@users.sourceforge.net added the comment:

The old patch warned already the majority of real cases  - except the missing 
white space between attributes.

The tolerant regex will match both: 
locatestarttagend_tolerant: The main and frequent issue on the web here is the 
missing white space between attributes (with enclosed values). And there is the 
new tolerant comma between attributes, which however I have not seen so far 
anywhere (the old warning machanism and attrfind.match would have already 
raised it at junk chars ... event.
Both issues can be easily warned (also/already) at quite no cost by the 
slightly extended regex below (when the 2 new non-pseudo regex groups are check 
against None in check_for_whole_start_tag). 
Or missing whitespace could be warned (multiple times) at attrfind time.

attrfind_tolerant : I see no point in the old/strict attrfind. (and the 
difference is guessed 0.000% of real cases). attrfind_tolerant  could become 
the only attrfind.


--

locatestarttagend_tolerant = re.compile(r
  [a-zA-Z][-.a-zA-Z0-9:_]*  # tag name
  (?:(?:\s+|(\s*))   # optional whitespace before attribute name
(?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
  (?:\s*=\s* # value indicator
(?:'[^']*'   # LITA-enclosed value
  |\[^\]*\# LIT-enclosed value
  |[^'\\s]+# bare value
 )
 (?:\s*(,))*   # possibly followed by a comma
   )?
 )
   )*
  \s*# trailing whitespace
, re.VERBOSE)
attrfind_tolerant = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|[^]*|[^\s]*))?')


#s='abc a=b,+c=de=ftext'
#s='abc a=b,+ c=de=ftext'
s='abc a=b,+,c=d e=ftext'

m = locatestarttagend_tolerant.search(s)
print m.group()
print m.groups()
#if m.group(1) is not None: self.warning('space missing ...
#if m.group(2) is not None: self.warning('comma between attr...

m = attrfind_tolerant.search(s, 5)
print m.group()
print m.groups()

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Note that the regex and the way the parser considers the commas changed in 
16ed15ff0d7c (it now considers them as the name of a value-less attribute), so 
adding a group for the comma is no longer doable.

In theory, the approach you suggest might work, but if we want some warning 
mechanism it should be generic enough to work with all kind of invalid markup.  
In addition this adds complexity to already complex regular expressions, so 
there should be a valid use case for this.
Also keep in mind that HTMLParser won't do any check about the validity of the 
elements' names or attributes' names/values, or even if they are nested/closed 
correctly, so even with a comprehensive set of warnings, you won't still be 
able to use HTMLParser to validate your pages.

--
nosy: +eric.araujo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread kxroberto

kxroberto kxrobe...@users.sourceforge.net added the comment:

16ed15ff0d7c was not in current stable py3.2 so I missed it..

When the comma is now raised as attribute name, then the problem is anyway 
moved to the higher level anyway - and is/can be handled easily there by usual 
methods.
(still I guess locatestarttagend_tolerant matches a free standing comma extra 
after an attribute)

should be generic enough to work with all kind of invalid markup: I think we 
would be rather complete then (-missing space issue)- at least regarding %age 
of real cases. And it could be improved with few touches over time if something 
missing. 100% is not the point unless it shall drive the official W3C checker. 
The call of self.warning, as in old patch, doesn't cost otherwise and I see no 
real increase of complexity/cpu-time.

HTMLParser won't do any check about the validity of the elements' names or 
attributes' names/values: yes thats of course up to the next level handler 
(BTDT)- thus the possibilty of error handling is not killed. Its about what 
HTMLParser _hides_ irrecoverably.

there should be a valid use case for this: Almost any app which parses HTML 
(self authored or remote) can have (should have?) a no-fuzz/collateral warn log 
option. (-no need to make a expensive W3C checker session). I mostly have this 
in use as said, as it was anyway there.

Well, as for me, I use anyway a private backport to Python2 of this. I try to 
avoid Python3 as far as possible. (No real plus, too much problems) So for me 
its just about joining Python4 in the future perhaps - which can do true 
CPython multithreading, stackless, psyco/static typing ... and print statement 
again without typing so many extra braces ;-)
I considered extra libs like the HTML tidy binding, but this is all too much 
fuzz for most cases. And HTMLParser has already quite everything, with the few 
calls inserted ..

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-16 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 16ed15ff0d7c was not in current stable py3.2 so I missed it..

It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently 
you might have missed it).

 When the comma is now raised as attribute name, then the problem is 
 anyway moved to the higher level anyway - and is/can be handled easily 
 there by usual methods.

The next level could/should validate the name of the attribute and determine 
that ',' is not a valid attribute name, so in this case there's no warning to 
raise here (actually you could detect that it's not a-zA-Z (or whatever the 
specs say) and raise a more general warning even at this level, but no 
information is lost here about this).

 100% is not the point unless it shall drive the official W3C checker.

I'm still not sure that having 70-80% is useful (unless we can achieve 100% on 
this level and leave the rest to an upper layer).  If you think this is doable 
you could try to first identify what errors should be detected by this layer, 
see if they are all detectable and then propose a patch.

 The call of self.warning, as in old patch, doesn't cost otherwise and
 I see no real increase of complexity/cpu-time.

The extra complexity is mainly in the already complex regular expressions, and 
also in the list of 'if' that will have to check the content of the groups to 
report the warnings.  These changes are indeed not too invasive, but they still 
make the code more complicated.

 Almost any app which parses HTML (self authored or remote) can have 
 (should have?) a no-fuzz/collateral warn log option. (-no need to 
 make a expensive W3C checker session).

I think the original goal of HTMLParser was parsing mostly-valid HTML.  People 
started reporting issues with less-valid HTML, and these issues got fixed to 
make it able to parse non-valid HTML.  AFAIK it never followed strictly any 
HTML standard, and it just provided a best-effort way to get data out of an 
HTML page.  So, I would consider doing validation or even being a building 
block for a conforming parser out of the scope of the module.

 I mostly have this in use as said, as it was anyway there.

If 'this' refers to some kind of warning system, what do you do with these 
warnings?   Do you fix them, avoid using the w3c validator (or any other 
conforming validator) and consider a mostly-valid page good enough?  Or do you 
fix them, and then you also check with the w3c validator?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-15 Thread kxroberto

kxroberto kxrobe...@users.sourceforge.net added the comment:

I looked at the new patch http://hg.python.org/lookup/r86952 for Py3 (regarding 
the extended tolerance and local backporting to Python2.7):

What I miss are the calls of a kind of self.warning(msg,i,k) function in 
non-strict/tolerant mode (where self.error is called in strict mode). Such 
function could be empty or could be a silent simple counter (like in the old 
patch) - and could be easily sub-classed for advanced use.
I often want at least the possibilty of a HTML error log - so the HTML author 
(sometimes its me myself) can be noticed to get it more strict on the long run 
;-) ...

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2011-11-15 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The HTMLParser is not suitable for validation, even the strict mode allows some 
non valid markup (and it might be removed soon).
Also I don't think it's easy to call a self.warnings() without trying the 
strict mode first.  The tolerant parsing just allow more things, without making 
any distinction between valid and not.

--
nosy: +ezio.melotti

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-12-03 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

A note for the curious: I changed the keyword name from 'tolerant' to 'strict' 
because the stdlib has other examples of 'strict' as a keyword, but the word 
'tolerant' appears nowhere in the documentation and certainly not as a keyword. 
 So it seemed better to remain consistent with existing practice.  This would 
be even better if the default value of 'strict' was False, but unfortunately we 
can't do that.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-12-02 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I have committed a version of this patch, without the warnings, using the 
keyword 'strict=True' as the default, and with a couple added heuristics from 
other similar issues, in r86952.

kxroberto, if you want to supply your full name, I'll add you to Misc/ACKS.  
Thanks for the original patch.

--
nosy:  -BreamoreBoy
resolution:  - accepted
stage: unit test needed - committed/rejected
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-11-20 Thread Neil Muller

Neil Muller drnlmuller+b...@gmail.com added the comment:

#975556 and #1046092 look like they should also be superseded by this.

--
nosy: +Neil Muller

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-09-04 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

See also issue 1058305, which may be a duplicate.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-27 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

For anyone who does want to work on this (and I do, but it will be quite a 
while before I can) see also issue 6191.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-26 Thread kxroberto

kxroberto kxrobe...@users.sourceforge.net added the comment:

I'm not working with Py3. don't how much that module is different in 3.
unless its going into a py2 version, I'll leave the FR so far to the py3 
community

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread kxroberto

kxroberto kxrobe...@users.sourceforge.net added the comment:

for me a parser which cannot be feed with HTML from outside (which I cannot 
edit myself) has not much use at all.
attached my current patch (vs. py26) - many changes meanwhile.
and a test case.

I've put the default to strict mode, but ...

--
Added file: http://bugs.python.org/file18623/HTMLParser_tolerant_py26.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread kxroberto

Changes by kxroberto kxrobe...@users.sourceforge.net:


--
versions: +Python 2.6, Python 2.7
Added file: http://bugs.python.org/file18624/test_htmlparser_tolerant.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

2.6 is now in security-fix-only mode.  Since this is a new feature, it can only 
go into 3.2.

Can you provide a patch against py3k trunk?

I've only glanced at the patch briefly, but one thing that concerns me is 
'warning file'.  I suppose that either the logging module or perhaps the 
warnings module should be used instead.  We should look at how other stdlib 
modules handle this kind of thing.  Or perhaps warnings shouldn't be generated 
at all, since the default will be strict and therefore the programmer has 
consciously selected tolerant mode.

One stdlib model we could follow is the model of the email module: have a 
'defects' attribute that collects the errors.  email6, by the way, is going to 
have both 'tolerant' and 'strict' modes, and in that case the default is 
tolerant (and always has been) in respect for Postel's law, which is enshrined 
in the email RFCs.  If the HTTP standards have a similar recommendation to 
accept dirty input when possible, we could make an argument for changing 
HTMLParser's default to tolerant.

--
versions:  -Python 2.6, Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-24 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

I agree that a tolerant mode would be good (and often requested). String 
encoding and decoding also have strict and forgiving modes, so this seems close 
to a policy.

Unit tests with example snippets that properly fail strict mode and pass the 
new tolerant mode are needed both to completely review and apply a patch.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-22 Thread Mark Lawrence

Mark Lawrence breamore...@yahoo.co.uk added the comment:

I think this should be closed as have other similar requests in the last few 
days.

--
nosy: +BreamoreBoy, fdrake

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-22 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I disagree (and might disagree with those other closings but I haven't noticed 
them I guess).  BeautifulSoup does *not* cover this ground, it is broken in 3.x 
because of the lack of a tolerant HTML parser in the stdlib (it used to use 
sgmlib, which is now gone).  BeautifulSoup would probably very much like to 
have this tolerant mode.

It probably shouldn't be the default, though, for backward compatibility 
reasons :(

--
nosy: +orsenthil, r.david.murray

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2010-08-08 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

This needs to be checked for applicability to 3.x.
Do beautifulsoup and other programs cover this ground (tolerant parsing of junk 
html)?

--
nosy: +terry.reedy
versions: +Python 3.2 -Python 2.7, Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1486713] HTMLParser : A auto-tolerant parsing mode

2009-03-20 Thread Daniel Diniz

Changes by Daniel Diniz aja...@gmail.com:


--
stage:  - test needed
type:  - feature request
versions: +Python 2.7, Python 3.1 -Python 2.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1486713
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com