Re: html parser, unexpected '<' char in declaration
"Jesus Rivero - (Neurogeek)" [EMAIL PROTECTED] wrote:

> Hmmm, that's kind of a different issue then. I can guess, from the
> error you pasted earlier, that the problem shown is due to the fact
> that Python is interpreting a quote character as an expression and
> not as a char. Review your code or try to figure out the exact input
> you're receiving within the MTA.

Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem was in his original message. The HTML he is being given is ill-formed; the <!DOCTYPE> directive is not closed. The SGML parser finds an <html> tag which it thinks is inside the <!DOCTYPE>, and that's illegal.

>> Well, probably I should explain more. This is part of an email.
>> After the MTA delivers the email, it is stored in a local dir. After
>> that, the email is parsed by the parser inside a web-based IMAP
>> client at display time. I don't think I have the choice of rewriting
>> the message, and I don't want to reject the message altogether.
>> I can either
>> 1 - fix the incoming html by tidying it up, or
>> 2 - strip only the plain text out and display that you have spam, or
>> 3 - ignore that malformed tag and display the rest.

If this is happening with more than one message, you could check for it rather easily with a regular expression, or even just str.find, and then either insert a closing '>' or delete everything up to the <html> before parsing it.

--
- Tim Roberts, [EMAIL PROTECTED]
  Providenza & Boekelheide, Inc.
--
http://mail.python.org/mailman/listinfo/python-list
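The two repairs suggested above (insert the missing '>', or delete everything before the <html> tag) could look roughly like the sketch below. It is only an illustration written for Python 3, whereas the thread uses Python 2.4. The function names `close_unterminated_doctype` and `drop_prolog` and the sample input are made up for this example and are not part of htmllib or sgmllib. It also assumes the DOCTYPE has no internal subset and no '<' or '>' inside its quoted strings.

```python
import re

# An unclosed declaration: "<!DOCTYPE", then a run of characters that are
# neither '<' nor '>', followed directly by the '<' that opens the next tag.
_UNCLOSED_DOCTYPE = re.compile(r"<!DOCTYPE[^<>]*(?=<)", re.IGNORECASE)

# The start of the <html> tag, in any letter case.
_HTML_TAG = re.compile(r"<html\b", re.IGNORECASE)


def close_unterminated_doctype(html):
    """Insert the missing '>' after the first unclosed <!DOCTYPE declaration."""
    return _UNCLOSED_DOCTYPE.sub(
        lambda m: m.group(0).rstrip() + ">", html, count=1
    )


def drop_prolog(html):
    """Delete everything that precedes the first <html> tag, if there is one."""
    match = _HTML_TAG.search(html)
    return html[match.start():] if match else html


# Hypothetical input, modelled on the kind of message discussed in this thread.
broken = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"'
          '<html><head></head><body>Foo foo</body></html>')

repaired = close_unterminated_doctype(broken)  # DOCTYPE now ends with '>'
stripped = drop_prolog(broken)                 # DOCTYPE removed entirely
```

Either result would then be passed to parser.feed() in place of the raw message body. A well-formed DOCTYPE is left untouched by the first function, because the pattern only matches when the declaration runs straight into another '<'.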
Re: html parser, unexpected '<' char in declaration
Thanks for the suggestions. This is not happening frequently; actually, this is the first time I have seen this exception in the system, which means that some spam message was generated with ill-formed HTML. I guess the best way would be to check using a regular expression and delete the unclosed tags.
Re: html parser, unexpected '<' char in declaration
Oops! You are totally right, guys. I did miss the closing '>', thinking about maybe errors in the use of ' or ".

Jesus

Tim Roberts wrote:

> "Jesus Rivero - (Neurogeek)" [EMAIL PROTECTED] wrote:
>
>> Hmmm, that's kind of a different issue then. I can guess, from the
>> error you pasted earlier, that the problem shown is due to the fact
>> that Python is interpreting a quote character as an expression and
>> not as a char. Review your code or try to figure out the exact input
>> you're receiving within the MTA.
>
> Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact
> problem was in his original message. The HTML he is being given is
> ill-formed; the <!DOCTYPE> directive is not closed. The SGML parser
> finds an <html> tag which it thinks is inside the <!DOCTYPE>, and
> that's illegal.
>
>>> Well, probably I should explain more. This is part of an email.
>>> After the MTA delivers the email, it is stored in a local dir. After
>>> that, the email is parsed by the parser inside a web-based IMAP
>>> client at display time. I don't think I have the choice of rewriting
>>> the message, and I don't want to reject the message altogether.
>>> I can either
>>> 1 - fix the incoming html by tidying it up, or
>>> 2 - strip only the plain text out and display that you have spam, or
>>> 3 - ignore that malformed tag and display the rest.
>
> If this is happening with more than one message, you could check for
> it rather easily with a regular expression, or even just str.find, and
> then either insert a closing '>' or delete everything up to the <html>
> before parsing it.
Re: html parser, unexpected '<' char in declaration
Sakcee wrote:

> html = '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"<head></head> <body bgcolor=#ff>\r\n Foo foo , blah blah </body></html>'

    html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
    <html>
    <head>
    </head>
    <body bgcolor=#ff>
    Foo foo , blah blah
    </body>
    </html>"""

Try checking your html code. It looks really messy. The ' char is not for multiple-line strings. You can try the code above. As a suggestion, you should really focus on learning html basics ;)

Regards,
Jesus (Neurogeek)

> import htmllib
> import formatter
> parser = htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(html)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
>     self.error(
>   File "/usr/lib/python2.4/htmllib.py", line 40, in error
>     raise HTMLParseError(message)
> htmllib.HTMLParseError: unexpected '<' char in declaration
>
> The error is generated by the unclosed DOCTYPE declaration.
>
> What is the best way to handle this kind of document? Should I use a
> regex to check and strip, or does HTMLParser offer something? Can I
> override the default sgmllib behaviour?
>
> I have to work with this htmllib because of existing modules.
>
> thanks
Re: html parser, unexpected '<' char in declaration
Thanks for the reply.

Well, probably I should explain more. This is part of an email. After the MTA delivers the email, it is stored in a local dir. After that, the email is parsed by the parser inside a web-based IMAP client at display time. I don't think I have the choice of rewriting the message, and I don't want to reject the message altogether.

I can either
1 - fix the incoming html by tidying it up, or
2 - strip only the plain text out and display that you have spam, or
3 - ignore that malformed tag and display the rest.
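Option 2 above (fall back to plain text when the HTML cannot be parsed) might be sketched as follows. This is a rough illustration in Python 3, not the htmllib-based code from the thread. `plain_text_fallback` and `render` are hypothetical names, and `parse` stands for whatever parsing routine the IMAP client already uses. The tag stripping is a crude regular expression, which is an assumption made here for brevity, not a recommendation for general HTML handling.

```python
import re
from html import unescape  # Python 3 standard library

_TAG = re.compile(r"<[^>]*>")   # anything from '<' up to the next '>'
_SPACE = re.compile(r"\s+")     # runs of whitespace, including \r\n


def plain_text_fallback(html):
    """Crudely strip markup; meant only for messages that failed to parse."""
    text = _TAG.sub(" ", html)
    return _SPACE.sub(" ", unescape(text)).strip()


def render(html, parse, parse_errors=(Exception,)):
    """Try the normal parser first; on a parse error, return plain text."""
    try:
        return parse(html)
    except parse_errors:
        return plain_text_fallback(html)
```

With the parser from this thread, `parse_errors` would presumably be narrowed to the parser's own exception class (htmllib.HTMLParseError in the traceback). Because entities are unescaped, the fallback text has to be escaped again before it is placed into a web page.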
Re: html parser, unexpected '<' char in declaration
Hmmm, that's kind of a different issue then. I can guess, from the error you pasted earlier, that the problem shown is due to the fact that Python is interpreting a quote character as an expression and not as a char. Review your code or try to figure out the exact input you're receiving within the MTA.

Regards,
Jesus (Neurogeek)

Sakcee wrote:

> Thanks for the reply.
>
> Well, probably I should explain more. This is part of an email.
> After the MTA delivers the email, it is stored in a local dir. After
> that, the email is parsed by the parser inside a web-based IMAP
> client at display time. I don't think I have the choice of rewriting
> the message, and I don't want to reject the message altogether.
>
> I can either
> 1 - fix the incoming html by tidying it up, or
> 2 - strip only the plain text out and display that you have spam, or
> 3 - ignore that malformed tag and display the rest.