Re: html parser , unexpected '<' char in declaration

2006-02-21 Thread Tim Roberts
Jesus Rivero - (Neurogeek) [EMAIL PROTECTED] wrote:

> hmmm, that's kind of a different issue then.
>
> I can guess, from the error you pasted earlier, that the problem shown
> is due to the fact that Python is interpreting a < as an expression and
> not as a char. Review your code or try to figure out the exact input
> you're receiving within the MTA.

Well, Jesus, you are 0 for 2.  Sakcee pointed out what the exact problem
was in his original message.  The HTML he is being given is ill-formed; the
<!DOCTYPE> directive is not closed.  The SGML parser finds a <html> tag
which it thinks is inside the <!DOCTYPE>, and that's illegal.

>> Well, I should probably explain more. This is part of an email. After
>> the MTA delivers the email, it is stored in a local dir.
>> After that, the email is parsed by the parser inside a web-based
>> IMAP client at display time.
>>
>> I don't think I have the choice of rewriting the message, and I don't
>> want to reject the message altogether.
>>
>> I can either:
>> 1 - fix the incoming HTML by tidying it up,
>> 2 - strip out only the plain text and display that you have spam, or
>> 3 - ignore the malformed tag and display the rest.

If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
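
A minimal sketch of the "insert a closing '>'" variant described above. It
uses a small regular expression to locate the DOCTYPE and str.find for the
rest. The function name close_unterminated_doctype is hypothetical (it is
not part of htmllib or sgmllib), and the sketch assumes the declaration has
no '>' inside a quoted string and no internal subset in square brackets.

```python
import re

# Case-insensitive match for the start of a DOCTYPE declaration.
DOCTYPE_START = re.compile(r"<!doctype", re.IGNORECASE)


def close_unterminated_doctype(html):
    """Return html with a '>' inserted if its DOCTYPE is never closed.

    Hypothetical helper: the DOCTYPE counts as unclosed when the next '<'
    appears before the next '>'.
    """
    match = DOCTYPE_START.search(html)
    if match is None:
        return html  # no DOCTYPE at all, nothing to repair

    next_open = html.find("<", match.end())   # start of the following tag
    next_close = html.find(">", match.end())  # candidate end of the DOCTYPE

    if next_open == -1:
        return html  # nothing follows the DOCTYPE; leave the input alone
    if next_close != -1 and next_close < next_open:
        return html  # the declaration is already closed

    # Close the declaration right before the tag that follows it.
    return html[:next_open] + ">" + html[next_open:]
```

The repaired string would then be passed to parser.feed() in place of the
raw message body.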
-- 
- Tim Roberts, [EMAIL PROTECTED]
  Providenza & Boekelheide, Inc.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: html parser , unexpected '<' char in declaration

2006-02-21 Thread Sakcee
Thanks for the suggestions.

This is not happening frequently. Actually, this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formed HTML.
I guess the best way would be to check using a regular expression and
delete the unclosed tags.
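
A rough sketch of the regex-and-delete approach, limited to the one case
seen in this thread (an unclosed DOCTYPE), not unclosed tags in general.
The names UNCLOSED_DOCTYPE and strip_unclosed_doctype are hypothetical.

```python
import re

# "<!DOCTYPE" followed by any run of characters that are neither '<' nor '>',
# and then a '<'. The lookahead keeps the following tag out of the match, so
# only the unclosed declaration itself is removed. A properly closed DOCTYPE
# hits its '>' first and therefore never matches.
UNCLOSED_DOCTYPE = re.compile(r"<!DOCTYPE[^<>]*(?=<)", re.IGNORECASE)


def strip_unclosed_doctype(html):
    """Delete an unclosed DOCTYPE declaration; leave everything else as is."""
    return UNCLOSED_DOCTYPE.sub("", html)
```

Deleting the declaration loses nothing the display code is likely to need,
since the DOCTYPE carries no visible content.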



Re: html parser , unexpected '<' char in declaration

2006-02-21 Thread Jesus Rivero (Neurogeek)
Oops!

You are totally right, guys. I did miss the closing '>', thinking about
possible errors in the use of ' or " instead.

Jesus



html parser , unexpected '<' char in declaration

2006-02-20 Thread Sakcee
html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ff>\r\n Foo foo , blah blah
</body></html>'

>>> import htmllib
>>> import formatter
>>> parser=htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(html)

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
    self.error(
  File "/usr/lib/python2.4/htmllib.py", line 40, in error
    raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration


The error is caused by the unclosed DOCTYPE declaration.

What is the best way to handle this kind of document? Should I use a
regex to check and strip, or does HTMLParser offer something? Can I
override the default sgmllib behaviour?
I have to work with htmllib because of existing modules.
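
One way to keep htmllib and still survive such input, sketched under
assumptions: catch the parse error, repair the text, and feed it to a fresh
parser. The helper feed_with_fallback and its parameters are hypothetical;
only the exception name htmllib.HTMLParseError comes from the traceback
above. The sketch is parser-agnostic so it does not depend on a particular
Python version.

```python
def feed_with_fallback(make_parser, html, repair, parse_errors=(Exception,)):
    """Feed html to a new parser; on a parse error, repair and retry once.

    make_parser  -- zero-argument callable returning an object with .feed()
    repair       -- callable taking the raw HTML and returning a fixed copy
    parse_errors -- exception classes that signal malformed input
    (All three are assumptions of this sketch, not an htmllib API.)
    """
    parser = make_parser()
    try:
        parser.feed(html)
    except parse_errors:
        # The failed parser may hold partial state, so start from scratch.
        parser = make_parser()
        parser.feed(repair(html))
    return parser


# With the Python 2.4 modules from the traceback, a call might look like:
#   feed_with_fallback(
#       lambda: htmllib.HTMLParser(formatter.NullFormatter()),
#       html,
#       my_repair_function,            # hypothetical repair callable
#       (htmllib.HTMLParseError,),
#   )
```

A second failure after the repair still propagates, which leaves room for
the caller to fall back to showing plain text only.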


thanks



Re: html parser , unexpected '<' char in declaration

2006-02-20 Thread Jesus Rivero - (Neurogeek)

Sakcee wrote:
> html =
> '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> <head></head> <body bgcolor=#ff>\r\n Foo foo , blah blah
> </body></html>'
 
 

html = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<html>
 <head>
 </head>
 <body bgcolor=#ff>
Foo foo , blah blah
 </body>
</html>
"""


Try checking your HTML code. It looks really messy. The ' char is not for
multi-line strings. You can try the code above.

As a suggestion, you should really focus on learning HTML basics ;)

Regards

Jesus (Neurogeek)




Re: html parser , unexpected '<' char in declaration

2006-02-20 Thread Sakcee
Thanks for the reply.

Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local dir.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.

I don't think I have the choice of rewriting the message, and I don't
want to reject the message altogether.

I can either:
1 - fix the incoming HTML by tidying it up,
2 - strip out only the plain text and display that you have spam, or
3 - ignore the malformed tag and display the rest.
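
Option 2 could be approximated without any parser at all. The sketch below
is a crude tag stripper, assuming it is acceptable to lose markup entirely
and to leave character entities undecoded. The names TAG and
crude_plain_text are hypothetical, and this is not an HTML sanitizer.

```python
import re

# Anything from '<' up to the next '>'. An unclosed declaration is swallowed
# together with the tag that follows it, which is fine for text extraction.
TAG = re.compile(r"<[^>]*>")


def crude_plain_text(html):
    """Replace every tag with a space and collapse runs of whitespace."""
    text = TAG.sub(" ", html)
    return " ".join(text.split())
```

Because it never raises on malformed markup, a function like this could
serve as the last resort when the real parser rejects a message.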



Re: html parser , unexpected '<' char in declaration

2006-02-20 Thread Jesus Rivero - (Neurogeek)

hmmm, that's kind of a different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to the fact that Python is interpreting a < as an expression and
not as a char. Review your code or try to figure out the exact input
you're receiving within the MTA.


Regards,

 Jesus (Neurogeek)

