Re: Removing an attribute from html with Regex

Stefan Behnel Thu, 30 Dec 2010 01:00:21 -0800

Selvam, 30.12.2010 08:30:

I have some HTML string which I would like to feed to BeautifulSoup.


But one malformed attribute breaks BeautifulSoup.

     <p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para'  '
  class='terp_header'>  My String</p>

Didn't try with BS (and you forgot to say what "breaks" means exactly in your case), but it parses in a somewhat reasonable way with lxml:


  Python 3.2b2 (py3k:87572, Dec 29 2010, 21:25:38)
  [GCC 4.4.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import lxml.html as H
  >>> doc = H.fromstring('''
  ... <p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para'  '
  ...  class='terp_header'> My String</p>
  ... ''')
  >>> H.tostring(doc)
  b'<p style="terp_header" wrong_tag=" text1 " text2 and \
    class="terp_header"> My String</p>'
  >>> doc.attrib
  {'text2': '', 'and': '', 'style': 'terp_header', \
   'wrong_tag': ' text1 ', 'class': 'terp_header'}

I would like it to replace all the occurrences of that attribute with an
empty string.

I am unable to figure out the exact regex which can do this job.

This is what I have managed so far:

m = re.compile("rml_except='([^']*)")


I assume "rml_except" is the real name of the attribute?

You may be able to do this with a look-ahead expression, e.g.:

  import re

  # Match the broken attribute, stopping before ">" or the next attribute.
  replace = re.compile(r'(wrong_tag\s*=\s*[^>=]*)(?=>|\s+\w+\s*=)').sub

  html_data = replace('', html_data)

The trick is to match everything up to the next character that looks reasonable again, i.e. a closing tag character (">") or another attribute.
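
To make the look-ahead concrete, here is a small self-contained sketch that runs the expression against the sample tag from the question. The names `strip_wrong_tag` and `cleaned` are only illustrative, and the pattern assumes the broken value itself contains neither "=" nor ">", since `[^>=]*` cannot match across them.

```python
import re

# Sample input from the question: "wrong_tag" has a value with stray quotes.
html_data = (
    "<p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para'  '\n"
    "  class='terp_header'>  My String</p>"
)

strip_wrong_tag = re.compile(
    r"(wrong_tag\s*=\s*[^>=]*)"  # attribute name, "=", and the broken value
    r"(?=>|\s+\w+\s*=)"          # look-ahead: stop before ">" or the next name=
).sub

# The look-ahead is not consumed, so the following attribute (or ">") survives.
cleaned = strip_wrong_tag('', html_data)
print(cleaned)
```

Because the greedy `[^>=]*` first runs up to the "=" after `class` and then backtracks until the look-ahead succeeds, the well-formed `class` attribute is left in place. Some leftover whitespace may remain where the attribute used to be, which HTML parsers generally ignore.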


Stefan

--
http://mail.python.org/mailman/listinfo/python-list
