On Fri, Jan 22, 2016 at 8:01 AM, inhahe <inh...@gmail.com> wrote:
> Say I have the following HTML (I hope this shows up as plain text here
> rather than formatting):
>
> <div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is
> today the day?"</strong></em></span></div>
>
> And I want to extract the "Is today the day?" part. There are other places
> in the document with <em> and <strong>, but this is the only place that
> uses color #000000, so I want to extract anything that's within a color
> #000000 style, even if it's nested multiple levels deep within that.
>
> - Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's defined
> as #000000
> - Sometimes the <strong> is within the <em> and sometimes the <em> is
> within the <strong>.
> - There may be other discrepancies I haven't noticed yet
>
> How can I do this in BeautifulSoup (or is this better done in lxml.html)?

This may help you get started:

from itertools import chain

from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 warns and picks whichever
# parser happens to be installed.
soup = BeautifulSoup('''\
<div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is today the day?"</strong></em></span></div>
<div style="font-size: 20pt;"><span style="color: RGB(0, 0, 0);"><strong><em>"Is tomorrow the day?"</em></strong></span></div>''',
                     'html.parser')

# We're going to get all the tags that specify the color, either using hex
# or RGB. Note that a string passed as style= has to equal the whole
# attribute value, so these two calls only match tags whose style is
# exactly one of these strings.
# If you only want to get the span tags, just give the positional argument
# 'span' to find_all:
# for tag in chain(soup.find_all('span', style='color: #000000;'),
#                  soup.find_all('span', style='color: RGB(0, 0, 0);')):
for tag in chain(soup.find_all(style='color: #000000;'),
                 soup.find_all(style='color: RGB(0, 0, 0);')):
    try:
        # <em> wrapping <strong>
        print(tag.em.strong.text)
    except AttributeError:
        try:
            # <strong> wrapping <em>
            print(tag.strong.em.text)
        except AttributeError:
            print('ooooooh nooooo no text')

Cody
--
https://mail.python.org/mailman/listinfo/python-list
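
Since the original question asks for anything nested inside the black-colored tag, however deep and in whatever order, here is one possible variation. It is only a sketch under a few assumptions: the color always appears in an inline style attribute, it is always written as either #000000 or rgb(0, 0, 0) (the shorthand #000 and named colors are not handled), and black-styled tags are not nested inside each other (otherwise the inner text would be reported twice). The names BLACK, SAMPLE and black_text are made up for this example. It passes a compiled regex as the style filter, so extra declarations, spacing and letter case in the attribute do not matter, and it uses get_text() so the em/strong order does not matter either.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical pattern for "black": color: #000000 or color: rgb(0, 0, 0),
# with flexible spacing and case. The lookbehind stops it from matching
# "background-color: #000000".
BLACK = re.compile(
    r'(?<![-\w])color\s*:\s*(?:#000000|rgb\(\s*0\s*,\s*0\s*,\s*0\s*\))',
    re.IGNORECASE)

SAMPLE = '''\
<div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is today the day?"</strong></em></span></div>
<div><span style="font-size: 12pt; color: RGB(0, 0, 0);"><strong><em>"Is tomorrow the day?"</em></strong></span></div>
<p style="color: #ff0000;"><em>not this</em></p>
<p style="background-color: #000000;">nor this</p>'''


def black_text(html):
    """Return the text of every tag whose inline style sets color to black."""
    soup = BeautifulSoup(html, 'html.parser')
    # A compiled regex given as an attribute filter is searched within the
    # attribute value, so the style may contain other declarations too.
    return [tag.get_text(strip=True) for tag in soup.find_all(style=BLACK)]


if __name__ == '__main__':
    for text in black_text(SAMPLE):
        print(text)
```

If you need to restrict this to span tags, find_all('span', style=BLACK) should work the same way as in the code above.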