Re: Html Parsing stuff
Don't worry, it has been solved. -- https://mail.python.org/mailman/listinfo/python-list
Html Parsing stuff
OK, I get the basics of this: I have been doing some successful parsing, using regular expressions to find HTML tags. I have tried to find an img tag and write that image to a file, with no success. A try...except statement reports that the image was written to the file successfully, but when I try to open the file, it says the image has not been saved correctly or is damaged. At first I was just reading the src attribute of the tag and saving that link text to a .jpg (the extension of the image). Then I looked deeper: I prepended the site URL and a forward slash to the image's src attribute, opened that link with urllib.urlopen(), read the contents, and saved those to the file. I still got the same result as before. Is there a function in Beautiful Soup or the urllib module that I can use to save an image? This is just a problem I am sorting out, not a whole application, so the code is small. Thanks -- https://mail.python.org/mailman/listinfo/python-list
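For later readers: the usual cause of a "damaged" image here is writing the URL string (or text-mode data) into the .jpg instead of the downloaded bytes. The thread predates Python 3, but a minimal sketch of the fix with today's urllib.request — the function name and URL are illustrative, not from the original post — looks like this:

```python
import urllib.request

def save_image(src, filename):
    """Fetch the bytes at `src` and write them out in binary mode."""
    data = urllib.request.urlopen(src).read()  # the image bytes, not the URL text
    with open(filename, "wb") as f:            # "wb": binary mode, no newline translation
        f.write(data)
```

The two things to check are that read() is called on the opened URL (so real bytes are saved, not the src string) and that the output file is opened in binary mode.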
Re: Beautifulsoup html parsing - nested tags
On Wed, Jan 5, 2011 at 2:58 PM, Selvam wrote:
> Hi all,
>
> I am trying to parse some html string with BeautifulSoup.
>
> The string is,
>
> Tax Base Amount
>
> rtables=soup.findAll(re.compile('table$'))
>
> The rtables is,
>
> [ Tax Base Amount , ]
>
> The tr inside the blocktable are appearing inside the table, while
> blocktable contains nothing.
>
> Is there any way I can get the tr in the right place (inside blocktable)?

Replying to myself:

BeautifulSoup.BeautifulSoup.NESTABLE_TABLE_TAGS['tr'].append('blocktable')

Adding this solved the issue.

--
Regards,
S.Selvam
SG E-ndicus Infotech Pvt Ltd.
http://e-ndicus.com/
" I am because we are "
-- http://mail.python.org/mailman/listinfo/python-list
Beautifulsoup html parsing - nested tags
Hi all, I am trying to parse some html string with BeautifulSoup. The string is, Tax Base Amount rtables=soup.findAll(re.compile('table$')) The rtables is, [ Tax Base Amount , ] The tr inside the blocktable are appearing inside the table, while blocktable contains nothing. Is there any way I can get the tr in the right place (inside blocktable)? -- Regards, S.Selvam SG E-ndicus Infotech Pvt Ltd. http://e-ndicus.com/ " I am because we are " -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
[EMAIL PROTECTED] wrote:
> Hi everyone
> I am trying to build my own web crawler for an experiment and I don't
> know how to access HTTP protocol with python.
> Also, are there any open-source parsing engines for HTML documents
> available in Python too? That would be great.

Check out Mechanize. It wraps Beautiful Soup inside of methods that aid in website crawling. http://pypi.python.org/pypi/mechanize/0.1.7b

-Larry
-- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
Stefan Behnel <[EMAIL PROTECTED]>: > [EMAIL PROTECTED] wrote: >> I am trying to build my own web crawler for an experiement and I don't >> know how to access HTTP protocol with python. >> >> Also, Are there any Opensource Parsing engine for HTML documents >> available in Python too? That would be great. > > Try lxml.html. It parses broken HTML, supports HTTP, is much faster than > BeautifulSoup and threadable, all of which should be helpful for your > crawler. You should mention its powerful features like XPATH and CSS selection support and its easy API here, too ;) -- Freedom is always the freedom of dissenters. (Rosa Luxemburg) -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
[EMAIL PROTECTED] wrote: > I am trying to build my own web crawler for an experiement and I don't > know how to access HTTP protocol with python. > > Also, Are there any Opensource Parsing engine for HTML documents > available in Python too? That would be great. Try lxml.html. It parses broken HTML, supports HTTP, is much faster than BeautifulSoup and threadable, all of which should be helpful for your crawler. http://codespeak.net/lxml/ Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
> Hi everyone Hello > I am trying to build my own web crawler for an experiement and I don't > know how to access HTTP protocol with python. urllib2: http://docs.python.org/lib/module-urllib2.html > Also, Are there any Opensource Parsing engine for HTML documents > available in Python too? That would be great. BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ http://www.crummy.com/software/BeautifulSoup/documentation.html All the best -- NOAGBODJI Paul Victor -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
On Jun 28, 9:03 pm, [EMAIL PROTECTED] wrote: > Hi everyone > I am trying to build my own web crawler for an experiement and I don't > know how to access HTTP protocol with python. Look at the httplib module. > > Also, Are there any Opensource Parsing engine for HTML documents > available in Python too? That would be great. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
On Sat, 28 Jun 2008 19:03:39 -0700, disappearedng wrote: > Hi everyone > I am trying to build my own web crawler for an experiement and I don't > know how to access HTTP protocol with python. > > Also, Are there any Opensource Parsing engine for HTML documents > available in Python too? That would be great. Check out BeautifulSoup. I don't recall what license it uses, but the source is available, and it deals well with not-necessarily-beautiful-inside HTML. -- http://mail.python.org/mailman/listinfo/python-list
HTML Parsing
Hi everyone, I am trying to build my own web crawler for an experiment and I don't know how to access the HTTP protocol with Python. Also, are there any open-source parsing engines for HTML documents available in Python? That would be great. -- http://mail.python.org/mailman/listinfo/python-list
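Both halves of the question — fetching pages and pulling links out of the HTML — can be done with the standard library alone. A minimal, illustrative sketch (shown with Python 3 module names, which postdate this thread; the sample markup is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, crawler-style."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
# In a real crawler the string below would come from
# urllib.request.urlopen(url).read().decode(...).
parser.feed('<p>See <a href="http://example.com/">the example</a>.</p>')
```

After feed(), parser.links holds every href seen, ready to be queued for the next crawl step.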
Re: HTML parsing confusion
En Wed, 23 Jan 2008 10:40:14 -0200, Alnilam <[EMAIL PROTECTED]> escribió:
> Skipping past html validation, and html to xhtml 'cleaning', and
> instead starting with the assumption that I have files that are valid
> XHTML, can anyone give me a good example of how I would use _ htmllib,
> HTMLParser, or ElementTree _ to parse out the text of one specific
> childNode, similar to the examples that I provided above using regex?

The diveintopython page is not valid XHTML (but it's valid HTML). Assuming it's properly converted:

py> from cStringIO import StringIO
py> import xml.etree.ElementTree as ET
py> tree = ET.parse(StringIO(page))
py> elem = tree.findall('//p')[4]
py>
py> # from the online ElementTree docs
py> # http://www.effbot.org/zone/element-bits-and-pieces.htm
py> def gettext(elem):
...     text = elem.text or ""
...     for e in elem:
...         text += gettext(e)
...         if e.tail:
...             text += e.tail
...     return text
...
py> print gettext(elem)
The complete text is available online. You can read the revision history to see what's new. Updated 20 May 2004

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
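A side note for later readers: newer ElementTree versions (Python 2.7+/3.x, after this thread) ship itertext(), which performs the same text/tail walk as a hand-written recursive gettext(). A small sketch on an invented XHTML fragment:

```python
import xml.etree.ElementTree as ET

# A stand-in for one <p> element of a parsed XHTML page (made-up content).
elem = ET.fromstring('<p>The complete <a href="#">text</a> is available online.</p>')

# itertext() yields every piece of text in document order,
# covering both .text and the .tail of nested elements.
text = "".join(elem.itertext())
```

This removes the need to hand-roll the recursion when a modern ElementTree is available.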
Re: HTML parsing confusion
On Jan 23, 2008 7:40 AM, Alnilam <[EMAIL PROTECTED]> wrote: > Skipping past html validation, and html to xhtml 'cleaning', and > instead starting with the assumption that I have files that are valid > XHTML, can anyone give me a good example of how I would use _ htmllib, > HTMLParser, or ElementTree _ to parse out the text of one specific > childNode, similar to the examples that I provided above using regex? Have you looked at any of the tutorials or sample code for these libraries? If you have a specific question, you will probably get more specific help. I started writing up some sample code, but realized I was mostly reprising the long tutorial on sgmllib here: http://www.boddie.org.uk/python/HTML.html -- Jerry -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On Jan 23, 3:54 am, "M.-A. Lemburg" <[EMAIL PROTECTED]> wrote:
> >> I was asking this community if there was a simple way to use only the
> >> tools included with Python to parse a bit of html.
>
> There are lots of ways of doing HTML parsing in Python. A common
> one is e.g. using mxTidy to convert the HTML into valid XHTML
> and then use ElementTree to parse the data.
>
> http://www.egenix.com/files/python/mxTidy.html
> http://docs.python.org/lib/module-xml.etree.ElementTree.html
>
> For simple tasks you can also use the HTMLParser that's part
> of the Python std lib.
>
> http://docs.python.org/lib/module-HTMLParser.html
>
> Which tools to use is really dependent on what you are
> trying to solve.
>
> -- Marc-Andre Lemburg
> eGenix.com
> Professional Python Services directly from the Source (#1, Jan 23 2008)
> >>> Python/Zope Consulting and Support ... http://www.egenix.com/
> >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
> >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
> Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free !
> eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
> Registered at Amtsgericht Duesseldorf: HRB 46611

Thanks. So far that makes 3 votes for BeautifulSoup, and one vote each for libxml2dom, pyparsing, and mxTidy. I'm sure those would all be great solutions, if I were looking to solve my coding question with external modules. Several folks have now mentioned that if I have files that are valid XHTML, I could use htmllib, HTMLParser, or ElementTree (all of which are part of the standard library in v2.5).
Skipping past html validation, and html to xhtml 'cleaning', and instead starting with the assumption that I have files that are valid XHTML, can anyone give me a good example of how I would use _ htmllib, HTMLParser, or ElementTree _ to parse out the text of one specific childNode, similar to the examples that I provided above using regex? -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
> The pages I'm trying to write this code to run against aren't in the > wild, though. They are static html files on my company's lan, are very > consistent in format, and are (I believe) valid html. Obvious way to check this is to go to http://validator.w3.org/ and see what it tells you about your html... -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On 2008-01-23 01:29, Gabriel Genellina wrote: > En Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <[EMAIL PROTECTED]> escribió: > >> On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote: >>> Alnilam wrote: >>>> On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote: >>>>>> Pardon me, but the standard issue Python 2.n (for n in range(5, 2, >>>>>> -1)) doesn't have an xml.dom.ext ... you must have the >>> mega-monstrous >>>>>> 200-modules PyXML package installed. And you don't want the 75Kb >>>>>> BeautifulSoup? >>>> Ugh. Found it. Sorry about that, but I still don't understand why >>>> there isn't a simple way to do this without using PyXML, BeautifulSoup >>>> or libxml2dom. What's the point in having sgmllib, htmllib, >>>> HTMLParser, and formatter all built in if I have to use use someone >>>> else's modules to write a couple of lines of code that achieve the >>>> simple thing I want. I get the feeling that this would be easier if I >>>> just broke down and wrote a couple of regular expressions, but it >>>> hardly seems a 'pythonic' way of going about things. >>> This is simply a gross misunderstanding of what BeautifulSoup or lxml >>> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_ >>> sense is by no means trivial. And just because you can come up with a >>> few >>> lines of code using rexes that work for your current use-case doesn't >>> mean >>> that they serve as general html-fixing-routine. Or do you think the >>> rather >>> long history and 75Kb of code for BS are because it's creator wasn't >>> aware >>> of rexes? >> I am, by no means, trying to trivialize the work that goes into >> creating the numerous modules out there. However as a relatively >> novice programmer trying to figure out something, the fact that these >> modules are pushed on people with such zealous devotion that you take >> offense at my desire to not use them gives me a bit of pause. 
>> I use non-included modules for tasks that require them, when the
>> capability to do something clearly can't be done easily another way
>> (eg. MySQLdb). I am sure that there will be plenty of times where I
>> will use BeautifulSoup. In this instance, however, I was trying to
>> solve a specific problem which I attempted to lay out clearly from
>> the outset.
>>
>> I was asking this community if there was a simple way to use only the
>> tools included with Python to parse a bit of html.

There are lots of ways of doing HTML parsing in Python. A common one is e.g. using mxTidy to convert the HTML into valid XHTML and then use ElementTree to parse the data.

http://www.egenix.com/files/python/mxTidy.html
http://docs.python.org/lib/module-xml.etree.ElementTree.html

For simple tasks you can also use the HTMLParser that's part of the Python std lib.

http://docs.python.org/lib/module-HTMLParser.html

Which tools to use is really dependent on what you are trying to solve.

-- Marc-Andre Lemburg, eGenix.com
-- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On Jan 22, 7:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote: > > > I was asking this community if there was a simple way to use only the > > tools included with Python to parse a bit of html. > > If you *know* that your document is valid HTML, you can use the HTMLParser > module in the standard Python library. Or even the parser in the htmllib > module. But a lot of HTML pages out there are invalid, some are grossly > invalid, and those parsers are just unable to handle them. This is why > modules like BeautifulSoup exist: they contain a lot of heuristics and > trial-and-error and personal experience from the developers, in order to > guess more or less what the page author intended to write and make some > sense of that "tag soup". > A guesswork like that is not suitable for the std lib ("Errors should > never pass silently" and "In the face of ambiguity, refuse the temptation > to guess.") but makes a perfect 3rd party module. > > If you want to use regular expressions, and that works OK for the > documents you are handling now, fine. But don't complain when your RE's > match too much or too little or don't match at all because of unclosed > tags, improperly nested tags, nonsense markup, or just a valid combination > that you didn't take into account. > > -- > Gabriel Genellina Thanks, Gabriel. That does make sense, both what the benefits of BeautifulSoup are and why it probably won't become std lib anytime soon. The pages I'm trying to write this code to run against aren't in the wild, though. They are static html files on my company's lan, are very consistent in format, and are (I believe) valid html. They just have specific paragraphs of useful information, located in the same place in each file, that I want to 'harvest' and put to better use. I used diveintopython.org as an example only (and in part because it had good clean html formatting). 
I am pretty sure that I could craft some regular expressions to do the work -- which of course would not be the case if I were screen-scraping web pages in the 'wild' -- but I was trying to find a way to do that using one of those std libs you mentioned. I'm not sure whether HTMLParser or htmllib would work better to achieve the same effect as the regex example I gave above, or how to get them to do that. I thought I'd come close, but as someone pointed out early on, I'd accidentally tapped into PyXML, which is installed where I was testing code but not necessarily where I need it. It may turn out that the regex way works faster, but falling back on methods I'm comfortable with doesn't help expand my Python knowledge. So if anyone can tell me how to get HTMLParser or htmllib to grab a specific paragraph, and then provide the text in that paragraph in a clean, markup-free format, I'd appreciate it. -- http://mail.python.org/mailman/listinfo/python-list
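One way to do exactly that with nothing but the standard library is to buffer character data while inside a <p> element. A minimal sketch (shown with Python 3's html.parser, the successor of the HTMLParser module discussed here; the sample markup is made up):

```python
from html.parser import HTMLParser

class ParagraphGrabber(HTMLParser):
    """Collect the markup-free text of each <p> element."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            # collapse runs of whitespace into single spaces
            self.paragraphs.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

grabber = ParagraphGrabber()
grabber.feed('<p>First.</p><p>Translations are <em>freely</em>\n  permitted.</p>')
```

After feed(), grabber.paragraphs holds one clean string per paragraph, so a specific paragraph is just an index away.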
Re: HTML parsing confusion
On Jan 22, 7:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote: > > > I was asking this community if there was a simple way to use only the > > tools included with Python to parse a bit of html. > > If you *know* that your document is valid HTML, you can use the HTMLParser > module in the standard Python library. Or even the parser in the htmllib > module. But a lot of HTML pages out there are invalid, some are grossly > invalid, and those parsers are just unable to handle them. This is why > modules like BeautifulSoup exist: they contain a lot of heuristics and > trial-and-error and personal experience from the developers, in order to > guess more or less what the page author intended to write and make some > sense of that "tag soup". > A guesswork like that is not suitable for the std lib ("Errors should > never pass silently" and "In the face of ambiguity, refuse the temptation > to guess.") but makes a perfect 3rd party module. > > If you want to use regular expressions, and that works OK for the > documents you are handling now, fine. But don't complain when your RE's > match too much or too little or don't match at all because of unclosed > tags, improperly nested tags, nonsense markup, or just a valid combination > that you didn't take into account. > > -- > Gabriel Genellina Thank you. That does make perfect sense, and is a good clear position on the up and down side of what I'm trying to do, as well as a good explanation for why BeautifulSoup will probably remain outside the std lib. I'm sure that I will get plenty of use out of it. If, however, I am sure that the html code in target documents is good, and the framework html doesn't change, just the data on page after page of static html, would it be better to just go with regex or with one of the std lib items you mentioned. I thought the latter, but I'm stuck on how to make them generate results similar to the code I put above as an example. 
I'm not trying to code this to go against html in the wild, but to strip specific, consistently located data from the markup and turn it into something more useful. I may have confused folks by using the www.diveintopython.org page as an example, but its HTML seemed to be valid, strict markup. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
En Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <[EMAIL PROTECTED]> escribió: > On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote: >> Alnilam wrote: >> > On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote: >> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2, >> >> > -1)) doesn't have an xml.dom.ext ... you must have the >> mega-monstrous >> >> > 200-modules PyXML package installed. And you don't want the 75Kb >> >> > BeautifulSoup? >> > Ugh. Found it. Sorry about that, but I still don't understand why >> > there isn't a simple way to do this without using PyXML, BeautifulSoup >> > or libxml2dom. What's the point in having sgmllib, htmllib, >> > HTMLParser, and formatter all built in if I have to use use someone >> > else's modules to write a couple of lines of code that achieve the >> > simple thing I want. I get the feeling that this would be easier if I >> > just broke down and wrote a couple of regular expressions, but it >> > hardly seems a 'pythonic' way of going about things. >> >> This is simply a gross misunderstanding of what BeautifulSoup or lxml >> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_ >> sense is by no means trivial. And just because you can come up with a >> few >> lines of code using rexes that work for your current use-case doesn't >> mean >> that they serve as general html-fixing-routine. Or do you think the >> rather >> long history and 75Kb of code for BS are because it's creator wasn't >> aware >> of rexes? > > I am, by no means, trying to trivialize the work that goes into > creating the numerous modules out there. However as a relatively > novice programmer trying to figure out something, the fact that these > modules are pushed on people with such zealous devotion that you take > offense at my desire to not use them gives me a bit of pause. 
I use > non-included modules for tasks that require them, when the capability > to do something clearly can't be done easily another way (eg. > MySQLdb). I am sure that there will be plenty of times where I will > use BeautifulSoup. In this instance, however, I was trying to solve a > specific problem which I attempted to lay out clearly from the > outset. > > I was asking this community if there was a simple way to use only the > tools included with Python to parse a bit of html. If you *know* that your document is valid HTML, you can use the HTMLParser module in the standard Python library. Or even the parser in the htmllib module. But a lot of HTML pages out there are invalid, some are grossly invalid, and those parsers are just unable to handle them. This is why modules like BeautifulSoup exist: they contain a lot of heuristics and trial-and-error and personal experience from the developers, in order to guess more or less what the page author intended to write and make some sense of that "tag soup". A guesswork like that is not suitable for the std lib ("Errors should never pass silently" and "In the face of ambiguity, refuse the temptation to guess.") but makes a perfect 3rd party module. If you want to use regular expressions, and that works OK for the documents you are handling now, fine. But don't complain when your RE's match too much or too little or don't match at all because of unclosed tags, improperly nested tags, nonsense markup, or just a valid combination that you didn't take into account. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote: > Alnilam wrote: > > On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote: > >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2, > >> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous > >> > 200-modules PyXML package installed. And you don't want the 75Kb > >> > BeautifulSoup? > > >> I wasn't aware that I had PyXML installed, and can't find a reference > >> to having it installed in pydocs. ... > > > Ugh. Found it. Sorry about that, but I still don't understand why > > there isn't a simple way to do this without using PyXML, BeautifulSoup > > or libxml2dom. What's the point in having sgmllib, htmllib, > > HTMLParser, and formatter all built in if I have to use use someone > > else's modules to write a couple of lines of code that achieve the > > simple thing I want. I get the feeling that this would be easier if I > > just broke down and wrote a couple of regular expressions, but it > > hardly seems a 'pythonic' way of going about things. > > This is simply a gross misunderstanding of what BeautifulSoup or lxml > accomplish. Dealing with mal-formatted HTML whilst trying to make _some_ > sense is by no means trivial. And just because you can come up with a few > lines of code using rexes that work for your current use-case doesn't mean > that they serve as general html-fixing-routine. Or do you think the rather > long history and 75Kb of code for BS are because it's creator wasn't aware > of rexes? > > And it also makes no sense stuffing everything remotely useful into the > standard lib. This would force to align development and release cycles, > resulting in much less features and stability as it can be wished. > > And to be honest: I fail to see where your problem is. BeatifulSoup is a > single Python file. 
single Python file. > So whatever you carry with you from machine to machine, > if it's capable of holding a file of your own code, you can simply put > BeautifulSoup beside it - even if it was a floppy disk. > > Diez I am, by no means, trying to trivialize the work that goes into creating the numerous modules out there. However, as a relatively novice programmer trying to figure out something, the fact that these modules are pushed on people with such zealous devotion that you take offense at my desire to not use them gives me a bit of pause. I use non-included modules for tasks that require them, when the capability to do something clearly can't be done easily another way (eg. MySQLdb). I am sure that there will be plenty of times where I will use BeautifulSoup. In this instance, however, I was trying to solve a specific problem which I attempted to lay out clearly from the outset. I was asking this community if there was a simple way to use only the tools included with Python to parse a bit of html. If the answer is no, that's fine. Confusing, but fine. If the answer is yes, great. I look forward to learning from someone's example. If you don't have an answer, or a positive contribution, then please don't interject your angst into this thread. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
Alnilam wrote:
> On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
>> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
>> > 200-modules PyXML package installed. And you don't want the 75Kb
>> > BeautifulSoup?
>>
>> I wasn't aware that I had PyXML installed, and can't find a reference
>> to having it installed in pydocs. ...
>
> Ugh. Found it. Sorry about that, but I still don't understand why
> there isn't a simple way to do this without using PyXML, BeautifulSoup
> or libxml2dom. What's the point in having sgmllib, htmllib,
> HTMLParser, and formatter all built in if I have to use someone
> else's modules to write a couple of lines of code that achieve the
> simple thing I want. I get the feeling that this would be easier if I
> just broke down and wrote a couple of regular expressions, but it
> hardly seems a 'pythonic' way of going about things.

This is simply a gross misunderstanding of what BeautifulSoup or lxml accomplish. Dealing with mal-formatted HTML whilst trying to make _some_ sense of it is by no means trivial. And just because you can come up with a few lines of code using regexes that work for your current use-case doesn't mean that they serve as a general html-fixing routine. Or do you think the rather long history and 75Kb of code for BS are because its creator wasn't aware of regexes?

And it also makes no sense stuffing everything remotely useful into the standard lib. This would force aligned development and release cycles, resulting in fewer features and less stability than could be wished for.

And to be honest: I fail to see where your problem is. BeautifulSoup is a single Python file. So whatever you carry with you from machine to machine, if it's capable of holding a file of your own code, you can simply put BeautifulSoup beside it - even if it was a floppy disk.

Diez
-- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> > 200-modules PyXML package installed. And you don't want the 75Kb
> > BeautifulSoup?
>
> I wasn't aware that I had PyXML installed, and can't find a reference
> to having it installed in pydocs. ...

Ugh. Found it. Sorry about that, but I still don't understand why there isn't a simple way to do this without using PyXML, BeautifulSoup or libxml2dom. What's the point in having sgmllib, htmllib, HTMLParser, and formatter all built in if I have to use someone else's modules to write a couple of lines of code that achieve the simple thing I want. I get the feeling that this would be easier if I just broke down and wrote a couple of regular expressions, but it hardly seems a 'pythonic' way of going about things.

# get the source (assuming you don't have it locally and have an
# internet connection)
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

# set up some regex to find tags, strip them out, and correct some
# formatting oddities
>>> import re
>>> p = re.compile(r'(<p>.*?</p>)', re.DOTALL)
>>> tag_strip = re.compile(r'>(.*?)<', re.DOTALL)
>>> fix_format = re.compile(r'\n +', re.MULTILINE)

# achieve clean results.
>>> paragraphs = re.findall(p, source)
>>> text_list = re.findall(tag_strip, paragraphs[5])
>>> text = "".join(text_list)
>>> clean_text = re.sub(fix_format, " ", text)

This works, and is small and easily reproduced, but seems like it would break easily and seems a waste of the other *ML-specific parsers. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On Jan 22, 7:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> ...I move from computer to
> computer regularly, and while all have a recent copy of Python, each
> has different (or no) extra modules, and I don't always have the
> luxury of downloading extras. That being said, if there's a simple way
> of doing it with BeautifulSoup, please show me an example. Maybe I can
> figure out a way to carry the extra modules I need around with me.

Pyparsing's footprint is intentionally small - just one pyparsing.py file that you can drop into a directory next to your own script. And the code to extract paragraph 5 of the "Dive Into Python" home page? See annotated code below.

-- Paul

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag
import urllib
import textwrap

page = urllib.urlopen("http://diveintopython.org/")
source = page.read()
page.close()

# define a simple paragraph matcher
pStart, pEnd = makeHTMLTags("P")
paragraph = pStart.suppress() + SkipTo(pEnd) + pEnd.suppress()

# get all paragraphs from the input string (or use the scanString
# generator function to stop at the correct paragraph instead of
# reading them all)
paragraphs = paragraph.searchString(source)

# create a transformer that will strip HTML tags
tagStripper = anyOpenTag.suppress() | anyCloseTag.suppress()

# get paragraph[5] and strip the HTML tags
p5TextOnly = tagStripper.transformString(paragraphs[5][0])

# remove extra whitespace
p5TextOnly = " ".join(p5TextOnly.split())

# print out a nicely wrapped string - so few people know that textwrap
# is part of the standard Python distribution, but it is very handy
print textwrap.fill(p5TextOnly, 60)

-- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
> Pardon me, but the standard issue Python 2.n (for n in range(5, 2, > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous > 200-modules PyXML package installed. And you don't want the 75Kb > BeautifulSoup? I wasn't aware that I had PyXML installed, and can't find a reference to having it installed in pydocs. And that highlights the problem I have at the moment with using other modules. I move from computer to computer regularly, and while all have a recent copy of Python, each has different (or no) extra modules, and I don't always have the luxury of downloading extras. That being said, if there's a simple way of doing it with BeautifulSoup, please show me an example. Maybe I can figure out a way to carry the extra modules I need around with me. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On 22 Jan, 06:31, Alnilam <[EMAIL PROTECTED]> wrote: > Sorry for the noob question, but I've gone through the documentation > on python.org, tried some of the diveintopython and boddie's examples, > and looked through some of the numerous posts in this group on the > subject and I'm still rather confused. I know that there are some > great tools out there for doing this (BeautifulSoup, lxml, etc.) but I > am trying to accomplish a simple task with a minimal (as in nil) > amount of adding in modules that aren't "stock" 2.5, and writing a > huge class of my own (or copying one from diveintopython) seems > overkill for what I want to do. It's unfortunate that you don't want to install extra modules, but I'd probably use libxml2dom [1] for what you're about to describe... > Here's what I want to accomplish... I want to open a page, identify a > specific point in the page, and turn the information there into > plaintext. For example, on thewww.diveintopython.orgpage, I want to > turn the paragraph that starts "Translations are freely > permitted" (and ends ..."let me know"), into a string variable. > > Opening the file seems pretty straightforward. > > >>> import urllib > >>> page = urllib.urlopen("http://diveintopython.org/";) > >>> source = page.read() > >>> page.close() > > gets me to a string variable consisting of the un-parsed contents of > the page. Yes, there may be shortcuts that let some parsers read directly from the server, but it's always good to have the page text around, anyway. > Now things get confusing, though, since there appear to be several > approaches. 
> One that I read somewhere was: > > >>> from xml.dom.ext.reader import HtmlLib > >>> reader = HtmlLib.Reader() > >>> doc = reader.fromString(source) > > This gets me doc as > > >>> paragraphs = doc.getElementsByTagName('p') > > gets me all of the paragraph children, and the one I specifically want > can then be referenced with: paragraphs[5] This method seems to be > pretty straightforward, but what do I do with it to get it into a > string cleanly? In less sophisticated DOM implementations, what you'd do is to loop over the "descendant" nodes of the paragraph which are text nodes and concatenate them. > >>> from xml.dom.ext import PrettyPrint > >>> PrettyPrint(paragraphs[5]) > > shows me the text, but still in html, and I can't seem to get it to > turn into a string variable, and I think the PrettyPrint function is > unnecessary for what I want to do. Yes, PrettyPrint is for prettyprinting XML. You just want to visit and collect the text nodes. >Formatter seems to do what I want, > but I can't figure out how to link the "Element Node" at > paragraphs[5] with the formatter functions to produce the string I > want as output. I tried some of the htmllib.HTMLParser(formatter > stuff) examples, but while I can supposedly get that to work with > formatter a little easier, I can't figure out how to get HTMLParser to > drill down specifically to the 6th paragraph's contents. Given that you've found the paragraph above, you just need to write a recursive function which visits child nodes, and if it finds a text node then it collects the value of the node in a list; otherwise, for elements, it visits the child nodes of that element; and so on. The recursive approach is presumably what the formatter uses, but I can't say that I've really looked at it. Meanwhile, with libxml2dom, you'd do something like this: import libxml2dom d = libxml2dom.parseURI("http://www.diveintopython.org/";, html=1) saved = None # Find the paragraphs. 
for p in d.xpath("//p"): # Get the text without leading and trailing space. text = p.textContent.strip() # Save the appropriate paragraph text. if text.startswith("Translations are freely permitted") and \ text.endswith("just let me know."): saved = text break The magic part of this code which saves you from needing to write that recursive function mentioned above is the textContent property on the paragraph element. Paul [1] http://www.python.org/pypi/libxml2dom -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing confusion
On Jan 22, 4:31 pm, Alnilam <[EMAIL PROTECTED]> wrote: > Sorry for the noob question, but I've gone through the documentation > on python.org, tried some of the diveintopython and boddie's examples, > and looked through some of the numerous posts in this group on the > subject and I'm still rather confused. I know that there are some > great tools out there for doing this (BeautifulSoup, lxml, etc.) but I > am trying to accomplish a simple task with a minimal (as in nil) > amount of adding in modules that aren't "stock" 2.5, and writing a > huge class of my own (or copying one from diveintopython) seems > overkill for what I want to do. > > Here's what I want to accomplish... I want to open a page, identify a > specific point in the page, and turn the information there into > plaintext. For example, on thewww.diveintopython.orgpage, I want to > turn the paragraph that starts "Translations are freely > permitted" (and ends ..."let me know"), into a string variable. > > Opening the file seems pretty straightforward. > > >>> import urllib > >>> page = urllib.urlopen("http://diveintopython.org/";) > >>> source = page.read() > >>> page.close() > > gets me to a string variable consisting of the un-parsed contents of > the page. > Now things get confusing, though, since there appear to be several > approaches. > One that I read somewhere was: > > >>> from xml.dom.ext.reader import HtmlLib Pardon me, but the standard issue Python 2.n (for n in range(5, 2, -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous 200-modules PyXML package installed. And you don't want the 75Kb BeautifulSoup? -- http://mail.python.org/mailman/listinfo/python-list
HTML parsing confusion
Sorry for the noob question, but I've gone through the documentation on python.org, tried some of the diveintopython and boddie's examples, and looked through some of the numerous posts in this group on the subject and I'm still rather confused. I know that there are some great tools out there for doing this (BeautifulSoup, lxml, etc.) but I am trying to accomplish a simple task with a minimal (as in nil) amount of adding in modules that aren't "stock" 2.5, and writing a huge class of my own (or copying one from diveintopython) seems overkill for what I want to do. Here's what I want to accomplish... I want to open a page, identify a specific point in the page, and turn the information there into plaintext. For example, on the www.diveintopython.org page, I want to turn the paragraph that starts "Translations are freely permitted" (and ends ..."let me know"), into a string variable. Opening the file seems pretty straightforward. >>> import urllib >>> page = urllib.urlopen("http://diveintopython.org/";) >>> source = page.read() >>> page.close() gets me to a string variable consisting of the un-parsed contents of the page. Now things get confusing, though, since there appear to be several approaches. One that I read somewhere was: >>> from xml.dom.ext.reader import HtmlLib >>> reader = HtmlLib.Reader() >>> doc = reader.fromString(source) This gets me doc as >>> paragraphs = doc.getElementsByTagName('p') gets me all of the paragraph children, and the one I specifically want can then be referenced with: paragraphs[5] This method seems to be pretty straightforward, but what do I do with it to get it into a string cleanly? >>> from xml.dom.ext import PrettyPrint >>> PrettyPrint(paragraphs[5]) shows me the text, but still in html, and I can't seem to get it to turn into a string variable, and I think the PrettyPrint function is unnecessary for what I want to do. 
Formatter seems to do what I want, but I can't figure out how to link the "Element Node" at paragraphs[5] with the formatter functions to produce the string I want as output. I tried some of the htmllib.HTMLParser(formatter stuff) examples, but while I can supposedly get that to work with formatter a little easier, I can't figure out how to get HTMLParser to drill down specifically to the 6th paragraph's contents. Thanks in advance. - A. -- http://mail.python.org/mailman/listinfo/python-list
Re: How to Encode Parameters into an HTML Parsing Script
On Jun 21, 9:45 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote: > En Thu, 21 Jun 2007 23:37:07 -0300, <[EMAIL PROTECTED]> escribió: > > > So for example if I wanted to navigate to an encoded url > >http://online.investools.com/landing.iedu?signedin=truerather than > > justhttp://online.investools.com/landing.iedu How would I do this? > > How can I modify thescriptto urlencode these parameters: > > {signedin:true} and to associate them with a specific url from the > > urlList > > If you want to use GET, append '?' plus the encoded parameters to the > desired url: > > py> data = {'signedin':'true', 'another':42} > py> print urlencode(data) > signedin=true&another=42 > > Do not use the data argument to urlopen. > > -- > Gabriel Genellina Sweet! I love this python group -- http://mail.python.org/mailman/listinfo/python-list
Re: How to Encode Parameters into an HTML Parsing Script
En Thu, 21 Jun 2007 23:37:07 -0300, <[EMAIL PROTECTED]> escribió: > So for example if I wanted to navigate to an encoded url > http://online.investools.com/landing.iedu?signedin=true rather than > just http://online.investools.com/landing.iedu How would I do this? > How can I modify the script to urlencode these parameters: > {signedin:true} and to associate them with a specific url from the > urlList If you want to use GET, append '?' plus the encoded parameters to the desired url: py> data = {'signedin':'true', 'another':42} py> print urlencode(data) signedin=true&another=42 Do not use the data argument to urlopen. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
How to Encode Parameters into an HTML Parsing Script
I've written a Script that navigates various urls on a website, and fetches the contents. The Url's are being fed from a list "urlList". Everything seems to work splendidly, until I introduce the concept of encoding parameters for a certain url. So for example if I wanted to navigate to an encoded url http://online.investools.com/landing.iedu?signedin=true rather than just http://online.investools.com/landing.iedu How would I do this? How can I modify the script to urlencode these parameters: {signedin:true} and to associate them with a specific url from the urlList Thank you! import datetime, time, re, os, sys, traceback, smtplib, string, urllib2, urllib, inspect from urllib2 import build_opener, HTTPCookieProcessor, Request opener = build_opener(HTTPCookieProcessor) from urllib import urlencode def urlopen2(url, data=None, user_agent='urlopen2'): """Opens Our URLS """ if hasattr(data, "__iter__"): data = urlencode(data) headers = {'User-Agent' : user_agent} # User-Agent for Unspecified Browser return opener.open(Request(url, data, headers)) def badCharCheck(host,url): try: page = urlopen2("http://"+host+".investools.com/"+url+"";, ()) pageRead= page.read() print "Loading:",url #print pageRead except: print "Failed: ", traceback.format_tb(sys.exc_info()[2]),'\n' if __name__ == '__main__': host= "online" urlList = ["landing.iedu","sitemap.iedu"] print "\n","* Begin BadCharCheck for", host for url in urlList: badCharCheck(host,url) print'* TEST FINISHED! Total Runs:' sys.exit() OUTPUT: * Begin BadCharCheck for online Loading: landing.iedu Loading: sitemap.iedu * TEST FINISHED! Total Runs: -- http://mail.python.org/mailman/listinfo/python-list
Re: Output of HTML parsing
Jackie schrieb: > On 6 15 , 2 01 , Stefan Behnel <[EMAIL PROTECTED]> wrote: >> Jackie wrote: > >> import lxml.etree as et >> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; >> tree = et.parse(url) >> > >> Stefan- - >> >> - - > > Thank you. But when I tried to run the above part, the following > message showed up: > > Traceback (most recent call last): > File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in > > tree = et.parse(url) > File "etree.pyx", line 1845, in etree.parse > File "parser.pxi", line 928, in etree._parseDocument > File "parser.pxi", line 932, in etree._parseDocumentFromURL > File "parser.pxi", line 849, in etree._parseDocFromFile > File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile > File "parser.pxi", line 631, in etree._handleParseResult > File "parser.pxi", line 602, in etree._raiseParseError > etree.XMLSyntaxError: line 2845: Premature end of data in tag html > line 8 > > Could you please tell me where went wrong? Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom instead: parser = et.HTMLParser() tree = et.parse(url, parser) Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Output of HTML parsing
On 6 15 , 2 01 , Stefan Behnel <[EMAIL PROTECTED]> wrote: > Jackie wrote: > import lxml.etree as et > url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; > tree = et.parse(url) > > Stefan- - > > - - Thank you. But when I tried to run the above part, the following message showed up: Traceback (most recent call last): File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in tree = et.parse(url) File "etree.pyx", line 1845, in etree.parse File "parser.pxi", line 928, in etree._parseDocument File "parser.pxi", line 932, in etree._parseDocumentFromURL File "parser.pxi", line 849, in etree._parseDocFromFile File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile File "parser.pxi", line 631, in etree._handleParseResult File "parser.pxi", line 602, in etree._raiseParseError etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8 Could you please tell me where went wrong? Thank you Jackie -- http://mail.python.org/mailman/listinfo/python-list
Output of html parsing
Hi, all, I want to get the information of the professors (name,title) from the following link: "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; Ideally, I'd like to have a output file where each line is one Prof, including his name and title. In practice, I use the CSV module. The following is my program: --- Program import urllib,re,csv url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; sock = urllib.urlopen(url) htmlSource = sock.read() sock.close() namePattern = re.compile(r'class="name">(.*)') titlePattern = re.compile(r', (.*)\s*') name = namePattern.findall(htmlSource) title_temp = titlePattern.findall(htmlSource) title =[] for item in title_temp: item_new=" ".join(item.split())#Suppress the spaces between 'title' and title.extend([item_new]) output =[] for i in range(len(name)): output.insert(i,[name[i],title[i]])#Generate a list of [name, title] writer = csv.writer(open("professor.csv", "wb")) writer.writerows(output) #output CSV file -- End of Program -- My questions are: 1.The code above assume that each Prof has a tilte. If any one of them does not, the name and title will be mismatched. How to program to allow that title can be empty? 2.Is there any easier way to get the data I want other than using list? 3.Should I close the opened csv file("professor.csv")? How to close it? Thanks! Jackie - All new Yahoo! Mail - - Get a sneak peak at messages with a handy reading pane.-- http://mail.python.org/mailman/listinfo/python-list
Re: Output of HTML parsing
Jackie wrote: > I want to get the information of the professors (name,title) from the > following link: > > "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; That's even XHTML, no need to go through BeautifulSoup. Use lxml instead. http://codespeak.net/lxml > Ideally, I'd like to have a output file where each line is one Prof, > including his name and title. In practice, I use the CSV module. > > > import urllib,re,csv > > url = "http://www.economics.utoronto.ca/index.php/index/person/ > faculty/" > > sock = urllib.urlopen(url) > htmlSource = sock.read() > sock.close() import lxml.etree as et url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; tree = et.parse(url) > namePattern = re.compile(r'class="name">(.*)') > titlePattern = re.compile(r', (.*)\s*') > > name = namePattern.findall(htmlSource) > title_temp = titlePattern.findall(htmlSource) > title =[] > for item in title_temp: > item_new=" ".join(item.split())#Suppress the > spaces between 'title' and > title.extend([item_new]) > > > output =[] > for i in range(len(name)): > output.insert(i,[name[i],title[i]])#Generate a list of > [name, title] # untested get_name_text = et.XPath('normalize-space(td[a/@class="name"]') name_list = [] for name_row in tree.xpath('//tr[td/a/@class = "name"]'): name_list.append( tuple(get_name_text(name_row).split(",", 3) + ["","",""])[:3] ) > writer = csv.writer(open("professor.csv", "wb")) > writer.writerows(output) #output CSV file writer = csv.writer(open("professor.csv", "wb")) writer.writerows(name_list) #output CSV file > -- End of Program > -- > > 3.Should I close the opened csv file("professor.csv")? How to close > it? I guess it has a "close()" function? Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Output of HTML parsing
[ Jackie <[EMAIL PROTECTED]> ] > 1.The code above assume that each Prof has a tilte. If any one of them > does not, the name and title will be mismatched. How to program to > allow that title can be empty? > > 2.Is there any easier way to get the data I want other than using > list? Use BeautifulSoup. > 3.Should I close the opened csv file("professor.csv")? How to close > it? Assign the file object to a separate name (e.g. stream) and then invoke its close method after writing all csv data to it. -- Freedom is always the freedom of dissenters. (Rosa Luxemburg) signature.asc Description: This is a digitally signed message part. -- http://mail.python.org/mailman/listinfo/python-list
Output of HTML parsing
Hi, all, I want to get the information of the professors (name,title) from the following link: "http://www.economics.utoronto.ca/index.php/index/person/faculty/"; Ideally, I'd like to have a output file where each line is one Prof, including his name and title. In practice, I use the CSV module. The following is my program: --- Program import urllib,re,csv url = "http://www.economics.utoronto.ca/index.php/index/person/ faculty/" sock = urllib.urlopen(url) htmlSource = sock.read() sock.close() namePattern = re.compile(r'class="name">(.*)') titlePattern = re.compile(r', (.*)\s*') name = namePattern.findall(htmlSource) title_temp = titlePattern.findall(htmlSource) title =[] for item in title_temp: item_new=" ".join(item.split())#Suppress the spaces between 'title' and title.extend([item_new]) output =[] for i in range(len(name)): output.insert(i,[name[i],title[i]])#Generate a list of [name, title] writer = csv.writer(open("professor.csv", "wb")) writer.writerows(output) #output CSV file -- End of Program -- My questions are: 1.The code above assume that each Prof has a tilte. If any one of them does not, the name and title will be mismatched. How to program to allow that title can be empty? 2.Is there any easier way to get the data I want other than using list? 3.Should I close the opened csv file("professor.csv")? How to close it? Thanks! Jackie -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
John Machin wrote: > One can even use ElementTree, if the HTML is well-formed. See below. > However if it is as ill-formed as the sample (4th "td" element not > closed; I've omitted it below), then the OP would be better off > sticking with Beautiful Soup :-) Or (as we were talking about the best of both worlds already) use lxml's HTML parser, which is also capable of parsing pretty disgusting HTML-like tag soup. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
John Machin wrote: > One can even use ElementTree, if the HTML is well-formed. See below. > However if it is as ill-formed as the sample (4th "td" element not > closed; I've omitted it below), then the OP would be better off > sticking with Beautiful Soup :-) or get the best of both worlds: http://effbot.org/zone/element-soup.htm -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
On Feb 11, 6:05 pm, Ayaz Ahmed Khan <[EMAIL PROTECTED]> wrote: > "mtuller" typed: > > > I have also tried Beautiful Soup, but had trouble understanding the > > documentation > > As Gabriel has suggested, spend a little more time going through the > documentation of BeautifulSoup. It is pretty easy to grasp. > > I'll give you an example: I want to extract the text between the > following span tags in a large HTML source file. > > Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow > Vulnerability > > >>> import re > >>> from BeautifulSoup import BeautifulSoup > >>> from urllib2 import urlopen > >>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/')) > >>> title = soup.find(name='span', attrs={'class':'title'}, > >>> text=re.compile(r'^Linux \w+')) > >>> title > > u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability' > One can even use ElementTree, if the HTML is well-formed. See below. However if it is as ill-formed as the sample (4th "td" element not closed; I've omitted it below), then the OP would be better off sticking with Beautiful Soup :-) C:\junk>type element_soup.py from xml.etree import cElementTree as ET import cStringIO guff = """ LETTER 33,699 1.0 """ tree = ET.parse(cStringIO.StringIO(guff)) for elem in tree.getiterator('td'): key = elem.get('headers') assert elem[0].tag == 'span' value = elem[0].text print repr(key), repr(value) C:\junk>\python25\python element_soup.py 'col1_1' 'LETTER' 'col2_1' '33,699' 'col3_1' '1.0' HTH, John -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
"mtuller" typed: > I have also tried Beautiful Soup, but had trouble understanding the > documentation As Gabriel has suggested, spend a little more time going through the documentation of BeautifulSoup. It is pretty easy to grasp. I'll give you an example: I want to extract the text between the following span tags in a large HTML source file. Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability >>> import re >>> from BeautifulSoup import BeautifulSoup >>> from urllib2 import urlopen >>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/')) >>> title = soup.find(name='span', attrs={'class':'title'}, >>> text=re.compile(r'^Linux \w+')) >>> title u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability' -- Ayaz Ahmed Khan A witty saying proves nothing, but saying something pointless gets people's attention. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing
En Sat, 10 Feb 2007 20:07:43 -0300, mtuller <[EMAIL PROTECTED]> escribió: > > > LETTER > > 33,699 > > 1.0 > > > > I want to extract the 33,699 (which is dynamic) and set the value to a > variable so that I can insert it into a database. I have tried parsing > [...] > I have also tried Beautiful Soup, but had trouble understanding the > documentation, and HTMLParser doesn't seem to do what I want. Can[...] Just try harder with BeautifulSoup, should work OK for your use case. Unfortunately I can't give you an example right now. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
HTML Parsing
Alright. I have tried everything I can find, but am not getting anywhere. I have a web page that has data like this: LETTER 33,699 1.0 What is show is only a small section. I want to extract the 33,699 (which is dynamic) and set the value to a variable so that I can insert it into a database. I have tried parsing the html with pyparsing, and the examples will get it to print all instances with span, of which there are a hundred or so when I use: for srvrtokens in printCount.searchString(printerListHTML): print srvrtokens If I set the last line to srvtokens[3] I get the values, but I don't know grab a single line and then set that as a variable. I have also tried Beautiful Soup, but had trouble understanding the documentation, and HTMLParser doesn't seem to do what I want. Can someone point me to a tutorial or give me some pointers on how to parse html where there are multiple lines with the same tags and then be able to go to a certain line and grab a value and set a variable's value to that? Thanks, Mike -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing and Indexing
On Nov 13, 1:12 pm, [EMAIL PROTECTED] wrote: > > I need a help on HTML parser. > > > I saw a couple of python parsers like pyparsing, yappy, yapps, etc but > they havn't given any example for HTML parsing. Geez, how hard did you look? pyparsing's wiki menu includes an 'Examples' link, which take you to a page of examples including 3 having to do with scraping HTML. You can view the examples right in the wiki, without even having to download the package (of course, you *would* have to download to actually run the examples). -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing and Indexing
[EMAIL PROTECTED] wrote: > I am involved in one project which tends to collect news > information published on selected, known web sites inthe format of > HTML, RSS, etc and sortlist them and create a bookmark on our website > for the news content(we will use django for web development). Currently > this project is under heavy development. > > I need a help on HTML parser. lxml includes an HTML parser which can parse straight from URLs. http://codespeak.net/lxml/ http://cheeseshop.python.org/pypi/lxml Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing and Indexing
[EMAIL PROTECTED] wrote: > I am involved in one project which tends to collect news > information published on selected, known web sites inthe format of > HTML, RSS, etc I just can't imagine why anyone would still want to do this. With RSS, it's an easy (if not trivial) problem. With HTML it's hard, it's unstable, and the legality of recycling others' content like this is far from clear. Are you _sure_ there's still a need to do this thoroughly awkward task? How many sites are there that are worth scraping, permit scraping, and don't yet offer RSS ? -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing and Indexing
a combination of urllib, urlib2 and BeautifulSoup should do it. Read BeautifulSoup's documentation to know how to browse through the DOM. [EMAIL PROTECTED] a écrit : > Hi All, > > I am involved in one project which tends to collect news > information published on selected, known web sites inthe format of > HTML, RSS, etc and sortlist them and create a bookmark on our website > for the news content(we will use django for web development). Currently > this project is under heavy development. > > I need a help on HTML parser. > > I can download the web pages from target sites. Then I have to start > doing parsing. Since they all html web pages, they will have different > styles, tags, it is very hard for me to parse the data. So what we plan > is to have one or more rules for each website and run based on rule. We > can even write some small amount of code for each web site if > required. But Crawler, Parser and Indexer need to run unattended. I > don't know how to proceed next.. > > I saw a couple of python parsers like pyparsing, yappy, yapps, etc but > they havn't given any example for HTML parsing. Someone recommended > using "lynx" to convert the page into the text and parse the data. That > also looks good but still i end of writing a huge chunk of code for > each web page. > > What we need is, > > One nice parser which should work on HTML/text file (lynx output) and > work based on certain rules and return us a result (Am I need magix to > do this :-( ) > > Sorry about my english.. > > Thanks & Regards, > > Krish -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML Parsing and Indexing
[EMAIL PROTECTED] wrote: > I need a help on HTML parser. http://www.effbot.org/pyfaq/tutor-how-do-i-get-data-out-of-html.htm -- http://mail.python.org/mailman/listinfo/python-list
HTML Parsing and Indexing
Hi All, I am involved in one project which tends to collect news information published on selected, known web sites inthe format of HTML, RSS, etc and sortlist them and create a bookmark on our website for the news content(we will use django for web development). Currently this project is under heavy development. I need a help on HTML parser. I can download the web pages from target sites. Then I have to start doing parsing. Since they all html web pages, they will have different styles, tags, it is very hard for me to parse the data. So what we plan is to have one or more rules for each website and run based on rule. We can even write some small amount of code for each web site if required. But Crawler, Parser and Indexer need to run unattended. I don't know how to proceed next.. I saw a couple of python parsers like pyparsing, yappy, yapps, etc but they havn't given any example for HTML parsing. Someone recommended using "lynx" to convert the page into the text and parse the data. That also looks good but still i end of writing a huge chunk of code for each web page. What we need is, One nice parser which should work on HTML/text file (lynx output) and work based on certain rules and return us a result (Am I need magix to do this :-( ) Sorry about my english.. Thanks & Regards, Krish -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing bug?
>> this is a comment in JavaScript, which is itself inside an HTML comment > Did you read the post? misread it rather ... -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing bug?
[EMAIL PROTECTED] wrote: > Python 2.3.5 seems to choke when trying to parse html files, because it > doesn't realize that what's inside is a comment in HTML, > even if this comment is inside , especially if it's a > comment inside that script code too. nope. what's inside is not a comment if it's inside a
Re: HTML parsing bug?
"Istvan Albert" <[EMAIL PROTECTED]> wrote: > >> this is a comment in JavaScript, which is itself inside an HTML comment > >Don't nest HTML comments. Occasionaly it may break the browsers as >well. Did you read the post? He didn't nest HTML comments. He put a Javascript comment inside an HTML comment, inside a pair. Virtually every page with Javascript does exactly the same thing. -- - Tim Roberts, [EMAIL PROTECTED] Providenza & Boekelheide, Inc. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing bug?
> this is a comment in JavaScript, which is itself inside an HTML comment Don't nest HTML comments. Occasionaly it may break the browsers as well. (I remember this from one of the weirdest of bughunts : whenever the number of characters between nested HTML comments was divisible by four the page would render incorrectly ... or something of that sorts) i. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing bug?
<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Python 2.3.5 seems to choke when trying to parse html files, because it > doesn't realize that what's inside is a comment in HTML, > even if this comment is inside , especially if it's a > comment inside that script code too. Actually, you are technically incorrect; try validating the code you posted. Google found this explanation: http://lachy.id.au/log/2005/05/script-comments Feeding even slightly invalid HTML to the standard library parser will often choke it. If you can't guarantee clean sources, best use Tidy first or another parser entirely. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing bug?
> // - this is a comment in JavaScript, which is itself inside > an HTML comment This is supposed to be one line. Got wrapped during posting. -- http://mail.python.org/mailman/listinfo/python-list
HTML parsing bug?
Python 2.3.5 seems to choke when trying to parse html files, because it doesn't realize that what's inside is a comment in HTML, even if this comment is inside , especially if it's a comment inside that script code too. The html file: Choke on this Hey there The Python program: from urllib2 import urlopen from HTMLParser import HTMLParser f = urlopen("file:///PATH_TO_THE_ABOVE/index.html") p = HTMLParser() p.feed(f.read()) -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing/scraping & python
Take a look at SW Explorer Automation
(http://home.comcast.net/~furmana/SWIEAutomation.htm) (SWEA).

SWEA creates an object model (automation interface) for any Web
application running in Internet Explorer. It supports all IE
functionality: frames, JavaScript, dialogs, downloads. The runtime can
also work under non-interactive user accounts (ASP.NET or service
applications) on Windows 2000/2003 Server or Windows XP.
--
http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing/scraping & python
John J. Lee wrote:
> Sanjay Arora <[EMAIL PROTECTED]> writes:
>
>> We are looking to select the language & toolset more suitable for a
>> project that requires getting data from several web-sites in
>> real-time html parsing/scraping. It would require full emulation of
>> the browser, including handling cookies, automated logins & following
>> multiple web-link paths. Multiple threading would be a plus but not a
>> requirement.
> [...]
>
> What's the application?
>
> John

I'll do your googling for you ;-p

(The topic guide needs to be updated for mechanize, pamie, Beautiful
Soup, ClientTable, pullparser, etc.)

http://www.python.org/topics/web/HTML.html
http://blog.ianbicking.org/best-of-the-web-app-test-frameworks.html
--
http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing/scraping & python
Sanjay Arora <[EMAIL PROTECTED]> writes:

> We are looking to select the language & toolset more suitable for a
> project that requires getting data from several web-sites in real-time
> html parsing/scraping. It would require full emulation of the browser,
> including handling cookies, automated logins & following multiple
> web-link paths. Multiple threading would be a plus but not a
> requirement.
[...]

What's the application?

John
--
http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing/scraping & python
"Fuzzyman" <[EMAIL PROTECTED]> writes: > The standard library module for fetching HTML is urllib2. Does urllib2 replace everything in urllib? I thought there was some urllib functionality that urllib2 didn't do. > There is a project called mechanize, built by John Lee on top of > urllib2 and other standard modules. > It will emulate a browsers behaviour - including history, cookies, > basic authentication, etc. urllib2 handles cookies and authentication. I use those features daily. I'm not sure history would apply, unless you're also handling javascript. Is there some other way to ask the browser to go back in history? http://www.mired.org/home/mwm/ Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information. -- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing/scraping & python
The standard library module for fetching HTML is urllib2. The best
module for scraping the HTML is BeautifulSoup.

There is a project called mechanize, built by John Lee on top of urllib2
and other standard modules. It will emulate a browser's behaviour -
including history, cookies, basic authentication, etc.

There are several modules for automated form filling - FormEncode being
one.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
--
http://mail.python.org/mailman/listinfo/python-list
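For the form-filling step, the standard library alone can at least build and submit a POST request; modules like mechanize and FormEncode automate the surrounding workflow. A minimal Python 3 sketch (urllib2 became urllib.request) with an invented URL and field names:

```python
import urllib.parse
import urllib.request

# Encode the form fields as an application/x-www-form-urlencoded body.
form_data = urllib.parse.urlencode({"user": "alice", "q": "python"})

# A Request with a data payload is sent as POST rather than GET.
req = urllib.request.Request(
    "https://example.com/search",
    data=form_data.encode("ascii"),
    headers={"User-Agent": "example-scraper/0.1"},
)

# urllib.request.urlopen(req) would submit the form to the server.
```

What mechanize adds on top of this is parsing the form out of the page first, so field names and the action URL don't have to be hard-coded.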
Re: HTML parsing/scraping & python
Sanjay Arora <[EMAIL PROTECTED]> writes:

> We are looking to select the language & toolset more suitable for a
> project that requires getting data from several web-sites in real-time
> html parsing/scraping. It would require full emulation of the browser,
> including handling cookies, automated logins & following multiple
> web-link paths. Multiple threading would be a plus but not a
> requirement.

Believe it or not, everything you ask for can be done by Python out of
the box. But there are limitations. For one, the HTML parsing module
that comes with Python doesn't handle invalid HTML very well. Thanks to
Netscape, invalid HTML is the rule rather than the exception on the web.
So you probably want to use a third-party module for that. I use
BeautifulSoup, which handles XML and HTML, has a *lovely* API (going
from BeautifulSoup to DOM is always a major disappointment), and works
well with broken X/HTML.

That's sufficient for my needs, but I haven't been asked to do a lot of
automated form filling, so the facilities in the standard library work
for me. There are third-party tools to help with that. I'm sure someone
will suggest them.

> Can you suggest solutions for python? Pros & Cons using Perl vs.
> Python? Why Python?

Because it's beautiful. Seriously, Python code is very readable, by
design. Of course, some of the features that make that happen drive some
people crazy. If you're one of them, then Python isn't the language for
you.

http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more
information.
--
http://mail.python.org/mailman/listinfo/python-list
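BeautifulSoup is a third-party package, but the tolerance described here can be illustrated even with the later standard-library parser: Python 3's html.parser no longer raises on sloppy markup (though, unlike BeautifulSoup, it won't repair the tree for you). A sketch with an invented sample of Netscape-era HTML:

```python
from html.parser import HTMLParser

# Invented sample of the sloppy markup discussed above:
# unclosed <p> tags and a bare <br>.
broken = "<html><body><p>first paragraph<p>second<br>done"

class StartTags(HTMLParser):
    """Collects start tags; the feed completes despite the bad markup."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)

s = StartTags()
s.feed(broken)
s.close()
```

The parser reports every tag it recognises and silently skips what it can't, which is often enough for simple scraping; BeautifulSoup goes further by building a navigable, repaired tree.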
HTML parsing/scraping & python
We are looking to select the language & toolset most suitable for a
project that requires getting data from several web-sites in real-time:
html parsing/scraping. It would require full emulation of the browser,
including handling cookies, automated logins & following multiple
web-link paths. Multiple threading would be a plus but not a
requirement.

Some solutions were suggested:

Perl:
  LWP::Simple
  WWW::Mechanize
  HTML::Parser

Curl & libcurl

Can you suggest solutions for Python? Pros & cons of using Perl vs.
Python? Why Python?

Pointers to various other tools & their comparisons with Python
solutions will be most appreciated. Anyone who is knowledgeable about
the application subject, please do share your knowledge to help us do
this right.

With best regards.
Sanjay.
--
http://mail.python.org/mailman/listinfo/python-list
html parsing
Hi all,

Please help me parse an html document and extract the http links.

Thanks in advance!
Suchitra
--
http://mail.python.org/mailman/listinfo/python-list
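One stdlib-only way to do this is to subclass the HTML parser and collect href attributes; a sketch using the Python 3 module name (html.parser, formerly HTMLParser), with a made-up input document:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

doc = ('<a href="http://example.com">one</a> '
       '<a href="http://example.org/page">two</a>')

ex = LinkExtractor()
ex.feed(doc)
ex.close()
# ex.links now holds the href values in document order.
```

For real pages you would feed it the result of urllib.request.urlopen(...).read().decode(...) instead of a literal string, and perhaps filter `self.links` for the "http" prefix.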