Re: Html Parsing stuff

2014-07-21 Thread Nicholas Cannon
Don't worry, it has been solved.
-- 
https://mail.python.org/mailman/listinfo/python-list


Html Parsing stuff

2014-07-21 Thread Nicholas Cannon
OK, I get the basics of this and I have done some successful parsing, using
regular expressions to find HTML tags. I have tried to find an img tag and
write that image to a file, with no success. A try...except block reports that
the image was written to the file successfully, but when I try to open the
file it says the image was not saved correctly or is damaged. At first I was
just reading the src attribute of the tag and trying to save that link to a
.jpg (the extension of the image). Then I looked deeper, added a forward slash
to the URL and appended the image's src attribute to it, opened that link with
urllib.urlopen(), read the contents and saved them to the file again. I still
got the same result as before. Is there a function in Beautiful Soup or the
urllib module that I can use to save an image? This is just a problem I am
sorting out, not a whole application, so the code is small. Thanks
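
A hedged sketch of the usual way to do this kind of thing: fetch the bytes
behind the img src and write them out in binary mode, after building an
absolute URL. The URLs and file names below are just placeholders:

import urllib
from urlparse import urljoin

page_url = "http://www.example.com/page.html"
src = "images/photo.jpg"                  # value read from the img tag's src attribute
img_url = urljoin(page_url, src)          # turn a relative src into an absolute URL

data = urllib.urlopen(img_url).read()     # raw image bytes
out = open("photo.jpg", "wb")             # "wb" matters: write the bytes untouched
out.write(data)
out.close()

# urllib.urlretrieve(img_url, "photo.jpg") is a one-line alternative.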
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Beautifulsoup html parsing - nested tags

2011-01-05 Thread Selvam
On Wed, Jan 5, 2011 at 2:58 PM, Selvam  wrote:

> Hi all,
>
> I am trying to parse some html string with BeatifulSoup.
>
> The string is,
>
> [HTML sample stripped by the list archive: a <table> containing a <blocktable>
> whose <tr>/<td> cells hold the values Tax, Base and Amount]
>
>
> rtables=soup.findAll(re.compile('table$'))
>
> The rtables is,
>
> [parser output stripped by the list archive: the <tr> with Tax, Base and
> Amount ends up directly inside the <table>, followed by an empty <blocktable>]
>
>
>
> The tr inside the blocktable are appearing inside the table, while
> blocktable contains nothing.
>
> Is there any way, I can get the tr in the right place (inside blocktable) ?
>
> --
> Regards,
> S.Selvam
> SG E-ndicus Infotech Pvt Ltd.
> http://e-ndicus.com/
>
>  " I am because we are "
>

Replying to myself,

BeautifulSoup.BeautifulSoup.NESTABLE_TABLE_TAGS['tr'].append('blocktable')

adding this, solved the issue.
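
A minimal sketch of that registration in context (BeautifulSoup 3.x assumed;
the markup below is a stand-in for the real string):

from BeautifulSoup import BeautifulSoup

# Register <blocktable> as a tag that <tr> may nest inside, before parsing,
# so the tr is no longer hoisted out into the surrounding <table>.
BeautifulSoup.NESTABLE_TABLE_TAGS['tr'].append('blocktable')

html = "<table><blocktable><tr><td>Tax</td></tr></blocktable></table>"
soup = BeautifulSoup(html)
print soup.find('blocktable')   # the tr should now stay inside blocktable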

-- 
Regards,
S.Selvam
SG E-ndicus Infotech Pvt Ltd.
http://e-ndicus.com/

 " I am because we are "
-- 
http://mail.python.org/mailman/listinfo/python-list


Beautifulsoup html parsing - nested tags

2011-01-05 Thread Selvam
Hi all,

I am trying to parse some html string with BeatifulSoup.

The string is,

[HTML sample stripped by the list archive: a <table> containing a <blocktable>
whose <tr>/<td> cells hold the values Tax, Base and Amount]


rtables=soup.findAll(re.compile('table$'))

The rtables is,

[parser output stripped by the list archive: the <tr> with Tax, Base and
Amount ends up directly inside the <table>, followed by an empty <blocktable>]



The tr inside the blocktable are appearing inside the table, while
blocktable contains nothing.

Is there any way, I can get the tr in the right place (inside blocktable) ?

-- 
Regards,
S.Selvam
SG E-ndicus Infotech Pvt Ltd.
http://e-ndicus.com/

 " I am because we are "
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2008-06-30 Thread Larry Bates

[EMAIL PROTECTED] wrote:

Hi everyone
I am trying to build my own web crawler for an experiement and I don't
know how to access HTTP protocol with python.

Also, Are there any Opensource Parsing engine for HTML documents
available in Python too? That would be great.


Check out Mechanize.  It wraps Beautiful Soup inside of methods that aid in 
website crawling.


http://pypi.python.org/pypi/mechanize/0.1.7b
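
A minimal sketch of the kind of crawling it helps with (mechanize assumed
installed; the URL is a placeholder):

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")
html = br.response().read()        # raw page source
for link in br.links():            # links mechanize found on the page
    print link.url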

-Larry
--
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2008-06-29 Thread Sebastian "lunar" Wiesner
Stefan Behnel <[EMAIL PROTECTED]>:

> [EMAIL PROTECTED] wrote:
>> I am trying to build my own web crawler for an experiement and I don't
>> know how to access HTTP protocol with python.
>>
>> Also, Are there any Opensource Parsing engine for HTML documents
>> available in Python too? That would be great.
> 
> Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
> BeautifulSoup and threadable, all of which should be helpful for your
> crawler.

You should mention its powerful features like XPATH and CSS selection
support and its easy API here, too ;)
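
A hedged sketch of both, assuming lxml is installed (the URL and selectors
below are placeholders):

from lxml import html

doc = html.parse("http://www.example.com/").getroot()
headings = doc.xpath("//h1/text()")          # XPath selection
links = doc.cssselect("div.content a")       # CSS selection
print headings
print [a.get("href") for a in links]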

-- 
Freedom is always the freedom of dissenters.
  (Rosa Luxemburg)
--
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2008-06-28 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
> I am trying to build my own web crawler for an experiement and I don't
> know how to access HTTP protocol with python.
>
> Also, Are there any Opensource Parsing engine for HTML documents
> available in Python too? That would be great.

Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
BeautifulSoup and threadable, all of which should be helpful for your crawler.

http://codespeak.net/lxml/

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2008-06-28 Thread Victor Noagbodji
> Hi everyone
Hello

> I am trying to build my own web crawler for an experiement and I don't
> know how to access HTTP protocol with python.
urllib2: http://docs.python.org/lib/module-urllib2.html

> Also, Are there any Opensource Parsing engine for HTML documents
> available in Python too? That would be great.
BeautifulSoup:
  http://www.crummy.com/software/BeautifulSoup/
  http://www.crummy.com/software/BeautifulSoup/documentation.html
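
A minimal sketch combining the two (the URL is a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.example.com/")
soup = BeautifulSoup(page.read())
for a in soup.findAll('a', href=True):   # every link on the page
    print a['href']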

All the best

-- 
NOAGBODJI Paul Victor
--
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2008-06-28 Thread Benjamin
On Jun 28, 9:03 pm, [EMAIL PROTECTED] wrote:
> Hi everyone
> I am trying to build my own web crawler for an experiement and I don't
> know how to access HTTP protocol with python.

Look at the httplib module.
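
A bare-bones sketch, standard library only (host and path are placeholders):

import httplib

conn = httplib.HTTPConnection("www.example.com")
conn.request("GET", "/")
resp = conn.getresponse()
print resp.status, resp.reason
body = resp.read()                 # the raw HTML document
conn.close()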

>
> Also, Are there any Opensource Parsing engine for HTML documents
> available in Python too? That would be great.

--
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2008-06-28 Thread Dan Stromberg
On Sat, 28 Jun 2008 19:03:39 -0700, disappearedng wrote:

> Hi everyone
> I am trying to build my own web crawler for an experiement and I don't
> know how to access HTTP protocol with python.
> 
> Also, Are there any Opensource Parsing engine for HTML documents
> available in Python too? That would be great.

Check out BeautifulSoup.  I don't recall what license it uses, but the 
source is available, and it deals well with not-necessarily-beautiful-
inside HTML.

--
http://mail.python.org/mailman/listinfo/python-list


HTML Parsing

2008-06-28 Thread disappearedng
Hi everyone
I am trying to build my own web crawler for an experiment and I don't
know how to access HTTP protocol with python.

Also, are there any open-source parsing engines for HTML documents
available in Python too? That would be great.


--
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-23 Thread Gabriel Genellina
En Wed, 23 Jan 2008 10:40:14 -0200, Alnilam <[EMAIL PROTECTED]> escribió:

> Skipping past html validation, and html to xhtml 'cleaning', and
> instead starting with the assumption that I have files that are valid
> XHTML, can anyone give me a good example of how I would use _ htmllib,
> HTMLParser, or ElementTree _ to parse out the text of one specific
> childNode, similar to the examples that I provided above using regex?

The diveintopython page is not valid XHTML (but it's valid HTML). Assuming
it's properly converted:

py> from cStringIO import StringIO
py> import xml.etree.ElementTree as ET
py> tree = ET.parse(StringIO(page))
py> elem = tree.findall('//p')[4]
py>
py> # from the online ElementTree docs:
py> # http://www.effbot.org/zone/element-bits-and-pieces.htm
py> def gettext(elem):
...     text = elem.text or ""
...     for e in elem:
...         text += gettext(e)
...         if e.tail:
...             text += e.tail
...     return text
...
py> print gettext(elem)
The complete text is available online. You can read the revision history to
see what's new. Updated 20 May 2004

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-23 Thread Jerry Hill
On Jan 23, 2008 7:40 AM, Alnilam <[EMAIL PROTECTED]> wrote:
> Skipping past html validation, and html to xhtml 'cleaning', and
> instead starting with the assumption that I have files that are valid
> XHTML, can anyone give me a good example of how I would use _ htmllib,
> HTMLParser, or ElementTree _ to parse out the text of one specific
> childNode, similar to the examples that I provided above using regex?

Have you looked at any of the tutorials or sample code for these
libraries?  If you had a specific question, you will probably get more
specific help.  I started writing up some sample code, but realized I
was mostly reprising the long tutorial on SGMLLib here:
http://www.boddie.org.uk/python/HTML.html

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-23 Thread Alnilam
On Jan 23, 3:54 am, "M.-A. Lemburg" <[EMAIL PROTECTED]> wrote:

> >> I was asking this community if there was a simple way to use only the
> >> tools included with Python to parse a bit of html.
>
> There are lots of ways doing HTML parsing in Python. A common
> one is e.g. using mxTidy to convert the HTML into valid XHTML
> and then use ElementTree to parse the data.
>
> http://www.egenix.com/files/python/mxTidy.html
> http://docs.python.org/lib/module-xml.etree.ElementTree.html
>
> For simple tasks you can also use the HTMLParser that's part
> of the Python std lib.
>
> http://docs.python.org/lib/module-HTMLParser.html
>
> Which tools to use is really dependent on what you are
> trying to solve.
>
> --
> Marc-Andre Lemburg
> eGenix.com

Thanks. So far that makes 3 votes for BeautifulSoup, and one vote each
for libxml2dom, pyparsing, and mxTidy. I'm sure those would all be
great solutions, if I was looking to solve my coding question with
external modules.

Several folks have mentioned now that they think that if I have files
that are valid XHTML, that I could use htmllib, HTMLParser, or
ElementTree (all of which are part of the standard libraries in v
2.5).

Skipping past html validation, and html to xhtml 'cleaning', and
instead starting with the assumption that I have files that are valid
XHTML, can anyone give me a good example of how I would use _ htmllib,
HTMLParser, or ElementTree _ to parse out the text of one specific
childNode, similar to the examples that I provided above using regex?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-23 Thread cokofreedom
> The pages I'm trying to write this code to run against aren't in the
> wild, though. They are static html files on my company's lan, are very
> consistent in format, and are (I believe) valid html.

Obvious way to check this is to go to http://validator.w3.org/ and see
what it tells you about your html...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-23 Thread M.-A. Lemburg
On 2008-01-23 01:29, Gabriel Genellina wrote:
> En Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <[EMAIL PROTECTED]> escribió:
> 
>> On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
>>> Alnilam wrote:
>>>> On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
>>>>>> Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>>>>>> -1)) doesn't have an xml.dom.ext ... you must have the  
>>> mega-monstrous
>>>>>> 200-modules PyXML package installed. And you don't want the 75Kb
>>>>>> BeautifulSoup?
>>>> Ugh. Found it. Sorry about that, but I still don't understand why
>>>> there isn't a simple way to do this without using PyXML, BeautifulSoup
>>>> or libxml2dom. What's the point in having sgmllib, htmllib,
>>>> HTMLParser, and formatter all built in if I have to use use someone
>>>> else's modules to write a couple of lines of code that achieve the
>>>> simple thing I want. I get the feeling that this would be easier if I
>>>> just broke down and wrote a couple of regular expressions, but it
>>>> hardly seems a 'pythonic' way of going about things.
>>> This is simply a gross misunderstanding of what BeautifulSoup or lxml
>>> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
>>> sense is by no means trivial. And just because you can come up with a  
>>> few
>>> lines of code using rexes that work for your current use-case doesn't  
>>> mean
>>> that they serve as general html-fixing-routine. Or do you think the  
>>> rather
>>> long history and 75Kb of code for BS are because it's creator wasn't  
>>> aware
>>> of rexes?
>> I am, by no means, trying to trivialize the work that goes into
>> creating the numerous modules out there. However as a relatively
>> novice programmer trying to figure out something, the fact that these
>> modules are pushed on people with such zealous devotion that you take
>> offense at my desire to not use them gives me a bit of pause. I use
>> non-included modules for tasks that require them, when the capability
>> to do something clearly can't be done easily another way (eg.
>> MySQLdb). I am sure that there will be plenty of times where I will
>> use BeautifulSoup. In this instance, however, I was trying to solve a
>> specific problem which I attempted to lay out clearly from the
>> outset.
>>
>> I was asking this community if there was a simple way to use only the
>> tools included with Python to parse a bit of html.

There are lots of ways of doing HTML parsing in Python. A common
one is e.g. using mxTidy to convert the HTML into valid XHTML
and then use ElementTree to parse the data.

http://www.egenix.com/files/python/mxTidy.html
http://docs.python.org/lib/module-xml.etree.ElementTree.html

For simple tasks you can also use the HTMLParser that's part
of the Python std lib.

http://docs.python.org/lib/module-HTMLParser.html

Which tools to use is really dependent on what you are
trying to solve.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 23 2008)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Alnilam
On Jan 22, 7:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
wrote:
>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser  
> module in the standard Python library. Or even the parser in the htmllib  
> module. But a lot of HTML pages out there are invalid, some are grossly  
> invalid, and those parsers are just unable to handle them. This is why  
> modules like BeautifulSoup exist: they contain a lot of heuristics and  
> trial-and-error and personal experience from the developers, in order to  
> guess more or less what the page author intended to write and make some  
> sense of that "tag soup".
> A guesswork like that is not suitable for the std lib ("Errors should  
> never pass silently" and "In the face of ambiguity, refuse the temptation  
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the  
> documents you are handling now, fine. But don't complain when your RE's  
> match too much or too little or don't match at all because of unclosed  
> tags, improperly nested tags, nonsense markup, or just a valid combination  
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thanks, Gabriel. That does make sense, both what the benefits of
BeautifulSoup are and why it probably won't become std lib anytime
soon.

The pages I'm trying to write this code to run against aren't in the
wild, though. They are static html files on my company's lan, are very
consistent in format, and are (I believe) valid html. They just have
specific paragraphs of useful information, located in the same place
in each file, that I want to 'harvest' and put to better use. I used
diveintopython.org as an example only (and in part because it had good
clean html formatting). I am pretty sure that I could craft some
regular expressions to do the work -- which of course would not be the
case if I was screen scraping web pages in the 'wild' -- but I was
trying to find a way to do that using one of those std libs you
mentioned.

I'm not sure if HTMLParser or htmllib would work better to achieve the
same effect as the regex example I gave above, or how to get them to
do that. I thought I'd come close, but as someone pointed out early
on, I'd accidentally tapped into PyXML, which is installed where I was
testing code, but not necessarily where I need it. It may turn out
that the regex way works faster, but falling back on methods I'm
comfortable with doesn't help expand my Python knowledge.

So if anyone can tell me how to get HTMLParser or htmllib to grab a
specific paragraph, and then provide the text in that paragraph in a
clean, markup-free format, I'd appreciate it.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread [EMAIL PROTECTED]
On Jan 22, 7:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
wrote:

>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser
> module in the standard Python library. Or even the parser in the htmllib
> module. But a lot of HTML pages out there are invalid, some are grossly
> invalid, and those parsers are just unable to handle them. This is why
> modules like BeautifulSoup exist: they contain a lot of heuristics and
> trial-and-error and personal experience from the developers, in order to
> guess more or less what the page author intended to write and make some
> sense of that "tag soup".
> A guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the
> documents you are handling now, fine. But don't complain when your RE's
> match too much or too little or don't match at all because of unclosed
> tags, improperly nested tags, nonsense markup, or just a valid combination
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thank you. That does make perfect sense, and is a good clear position
on the up and down side of what I'm trying to do, as well as a good
explanation for why BeautifulSoup will probably remain outside the std
lib. I'm sure that I will get plenty of use out of it.

If, however, I am sure that the html code in the target documents is
good, and the framework html doesn't change, just the data on page
after page of static html, would it be better to go with regex or
with one of the std lib items you mentioned? I thought the latter, but
I'm stuck on how to make them generate results similar to the code I
put above as an example. I'm not trying to code this to go against
html in the wild, but to try to strip specific, consistently located
data from the markup and turn it into something more useful.

I may have confused folks by using the www.diveintopython.org page as
an example, but its html seemed to be valid, strict markup.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Gabriel Genellina
En Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <[EMAIL PROTECTED]> escribió:

> On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
>> Alnilam wrote:
>> > On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
>> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> >> > -1)) doesn't have an xml.dom.ext ... you must have the  
>> mega-monstrous
>> >> > 200-modules PyXML package installed. And you don't want the 75Kb
>> >> > BeautifulSoup?
>> > Ugh. Found it. Sorry about that, but I still don't understand why
>> > there isn't a simple way to do this without using PyXML, BeautifulSoup
>> > or libxml2dom. What's the point in having sgmllib, htmllib,
>> > HTMLParser, and formatter all built in if I have to use use someone
>> > else's modules to write a couple of lines of code that achieve the
>> > simple thing I want. I get the feeling that this would be easier if I
>> > just broke down and wrote a couple of regular expressions, but it
>> > hardly seems a 'pythonic' way of going about things.
>>
>> This is simply a gross misunderstanding of what BeautifulSoup or lxml
>> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
>> sense is by no means trivial. And just because you can come up with a  
>> few
>> lines of code using rexes that work for your current use-case doesn't  
>> mean
>> that they serve as general html-fixing-routine. Or do you think the  
>> rather
>> long history and 75Kb of code for BS are because it's creator wasn't  
>> aware
>> of rexes?
>
> I am, by no means, trying to trivialize the work that goes into
> creating the numerous modules out there. However as a relatively
> novice programmer trying to figure out something, the fact that these
> modules are pushed on people with such zealous devotion that you take
> offense at my desire to not use them gives me a bit of pause. I use
> non-included modules for tasks that require them, when the capability
> to do something clearly can't be done easily another way (eg.
> MySQLdb). I am sure that there will be plenty of times where I will
> use BeautifulSoup. In this instance, however, I was trying to solve a
> specific problem which I attempted to lay out clearly from the
> outset.
>
> I was asking this community if there was a simple way to use only the
> tools included with Python to parse a bit of html.

If you *know* that your document is valid HTML, you can use the HTMLParser  
module in the standard Python library. Or even the parser in the htmllib  
module. But a lot of HTML pages out there are invalid, some are grossly  
invalid, and those parsers are just unable to handle them. This is why  
modules like BeautifulSoup exist: they contain a lot of heuristics and  
trial-and-error and personal experience from the developers, in order to  
guess more or less what the page author intended to write and make some  
sense of that "tag soup".
A guesswork like that is not suitable for the std lib ("Errors should  
never pass silently" and "In the face of ambiguity, refuse the temptation  
to guess.") but makes a perfect 3rd party module.

If you want to use regular expressions, and that works OK for the  
documents you are handling now, fine. But don't complain when your RE's  
match too much or too little or don't match at all because of unclosed  
tags, improperly nested tags, nonsense markup, or just a valid combination  
that you didn't take into account.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Alnilam
On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
> Alnilam wrote:
> > On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> >> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> >> > 200-modules PyXML package installed. And you don't want the 75Kb
> >> > BeautifulSoup?
>
> >> I wasn't aware that I had PyXML installed, and can't find a reference
> >> to having it installed in pydocs. ...
>
> > Ugh. Found it. Sorry about that, but I still don't understand why
> > there isn't a simple way to do this without using PyXML, BeautifulSoup
> > or libxml2dom. What's the point in having sgmllib, htmllib,
> > HTMLParser, and formatter all built in if I have to use use someone
> > else's modules to write a couple of lines of code that achieve the
> > simple thing I want. I get the feeling that this would be easier if I
> > just broke down and wrote a couple of regular expressions, but it
> > hardly seems a 'pythonic' way of going about things.
>
> This is simply a gross misunderstanding of what BeautifulSoup or lxml
> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
> sense is by no means trivial. And just because you can come up with a few
> lines of code using rexes that work for your current use-case doesn't mean
> that they serve as general html-fixing-routine. Or do you think the rather
> long history and 75Kb of code for BS are because it's creator wasn't aware
> of rexes?
>
> And it also makes no sense stuffing everything remotely useful into the
> standard lib. This would force to align development and release cycles,
> resulting in much less features and stability as it can be wished.
>
> And to be honest: I fail to see where your problem is. BeatifulSoup is a
> single Python file. So whatever you carry with you from machine to machine,
> if it's capable of holding a file of your own code, you can simply put
> BeautifulSoup beside it - even if it was a floppy  disk.
>
> Diez


I am, by no means, trying to trivialize the work that goes into
creating the numerous modules out there. However as a relatively
novice programmer trying to figure out something, the fact that these
modules are pushed on people with such zealous devotion that you take
offense at my desire to not use them gives me a bit of pause. I use
non-included modules for tasks that require them, when the capability
to do something clearly can't be done easily another way (eg.
MySQLdb). I am sure that there will be plenty of times where I will
use BeautifulSoup. In this instance, however, I was trying to solve a
specific problem which I attempted to lay out clearly from the
outset.

I was asking this community if there was a simple way to use only the
tools included with Python to parse a bit of html.

If the answer is no, that's fine. Confusing, but fine. If the answer
is yes, great. I look forward to learning from someone's example. If
you don't have an answer, or a positive contribution, then please
don't interject your angst into this thread.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Diez B. Roggisch
Alnilam wrote:

> On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
>> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
>> > 200-modules PyXML package installed. And you don't want the 75Kb
>> > BeautifulSoup?
>>
>> I wasn't aware that I had PyXML installed, and can't find a reference
>> to having it installed in pydocs. ...
> 
> Ugh. Found it. Sorry about that, but I still don't understand why
> there isn't a simple way to do this without using PyXML, BeautifulSoup
> or libxml2dom. What's the point in having sgmllib, htmllib,
> HTMLParser, and formatter all built in if I have to use use someone
> else's modules to write a couple of lines of code that achieve the
> simple thing I want. I get the feeling that this would be easier if I
> just broke down and wrote a couple of regular expressions, but it
> hardly seems a 'pythonic' way of going about things.

This is simply a gross misunderstanding of what BeautifulSoup or lxml
accomplish. Dealing with malformed HTML whilst trying to make _some_ sense of
it is by no means trivial. And just because you can come up with a few lines
of code using rexes that work for your current use-case doesn't mean that they
serve as a general html-fixing routine. Or do you think the rather long
history and 75Kb of code for BS are because its creator wasn't aware of rexes?

And it also makes no sense to stuff everything remotely useful into the
standard lib. That would force development and release cycles to be aligned,
resulting in fewer features and less stability than one would wish for.

And to be honest: I fail to see where your problem is. BeautifulSoup is a
single Python file. So whatever you carry with you from machine to machine, if
it's capable of holding a file of your own code, you can simply put
BeautifulSoup beside it - even if it were a floppy disk.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Alnilam
On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> > 200-modules PyXML package installed. And you don't want the 75Kb
> > BeautifulSoup?
>
> I wasn't aware that I had PyXML installed, and can't find a reference
> to having it installed in pydocs. ...

Ugh. Found it. Sorry about that, but I still don't understand why
there isn't a simple way to do this without using PyXML, BeautifulSoup
or libxml2dom. What's the point in having sgmllib, htmllib,
HTMLParser, and formatter all built in if I have to use someone
else's modules to write a couple of lines of code that achieve the
simple thing I want. I get the feeling that this would be easier if I
just broke down and wrote a couple of regular expressions, but it
hardly seems a 'pythonic' way of going about things.

# get the source (assuming you don't have it locally and have an
internet connection)
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

# set up some regex to find tags, strip them out, and correct some
formatting oddities
>>> import re
>>> p = re.compile(r'<p.*?>(.*?)</p>', re.DOTALL)
>>> tag_strip = re.compile(r'>(.*?)<',re.DOTALL)
>>> fix_format = re.compile(r'\n +',re.MULTILINE)

# achieve clean results.
>>> paragraphs = re.findall(p,source)
>>> text_list = re.findall(tag_strip,paragraphs[5])
>>> text = "".join(text_list)
>>> clean_text = re.sub(fix_format," ",text)

This works, and is small and easily reproduced, but seems like it
would break easily and seems a waste of other *ML specific parsers.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Paul McGuire
On Jan 22, 7:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> ...I move from computer to
> computer regularly, and while all have a recent copy of Python, each
> has different (or no) extra modules, and I don't always have the
> luxury of downloading extras. That being said, if there's a simple way
> of doing it with BeautifulSoup, please show me an example. Maybe I can
> figure out a way to carry the extra modules I need around with me.

Pyparsing's footprint is intentionally small - just one pyparsing.py
file that you can drop into a directory next to your own script.  And
the code to extract paragraph 5 of the "Dive Into Python" home page?
See annotated code below.

-- Paul

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag
import urllib
import textwrap

page = urllib.urlopen("http://diveintopython.org/")
source = page.read()
page.close()

# define a simple paragraph matcher
pStart,pEnd = makeHTMLTags("P")
paragraph = pStart.suppress() + SkipTo(pEnd) + pEnd.suppress()

# get all paragraphs from the input string (or use the
# scanString generator function to stop at the correct
# paragraph instead of reading them all)
paragraphs = paragraph.searchString(source)

# create a transformer that will strip HTML tags
tagStripper = anyOpenTag.suppress() | anyCloseTag.suppress()

# get paragraph[5] and strip the HTML tags
p5TextOnly = tagStripper.transformString(paragraphs[5][0])

# remove extra whitespace
p5TextOnly = " ".join(p5TextOnly.split())

# print out a nicely wrapped string - so few people know
# that textwrap is part of the standard Python distribution,
# but it is very handy
print textwrap.fill(p5TextOnly, 60)

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Alnilam
> Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> 200-modules PyXML package installed. And you don't want the 75Kb
> BeautifulSoup?

I wasn't aware that I had PyXML installed, and can't find a reference
to having it installed in pydocs. And that highlights the problem I
have at the moment with using other modules. I move from computer to
computer regularly, and while all have a recent copy of Python, each
has different (or no) extra modules, and I don't always have the
luxury of downloading extras. That being said, if there's a simple way
of doing it with BeautifulSoup, please show me an example. Maybe I can
figure out a way to carry the extra modules I need around with me.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread Paul Boddie
On 22 Jan, 06:31, Alnilam <[EMAIL PROTECTED]> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.

It's unfortunate that you don't want to install extra modules, but I'd
probably use libxml2dom [1] for what you're about to describe...

> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on the www.diveintopython.org page, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython.org/")
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.

Yes, there may be shortcuts that let some parsers read directly from
the server, but it's always good to have the page text around, anyway.

> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib
> >>> reader = HtmlLib.Reader()
> >>> doc = reader.fromString(source)
>
> This gets me doc as 
>
> >>> paragraphs = doc.getElementsByTagName('p')
>
> gets me all of the paragraph children, and the one I specifically want
> can then be referenced with: paragraphs[5] This method seems to be
> pretty straightforward, but what do I do with it to get it into a
> string cleanly?

In less sophisticated DOM implementations, what you'd do is to loop
over the "descendant" nodes of the paragraph which are text nodes and
concatenate them.

> >>> from xml.dom.ext import PrettyPrint
> >>> PrettyPrint(paragraphs[5])
>
> shows me the text, but still in html, and I can't seem to get it to
> turn into a string variable, and I think the PrettyPrint function is
> unnecessary for what I want to do.

Yes, PrettyPrint is for prettyprinting XML. You just want to visit and
collect the text nodes.

>Formatter seems to do what I want,
> but I can't figure out how to link the  "Element Node" at
> paragraphs[5] with the formatter functions to produce the string I
> want as output. I tried some of the htmllib.HTMLParser(formatter
> stuff) examples, but while I can supposedly get that to work with
> formatter a little easier, I can't figure out how to get HTMLParser to
> drill down specifically to the 6th paragraph's contents.

Given that you've found the paragraph above, you just need to write a
recursive function which visits child nodes, and if it finds a text
node then it collects the value of the node in a list; otherwise, for
elements, it visits the child nodes of that element; and so on. The
recursive approach is presumably what the formatter uses, but I can't
say that I've really looked at it.
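
A hedged sketch of that recursive collector, written against xml.dom-style
nodes ('paragraph' stands in for the element found above):

from xml.dom import Node

def collect_text(node, parts):
    # Gather the data of every text node beneath 'node', depth first.
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            collect_text(child, parts)
    return parts

text = "".join(collect_text(paragraph, []))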

Meanwhile, with libxml2dom, you'd do something like this:

  import libxml2dom
  d = libxml2dom.parseURI("http://www.diveintopython.org/", html=1)
  saved = None

  # Find the paragraphs.
  for p in d.xpath("//p"):

# Get the text without leading and trailing space.
text = p.textContent.strip()

# Save the appropriate paragraph text.
if text.startswith("Translations are freely permitted") and \
  text.endswith("just let me know."):

  saved = text
  break

The magic part of this code which saves you from needing to write that
recursive function mentioned above is the textContent property on the
paragraph element.

Paul

[1] http://www.python.org/pypi/libxml2dom
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing confusion

2008-01-22 Thread John Machin
On Jan 22, 4:31 pm, Alnilam <[EMAIL PROTECTED]> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.
>
> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on the www.diveintopython.org page, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython.org/")
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.
> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib

Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
-1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
200-modules PyXML package installed. And you don't want the 75Kb
BeautifulSoup?

-- 
http://mail.python.org/mailman/listinfo/python-list


HTML parsing confusion

2008-01-21 Thread Alnilam
Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and boddie's examples,
and looked through some of the numerous posts in this group on the
subject and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
am trying to accomplish a simple task with a minimal (as in nil)
amount of adding in modules that aren't "stock" 2.5, and writing a
huge class of my own (or copying one from diveintopython) seems
overkill for what I want to do.

Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.

>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

gets me to a string variable consisting of the un-parsed contents of
the page.
Now things get confusing, though, since there appear to be several
approaches.
One that I read somewhere was:

>>> from xml.dom.ext.reader import HtmlLib
>>> reader = HtmlLib.Reader()
>>> doc = reader.fromString(source)

This gets me doc as 

>>> paragraphs = doc.getElementsByTagName('p')

gets me all of the paragraph children, and the one I specifically want
can then be referenced with: paragraphs[5] This method seems to be
pretty straightforward, but what do I do with it to get it into a
string cleanly?

>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(paragraphs[5])

shows me the text, but still in html, and I can't seem to get it to
turn into a string variable, and I think the PrettyPrint function is
unnecessary for what I want to do. Formatter seems to do what I want,
but I can't figure out how to link the  "Element Node" at
paragraphs[5] with the formatter functions to produce the string I
want as output. I tried some of the htmllib.HTMLParser(formatter
stuff) examples, but while I can supposedly get that to work with
formatter a little easier, I can't figure out how to get HTMLParser to
drill down specifically to the 6th paragraph's contents.

Thanks in advance.

- A.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to Encode Parameters into an HTML Parsing Script

2007-06-22 Thread SMERSH009X
On Jun 21, 9:45 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
wrote:
> En Thu, 21 Jun 2007 23:37:07 -0300, <[EMAIL PROTECTED]> escribió:
>
> > So for example if I wanted to navigate to an encoded url
> > http://online.investools.com/landing.iedu?signedin=true rather than
> > just http://online.investools.com/landing.iedu  How would I do this?
> > How can I modify the script to urlencode these parameters:
> > {signedin:true} and to associate them with a specific url from the
> > urlList
>
> If you want to use GET, append '?' plus the encoded parameters to the
> desired url:
>
> py> data = {'signedin':'true', 'another':42}
> py> print urlencode(data)
> signedin=true&another=42
>
> Do not use the data argument to urlopen.
>
> --
> Gabriel Genellina

Sweet! I love this python group

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: How to Encode Parameters into an HTML Parsing Script

2007-06-21 Thread Gabriel Genellina
En Thu, 21 Jun 2007 23:37:07 -0300, <[EMAIL PROTECTED]> escribió:

> So for example if I wanted to navigate to an encoded url
> http://online.investools.com/landing.iedu?signedin=true rather than
> just http://online.investools.com/landing.iedu   How would I do this?
> How can I modify the script to urlencode these parameters:
> {signedin:true} and to associate them with a specific url from the
> urlList

If you want to use GET, append '?' plus the encoded parameters to the  
desired url:

py> data = {'signedin':'true', 'another':42}
py> print urlencode(data)
signedin=true&another=42

Do not use the data argument to urlopen.
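
Spelled out as a minimal sketch (the url and parameter are the ones from the
thread):

from urllib import urlencode
from urllib2 import urlopen

params = {'signedin': 'true'}
url = "http://online.investools.com/landing.iedu" + "?" + urlencode(params)
page = urlopen(url)            # no data argument, so this stays a GET request
print page.geturl()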

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


How to Encode Parameters into an HTML Parsing Script

2007-06-21 Thread SMERSH009X
I've written a Script that navigates various urls on a website, and
fetches the contents.
The Url's are being fed from a list "urlList". Everything seems to
work splendidly, until I introduce the concept of encoding parameters
for a certain url.
So for example if I wanted to navigate to an encoded url
http://online.investools.com/landing.iedu?signedin=true rather than
just http://online.investools.com/landing.iedu   How would I do this?
How can I modify the script to urlencode these parameters:
{signedin:true} and to associate them with a specific url from the
urlList
 Thank you!


import datetime, time, re, os, sys, traceback, smtplib, string, urllib2, urllib, inspect
from urllib2 import build_opener, HTTPCookieProcessor, Request
opener = build_opener(HTTPCookieProcessor)
from urllib import urlencode

def urlopen2(url, data=None, user_agent='urlopen2'):
    """Opens Our URLS """
    if hasattr(data, "__iter__"):
        data = urlencode(data)
    headers = {'User-Agent': user_agent}   # User-Agent for unspecified browser
    return opener.open(Request(url, data, headers))

def badCharCheck(host, url):
    try:
        page = urlopen2("http://" + host + ".investools.com/" + url, ())
        pageRead = page.read()
        print "Loading:", url
        #print pageRead
    except:
        print "Failed: ", traceback.format_tb(sys.exc_info()[2]), '\n'


if __name__ == '__main__':
    host = "online"
    urlList = ["landing.iedu", "sitemap.iedu"]
    print "\n", "* Begin BadCharCheck for", host
    for url in urlList:
        badCharCheck(host, url)

    print '* TEST FINISHED! Total Runs:'
    sys.exit()

OUTPUT:
* Begin BadCharCheck for online
Loading: landing.iedu
Loading: sitemap.iedu
* TEST FINISHED! Total Runs:

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Output of HTML parsing

2007-06-19 Thread Stefan Behnel
Jackie schrieb:
On Jun 15, 2:01, Stefan Behnel <[EMAIL PROTECTED]> wrote:
>> Jackie wrote:
> 
>> import lxml.etree as et
>> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/";
>> tree = et.parse(url)
>>
> 
>> Stefan
> 
> Thank you. But when I tried to run the above part, the following
> message showed up:
> 
> Traceback (most recent call last):
>   File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in
> 
> tree = et.parse(url)
>   File "etree.pyx", line 1845, in etree.parse
>   File "parser.pxi", line 928, in etree._parseDocument
>   File "parser.pxi", line 932, in etree._parseDocumentFromURL
>   File "parser.pxi", line 849, in etree._parseDocFromFile
>   File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 631, in etree._handleParseResult
>   File "parser.pxi", line 602, in etree._raiseParseError
> etree.XMLSyntaxError: line 2845: Premature end of data in tag html
> line 8
> 
> Could you please tell me what went wrong?

Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom
instead:

parser = et.HTMLParser()
tree = et.parse(url, parser)
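
Put together with the earlier snippet, a hedged version of the whole idiom
(lxml assumed installed):

import lxml.etree as et

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
parser = et.HTMLParser()       # tolerant HTML parser instead of the default XML one
tree = et.parse(url, parser)
print tree.getroot().tag       # 'html' once the page has parsed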

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Output of HTML parsing

2007-06-19 Thread Jackie
On Jun 15, 2:01, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> Jackie wrote:

> import lxml.etree as et
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/";
> tree = et.parse(url)
>

> Stefan

Thank you. But when I tried to run the above part, the following
message showed up:

Traceback (most recent call last):
  File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in

tree = et.parse(url)
  File "etree.pyx", line 1845, in etree.parse
  File "parser.pxi", line 928, in etree._parseDocument
  File "parser.pxi", line 932, in etree._parseDocumentFromURL
  File "parser.pxi", line 849, in etree._parseDocFromFile
  File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 631, in etree._handleParseResult
  File "parser.pxi", line 602, in etree._raiseParseError
etree.XMLSyntaxError: line 2845: Premature end of data in tag html
line 8

Could you please tell me what went wrong?

Thank you

Jackie

-- 
http://mail.python.org/mailman/listinfo/python-list


Output of html parsing

2007-06-16 Thread Jackie Wang
Hi, all,
   
  I want to get the information of the professors (name,title) from the 
following link:
   
  "http://www.economics.utoronto.ca/index.php/index/person/faculty/";
   
Ideally, I'd like to have an output file where each line is one Prof,
including his name and title. In practice, I use the CSV module.
   
  The following is my program:
  
--- Program ---

import urllib, re, csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)')
titlePattern = re.compile(r', (.*)\s*')
name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title = []
for item in title_temp:
    item_new = " ".join(item.split())   # Suppress the spaces between 'title' and 
    title.extend([item_new])

output = []
for i in range(len(name)):
    output.insert(i, [name[i], title[i]])   # Generate a list of [name, title]

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)   # output CSV file

-- End of Program --
   
  My questions are:
   
1. The code above assumes that each Prof has a title. If any one of them does
not, the name and title will be mismatched. How can I program it so that the
title is allowed to be empty?

2. Is there any easier way to get the data I want other than using a list?

3. Should I close the opened csv file ("professor.csv")? How do I close it?
   
  Thanks!
   
  Jackie
  
 

   
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Output of HTML parsing

2007-06-15 Thread Stefan Behnel
Jackie wrote:
> I want to get the information of the professors (name,title) from the
> following link:
> 
> "http://www.economics.utoronto.ca/index.php/index/person/faculty/";

That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

http://codespeak.net/lxml


> Ideally, I'd like to have a output file where each line is one Prof,
> including his name and title. In practice, I use the CSV module.
> 
> 
> import urllib,re,csv
> 
> url = "http://www.economics.utoronto.ca/index.php/index/person/
> faculty/"
> 
> sock = urllib.urlopen(url)
> htmlSource = sock.read()
> sock.close()

import lxml.etree as et
url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/";
tree = et.parse(url)

> namePattern = re.compile(r'class="name">(.*)')
> titlePattern = re.compile(r', (.*)\s*')
> 
> name = namePattern.findall(htmlSource)
> title_temp = titlePattern.findall(htmlSource)
> title =[]
> for item in title_temp:
> item_new=" ".join(item.split())#Suppress the
> spaces between 'title' and 
> title.extend([item_new])
> 
> 
> output =[]
> for i in range(len(name)):
> output.insert(i,[name[i],title[i]])#Generate a list of
> [name, title]

# untested
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
    name_list.append(
        tuple(get_name_text(name_row).split(",", 3) + ["", "", ""])[:3])


> writer = csv.writer(open("professor.csv", "wb"))
> writer.writerows(output)   #output CSV file

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list) #output CSV file
> -- End of Program
> --
> 
> 3.Should I close the opened csv file("professor.csv")? How to close
> it?

I guess it has a "close()" function?

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Output of HTML parsing

2007-06-15 Thread Sebastian Wiesner
[ Jackie <[EMAIL PROTECTED]> ]
> 1.The code above assume that each Prof has a tilte. If any one of them
> does not, the name and title will be mismatched. How to program to
> allow that title can be empty?
>
> 2.Is there any easier way to get the data I want other than using
> list?

Use BeautifulSoup.

> 3.Should I close the opened csv file("professor.csv")? How to close
> it?

Assign the file object to a separate name (e.g. stream) and then invoke its 
close method after writing all csv data to it.
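
In other words, a minimal sketch ('output' being the list of rows built in the
original script):

import csv

stream = open("professor.csv", "wb")
writer = csv.writer(stream)
writer.writerows(output)   # the list of [name, title] rows
stream.close()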

-- 
Freedom is always the freedom of dissenters.
  (Rosa Luxemburg)


-- 
http://mail.python.org/mailman/listinfo/python-list

Output of HTML parsing

2007-06-15 Thread Jackie
Hi, all,

I want to get the information of the professors (name,title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/";

Ideally, I'd like to have an output file where each line is one Prof,
including his name and title. In practice, I use the CSV module.

The following is my program:


--- Program


import urllib, re, csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)')
titlePattern = re.compile(r', (.*)\s*')

name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title = []
for item in title_temp:
    item_new = " ".join(item.split())   # Suppress the spaces between 'title' and 
    title.extend([item_new])


output = []
for i in range(len(name)):
    output.insert(i, [name[i], title[i]])   # Generate a list of [name, title]

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)   # output CSV file

-- End of Program
--

My questions are:

1. The code above assumes that each Prof has a title. If any one of them
does not, the name and title will be mismatched. How can I program it so
that the title is allowed to be empty?

2. Is there any easier way to get the data I want other than using a
list?

3. Should I close the opened csv file ("professor.csv")? How do I close
it?

Thanks!

Jackie

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: HTML Parsing

2007-02-25 Thread Stefan Behnel
John Machin wrote:
> One can even use ElementTree, if the HTML is well-formed. See below.
> However if it is as ill-formed as the sample (4th "td" element not
> closed; I've omitted it below), then the OP would be better off
> sticking with Beautiful Soup :-)

Or (as we were talking about the best of both worlds already) use lxml's HTML
parser, which is also capable of parsing pretty disgusting HTML-like tag soup.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2007-02-11 Thread Fredrik Lundh
John Machin wrote:

> One can even use ElementTree, if the HTML is well-formed. See below.
> However if it is as ill-formed as the sample (4th "td" element not
> closed; I've omitted it below), then the OP would be better off
> sticking with Beautiful Soup :-)

or get the best of both worlds:

http://effbot.org/zone/element-soup.htm

 



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2007-02-11 Thread John Machin
On Feb 11, 6:05 pm, Ayaz Ahmed Khan <[EMAIL PROTECTED]> wrote:
> "mtuller" typed:
>
> > I have also tried Beautiful Soup, but had trouble understanding the
> > documentation
>
> As Gabriel has suggested, spend a little more time going through the
> documentation of BeautifulSoup. It is pretty easy to grasp.
>
> I'll give you an example: I want to extract the text between the
> following span tags in a large HTML source file.
>
> <span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow
> Vulnerability</span>
>
> >>> import re
> >>> from BeautifulSoup import BeautifulSoup
> >>> from urllib2 import urlopen
> >>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
> >>> title = soup.find(name='span', attrs={'class':'title'}, 
> >>> text=re.compile(r'^Linux \w+'))
> >>> title
>
> u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'
>

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :-)

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """


LETTER

33,699

1.0

"""

tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
key = elem.get('headers')
assert elem[0].tag == 'span'
value = elem[0].text
print repr(key), repr(value)

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'

HTH,
John





-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2007-02-10 Thread Ayaz Ahmed Khan
"mtuller" typed:

> I have also tried Beautiful Soup, but had trouble understanding the
> documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow
Vulnerability</span>

>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/')) 
>>> title = soup.find(name='span', attrs={'class':'title'},
... text=re.compile(r'^Linux \w+'))
>>> title
u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

-- 
Ayaz Ahmed Khan

A witty saying proves nothing, but saying something pointless gets
people's attention.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing

2007-02-10 Thread Gabriel Genellina
On Sat, 10 Feb 2007 20:07:43 -0300, mtuller <[EMAIL PROTECTED]> wrote:

> <tr>
> <td headers="col1_1"><span>LETTER</span></td>
> <td headers="col2_1"><span>33,699</span></td>
> <td headers="col3_1"><span>1.0</span></td>
> <td>
> </tr>
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database. I have tried parsing  
> [...]
> I have also tried Beautiful Soup, but had trouble understanding the
> documentation, and HTMLParser doesn't seem to do what I want. Can[...]

Just try harder with BeautifulSoup, should work OK for your use case.
Unfortunately I can't give you an example right now.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


HTML Parsing

2007-02-10 Thread mtuller
Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr>
<td headers="col1_1"><span>LETTER</span></td>
<td headers="col2_1"><span>33,699</span></td>
<td headers="col3_1"><span>1.0</span></td>
<td>
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't know
how to grab a single line and then set that as a variable.
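
(For reference, a sketch of one way to do that with pyparsing; the headers
value "col2_1" comes from the snippet above, and printerListHTML is the page
source variable already used above:)

from pyparsing import makeHTMLTags, SkipTo

tdStart, tdEnd = makeHTMLTags("td")
spanStart, spanEnd = makeHTMLTags("span")
cell = (tdStart + spanStart + SkipTo(spanEnd).setResultsName("cellText")
        + spanEnd + tdEnd)

for tokens, start, end in cell.scanString(printerListHTML):
    if tokens.headers == "col2_1":      # the cell holding the dynamic count
        pages = tokens.cellText         # e.g. "33,699"
        print pages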

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?

Thanks,

Mike

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing and Indexing

2006-11-16 Thread Paul McGuire
On Nov 13, 1:12 pm, [EMAIL PROTECTED] wrote:
>
> I need help with an HTML parser.
>

>
> I saw a couple of python parsers like pyparsing, yappy, yapps, etc but
> they haven't given any example for HTML parsing.

Geez, how hard did you look?  pyparsing's wiki menu includes an
'Examples' link, which takes you to a page of examples including 3
having to do with scraping HTML.  You can view the examples right in
the wiki, without even having to download the package (of course, you
*would* have to download to actually run the examples).

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing and Indexing

2006-11-13 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
> I am involved in one project which intends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc., shortlist it, and create a bookmark on our website
> for the news content (we will use django for web development). Currently
> this project is under heavy development.
> 
> I need help with an HTML parser.

lxml includes an HTML parser which can parse straight from URLs.

http://codespeak.net/lxml/
http://cheeseshop.python.org/pypi/lxml
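
A short sketch of that (the URL here is a placeholder):

from lxml import etree

tree = etree.parse("http://www.example.com/news.html", etree.HTMLParser())
print tree.getroot().findtext(".//title")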

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing and Indexing

2006-11-13 Thread Andy Dingley

[EMAIL PROTECTED] wrote:

> I am involved in one project which intends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear.  Are you _sure_ there's
still a need to do this thoroughly awkward task?  How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parsing and Indexing

2006-11-13 Thread Bernard
A combination of urllib, urllib2 and BeautifulSoup should do it.
Read BeautifulSoup's documentation to know how to browse through the
DOM.
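
For example (a tiny made-up snippet, just to show the navigation style):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<div><h2>Headline</h2><p>Story text</p></div>")
h2 = soup.find("h2")
print h2.string            # u'Headline'
print h2.parent.name       # u'div'
print h2.nextSibling       # the following <p> element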

[EMAIL PROTECTED] wrote:

> Hi All,
>
> I am involved in one project which intends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc., shortlist it, and create a bookmark on our website
> for the news content (we will use django for web development). Currently
> this project is under heavy development.
>
> I need help with an HTML parser.
>
> I can download the web pages from target sites. Then I have to start
> doing parsing. Since they are all html web pages, they will have different
> styles and tags, so it is very hard for me to parse the data. So what we plan
> is to have one or more rules for each website and run based on those rules. We
> can even write a small amount of code for each web site if
> required. But the Crawler, Parser and Indexer need to run unattended. I
> don't know how to proceed next.
>
> I saw a couple of python parsers like pyparsing, yappy, yapps, etc but
> they haven't given any example for HTML parsing. Someone recommended
> using "lynx" to convert the page into text and parse the data. That
> also looks good, but I still end up writing a huge chunk of code for
> each web page.
>
> What we need is:
>
> One nice parser which works on an HTML/text file (lynx output), applies
> certain rules, and returns us a result (do I need magic to
> do this? :-( )
> 
> Sorry about my english..
> 
> Thanks & Regards,
> 
> Krish

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: HTML Parsing and Indexing

2006-11-13 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

> I need help with an HTML parser.

http://www.effbot.org/pyfaq/tutor-how-do-i-get-data-out-of-html.htm



-- 
http://mail.python.org/mailman/listinfo/python-list


HTML Parsing and Indexing

2006-11-13 Thread mailtogops
Hi All,

I am involved in one project which intends to collect news
information published on selected, known web sites in the format of
HTML, RSS, etc., shortlist it, and create a bookmark on our website
for the news content (we will use django for web development). Currently
this project is under heavy development.

I need help with an HTML parser.

I can download the web pages from target sites. Then I have to start
doing parsing. Since they are all html web pages, they will have different
styles and tags, so it is very hard for me to parse the data. So what we plan
is to have one or more rules for each website and run based on those rules. We
can even write a small amount of code for each web site if
required. But the Crawler, Parser and Indexer need to run unattended. I
don't know how to proceed next.

I saw a couple of python parsers like pyparsing, yappy, yapps, etc but
they haven't given any example for HTML parsing. Someone recommended
using "lynx" to convert the page into text and parse the data. That
also looks good, but I still end up writing a huge chunk of code for
each web page.

What we need is:

One nice parser which works on an HTML/text file (lynx output), applies
certain rules, and returns us a result (do I need magic to
do this? :-( )

Sorry about my english..

Thanks & Regards,

Krish

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing bug?

2006-02-02 Thread Istvan Albert
>> this is a comment in JavaScript, which is itself inside an HTML comment

> Did you read the post? 

misread it rather ...

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing bug?

2006-02-01 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

> Python 2.3.5 seems to choke when trying to parse html files, because it
> doesn't realize that what's inside <!-- --> is a comment in HTML,
> even if this comment is inside <script> </script>, especially if it's a
> comment inside that script code too.

nope.  what's inside <!-- --> is not a comment if it's inside a <script>
element.

Re: HTML parsing bug?

2006-02-01 Thread Tim Roberts
"Istvan Albert" <[EMAIL PROTECTED]> wrote:
>
>> this is a comment in JavaScript, which is itself inside an HTML comment
>
>Don't nest HTML comments. Occasionally it may break the browsers as
>well.

Did you read the post?  He didn't nest HTML comments.  He put a Javascript
comment inside an HTML comment, inside a <script></script> pair.  Virtually
every page with Javascript does exactly the same thing.
-- 
- Tim Roberts, [EMAIL PROTECTED]
  Providenza & Boekelheide, Inc.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing bug?

2006-01-30 Thread Istvan Albert
> this is a comment in JavaScript, which is itself inside an HTML comment

Don't nest HTML comments. Occasionally it may break the browsers as
well.

(I remember this from one of the weirdest of bughunts: whenever the
number of characters between nested HTML comments was divisible by four
the page would render incorrectly ... or something of that sort)

i.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing bug?

2006-01-30 Thread Richard Brodie

<[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]

> Python 2.3.5 seems to choke when trying to parse html files, because it
> doesn't realize that what's inside <!-- --> is a comment in HTML,
> even if this comment is inside <script> </script>, especially if it's a
> comment inside that script code too.

Actually, you are technically incorrect;  try validating the code you posted.
Google found this explanation: http://lachy.id.au/log/2005/05/script-comments
Feeding even slightly invalid HTML to the standard library parser will often
choke it. If you can't guarantee clean sources, best use Tidy first or another
parser entirely.



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing bug?

2006-01-30 Thread G.
> //   - this is a comment in JavaScript, which is itself inside
> an HTML comment

This is supposed to be one line. Got wrapped during posting.

-- 
http://mail.python.org/mailman/listinfo/python-list


HTML parsing bug?

2006-01-30 Thread g_no_mail_please
Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

The html file:

<html>
<head>
<title>Choke on this</title>
<script>
<!--
//   - this is a comment in JavaScript, which is itself inside an HTML comment
// -->
</script>
</head>
<body>
Hey there
</body>
</html>

The Python program:

from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.feed(f.read())

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing/scraping & python

2005-12-09 Thread alex_f_il
Take a look at SW Explorer Automation
(http://home.comcast.net/~furmana/SWIEAutomation.htm) (SWEA). SWEA
creates an object model (automation interface) for any Web application
running in Internet Explorer. It supports all IE functionality: frames,
JavaScript, dialogs, downloads.

The runtime can also work under non-interactive user accounts
(ASP.NET or service applications) on Windows 2000/2003 Server or Windows
XP.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing/scraping & python

2005-12-04 Thread gene tani

John J. Lee wrote:
> Sanjay Arora <[EMAIL PROTECTED]> writes:
>
> > We are looking to select the language & toolset more suitable for a
> > project that requires getting data from several web-sites in real-
> > time; html parsing/scraping. It would require full emulation of the
> > browser, including handling cookies, automated logins & following
> > multiple web-link paths. Multiple threading would be a plus but not a
> > requirement.
> [...]
>
> What's the application?
>
>
> John

I'll do your googling for you ;-p

(The topic guide needs to be updated for mechanize, pamie, beautiful
soup, clientTable, pullparser, etc.)
http://www.python.org/topics/web/HTML.html
http://blog.ianbicking.org/best-of-the-web-app-test-frameworks.html

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing/scraping & python

2005-12-04 Thread John J. Lee
Sanjay Arora <[EMAIL PROTECTED]> writes:

> We are looking to select the language & toolset more suitable for a
> project that requires getting data from several web-sites in real-
> time; html parsing/scraping. It would require full emulation of the
> browser, including handling cookies, automated logins & following
> multiple web-link paths. Multiple threading would be a plus but not a
> requirement.
[...]

What's the application?


John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing/scraping & python

2005-12-01 Thread Mike Meyer
"Fuzzyman" <[EMAIL PROTECTED]> writes:
> The standard library module for fetching HTML is urllib2.

Does urllib2 replace everything in urllib? I thought there was some
urllib functionality that urllib2 didn't do.

> There is a project called mechanize, built by John Lee on top of
> urllib2 and other standard modules.
> It will emulate a browsers behaviour - including history, cookies,
> basic authentication, etc.

urllib2 handles cookies and authentication. I use those features
daily. I'm not sure history would apply, unless you're also handling
javascript. Is there some other way to ask the browser to go back in
history?
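
A minimal sketch of that combination (cookies plus basic auth) with the
standard library; the URL, realm and credentials below are placeholders:

import cookielib, urllib2

jar = cookielib.CookieJar()                     # cookies are stored and resent
auth = urllib2.HTTPBasicAuthHandler()
auth.add_password(realm="Members", uri="http://www.example.com/",
                  user="me", passwd="secret")

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar), auth)
page = opener.open("http://www.example.com/members/").read()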

  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing/scraping & python

2005-12-01 Thread Fuzzyman
The standard library module for fetching HTML is urllib2.

The best module for scraping the HTML is BeautifulSoup.

There is a project called mechanize, built by John Lee on top of
urllib2 and other standard modules.

It will emulate a browser's behaviour - including history, cookies,
basic authentication, etc.
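
A rough sketch of the sort of thing it does (the URL, form layout and field
names here are placeholders):

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/login")
br.select_form(nr=0)            # first form on the page
br["username"] = "me"
br["password"] = "secret"
response = br.submit()          # cookies are carried along automatically
print response.geturl()
br.back()                       # history, like a browser's Back button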

There are several modules for automated form filling - FormEncode being
one.

All the best,


Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML parsing/scraping & python

2005-11-30 Thread Mike Meyer
Sanjay Arora <[EMAIL PROTECTED]> writes:

> We are looking to select the language & toolset more suitable for a
> project that requires getting data from several web-sites in real-
> time; html parsing/scraping. It would require full emulation of the
> browser, including handling cookies, automated logins & following
> multiple web-link paths. Multiple threading would be a plus but not a
> requirement.

Believe it or not, everything you ask for can be done by Python out of
the box. But there are limitations.

For one, the HTML parsing module that comes with Python doesn't handle
invalid HTML very well. Thanks to Netscape, invalid HTML is the rule
rather than the exception on the web. So you probably want to use a
third party module for that. I use BeautifulSoup, which handles XML,
HTML, has a *lovely* API (going from BeautifulSoup to DOM is always a
major disappointment), and works well with broken X/HTML.

That's sufficient for my needs, but I haven't been asked to do a lot of
automated form filling, so the facilities in the standard library work
for me. There are third party tools to help with that. I'm sure
someone will suggest them.

> Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
> Why Python?

Because it's beautiful. Seriously, Python code is very readable, by
design. Of course, some of the features that make that happen drive
some people crazy. If you're one of them, then Python isn't the
language for you.

   http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
-- 
http://mail.python.org/mailman/listinfo/python-list


HTML parsing/scraping & python

2005-11-30 Thread Sanjay Arora
We are looking to select the language & toolset more suitable for a
project that requires getting data from several web-sites in real-
time; html parsing/scraping. It would require full emulation of the
browser, including handling cookies, automated logins & following
multiple web-link paths. Multiple threading would be a plus but not a
requirement.

Some solutions were suggested:

Perl:

LWP::Simple
WWW::Mechanize
HTML::Parser

Curl & libcurl:

Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
Why Python?

Pointers to  various other tools & their comparisons  with python
solutions will be most appreciated. Anyone who is knowledgeable about
the application subject, please do share your knowledge to help us do
this right.

With best regards.
Sanjay.

-- 
http://mail.python.org/mailman/listinfo/python-list


html parsing

2005-03-13 Thread Suchitra



Hi all,
    
Please help me in parsing the html document and extracting the http links.
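
A minimal sketch with the standard library (Python 2 module names; the URL is
a placeholder, and the page is assumed to be reasonably well-formed HTML;
BeautifulSoup is the usual suggestion for messier pages):

import urllib2
from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    print value

page = urllib2.urlopen("http://www.python.org/").read()
LinkExtractor().feed(page)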
 
Thanks in advance!!
 
Suchitra
-- 
http://mail.python.org/mailman/listinfo/python-list