On 2 July 2013 18:43, wrote:
> I could not use BeautifulSoup as I did not find an .exe file.
Were you perhaps looking for a .exe file to install BeautifulSoup?
It's quite plausible that a Windows user like you might be puzzled by
the idea of a .tar.gz.
I suggest just using "pip install beautifulsoup4".
On Tue, 02 Jul 2013 10:43:03 -0700, subhabangalore wrote:
> I could not use BeautifulSoup as I did not find an .exe file.
I believe that BeautifulSoup is a pure-Python module, and so does not
have a .exe file. However, it does have good tutorials:
https://duckduckgo.com/html/?q=beautifulsoup+tu
On 2013-07-02, subhabangal...@gmail.com wrote:
> Dear Group,
>
> I was looking for a good tutorial for a "HTML Parser". My
> intention was to extract tables from web pages or information
> from tables in web pages.
>
> I tried to make a search, I got HTMLParser, BeautifulSoup, etc.
Dear Group,
I was looking for a good tutorial for a "HTML Parser". My intention was to
extract tables from web pages or information from tables in web pages.
I tried to make a search, I got HTMLParser, BeautifulSoup, etc. HTMLParser
works fine for me, but I am looking for a good tutorial.
I want to thank everyone for the help, which I found very useful (the
parts that I understood :-) ).
Since I think there was some question, it happens that I am working
under django and submitting a certain form triggers an html mail. I
wanted to validate the html in some of my unit tests. It is
In message <4b712919$0$6584$9b4e6...@newsspool3.arcor-online.net>, Stefan
Behnel wrote:
> Usually PyPI.
Where do you think these tools come from? They don’t write themselves, you
know.
--
http://mail.python.org/mailman/listinfo/python-list
Lawrence D'Oliveiro, 08.02.2010 22:39:
> In message <4b6fe93d$0$6724$9b4e6...@newsspool2.arcor-online.net>, Stefan
> Behnel wrote:
>
>> - generating HTML using a tool that guarantees correct HTML output
>
> Where do you think these tools come from?
Usually PyPI.
Stefan
and the tweak is:
    parser = etree.HTMLParser(recover=False)
    return etree.HTML(xml, parser)
That reduces tolerance. The entire assert_xml() is (apologies for
wrapping lines!):
def _xml_to_tree(self, xml):
    from lxml import etree
    self._xml = xml
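The tweak above needs lxml installed. Purely for illustration, a stdlib-only sketch of the same kind of strictness check (all the names here are mine, not from the thread): reject HTML whose non-void tags are not properly balanced.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial list, for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.errors = []     # mismatched close tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append("unexpected </%s>" % tag)

def check_balance(html):
    """Return a list of balance problems; empty means the tags nest cleanly."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.errors + ["unclosed <%s>" % t for t in checker.stack]
```

In a unit test you would then assert that check_balance() of the generated mail body returns an empty list.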
Stefan Behnel wrote:
> I don't read it that way. There's a huge difference between
>
> - generating HTML manually and validating (some of) it in a unit test
>
> and
>
> - generating HTML using a tool that guarantees correct HTML output
>
> the advantage of the second approach being that others hav
Lawrence D'Oliveiro, 08.02.2010 11:19:
> In message <4b6fd672$0$6734$9b4e6...@newsspool2.arcor-online.net>, Stefan
> Behnel wrote:
>
>> Jim, 06.02.2010 20:09:
>>
>>> I generate some HTML and I want to include in my unit tests a check
>>> for syntax. So I am looking for a program that will complain at any
>>> syntax irregularities.
In message <4b6fd672$0$6734$9b4e6...@newsspool2.arcor-online.net>, Stefan
Behnel wrote:
> Jim, 06.02.2010 20:09:
>
>> I generate some HTML and I want to include in my unit tests a check
>> for syntax. So I am looking for a program that will complain at any
>> syntax irregularities.
>
> First thing to note here is that you should consider switching to an HTML
> generation tool that does this automatically.
Jim, 06.02.2010 20:09:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
First thing to note here is that you should consider switching to an HTML
generation tool that does this automatically.
On Sat, 06 Feb 2010 11:09:31 -0800, Jim wrote:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
>
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax.
Thank you, John. I did not find that by looking around; I must not
have used the right words. The speed of the unit tests is not
critical, so this seems like the solution for me.
Jim
Jim wrote:
I generate some HTML and I want to include in my unit tests a check
for syntax. So I am looking for a program that will complain at any
syntax irregularities.
I am familiar with Beautiful Soup (use it all the time) but it is
intended to cope with bad syntax. I just tried feeding
HTMLParser.HTMLParser
I generate some HTML and I want to include in my unit tests a check
for syntax. So I am looking for a program that will complain at any
syntax irregularities.
I am familiar with Beautiful Soup (use it all the time) but it is
intended to cope with bad syntax. I just tried feeding
HTMLParser.HTMLParser
"Robert" wrote in message
news:hk729b$na...@news.albasani.net...
> Stefan Behnel wrote:
>> Robert, 01.02.2010 14:36:
>>> Stefan Behnel wrote:
Robert, 31.01.2010 20:57:
> I tried lxml, but after walking and making changes in the element tree,
> I'm forced to do a full serialization of the whole document
> (etree.tostring(tree)) - which destroys the "human edited" format of the
> original HTML code.
> makes it rather unreadable.
>
> is there an existing HTML parser which supports tracking/writing
> back particular changes in a cautious way by just making local
> changes? or a least tracks the tag start/end positions in the file?
HTMLParser, sgmllib.SGMLParser
Robert wrote:
> I think you confused the logical level of what I meant with "file
> position":
> Of course its not about (necessarily) writing back to the same open file
> (OS-level), but regarding the whole serializiation string (wherever it
> is finally written to - I typically write the auto-con
Stefan Behnel wrote:
Robert, 01.02.2010 14:36:
Stefan Behnel wrote:
Robert, 31.01.2010 20:57:
I tried lxml, but after walking and making changes in the element tree,
I'm forced to do a full serialization of the whole document
(etree.tostring(tree)) - which destroys the "human edited" format of
Robert, 01.02.2010 14:36:
> Stefan Behnel wrote:
>> Robert, 31.01.2010 20:57:
>>> I tried lxml, but after walking and making changes in the element tree,
>>> I'm forced to do a full serialization of the whole document
>>> (etree.tostring(tree)) - which destroys the "human edited" format of the
>>>
Robert wrote:
Stefan Behnel wrote:
Robert, 31.01.2010 20:57:
I tried lxml, but after walking and making changes in the element tree,
I'm forced to do a full serialization of the whole document
(etree.tostring(tree)) - which destroys the "human edited" format of the
original HTML code. makes it rather unreadable.
Stefan Behnel wrote:
Robert, 31.01.2010 20:57:
I tried lxml, but after walking and making changes in the element tree,
I'm forced to do a full serialization of the whole document
(etree.tostring(tree)) - which destroys the "human edited" format of the
original HTML code. makes it rather unreadable.
Robert, 31.01.2010 20:57:
> I tried lxml, but after walking and making changes in the element tree,
> I'm forced to do a full serialization of the whole document
> (etree.tostring(tree)) - which destroys the "human edited" format of the
> original HTML code. makes it rather unreadable.
What do you
I tried lxml, but after walking and making changes in the element
tree, I'm forced to do a full serialization of the whole document
(etree.tostring(tree)) - which destroys the "human edited" format
of the original HTML code.
makes it rather unreadable.
is there an existing HTML parser which supports tracking/writing back
particular changes in a cautious way by just making local changes?
Johannes Bauer wrote:
Hello group,
I'm trying to use a htmllib.HTMLParser derivate class to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
code is something like
On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote:
> Terry Reedy schrieb:
>> I believe you are confusing unicode with unicode encoded into bytes
>> with the UTF-8 encoding. Having a problem feeding a unicode string,
>> not 'UFT-8 code', which in Python can only mean a UTF-8 encoded byte
>>
Johannes Bauer wrote:
Terry Reedy schrieb:
Johannes Bauer wrote:
Hello group,
I'm trying to use a htmllib.HTMLParser derivate class to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes.
On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer <[EMAIL PROTECTED]> wrote:
> Hello group,
>
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
>
> website = website.replace(
ordinal not in range(128)
>
> When you do not bother to specify some other encoding in an encoding
> operation, sgmllib or something deeper in Python tries the default
> encoding, which does not work. Stop being annoyed and tell the
> interpreter what you want. It is not a mind-reader.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
ordinal not in range(128)
When you do not bother to specify some other encoding in an encoding
operation, sgmllib or something deeper in Python tries the default
encoding, which does not work. Stop being annoyed and tell the
interpreter what you want. It is not a mind-reader.
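In Python 3 terms (the thread is Python 2 era), the advice above amounts to: decode the fetched bytes explicitly before doing any string work, instead of letting the default codec guess. A minimal sketch, with made-up sample bytes:

```python
# Hypothetical sample: UTF-8 bytes as read() would return them.
raw = b"Gr\xc3\xbc\xc3\x9fe"

# Explicit decode: now we have a text string, and no ASCII codec is involved.
website = raw.decode("utf-8")

# String-level modifications now work on characters, not bytes.
website = website.replace("\u00fc", "ue").replace("\u00df", "ss")
```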
  ... in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
ordinal not in range(128)
Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
input - which should (again, IMHO) be the absolute standard for s
Chris wrote:
> Can anyone recommend a good HTML/XHTML parser, similar to
> HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
> know that certain tags, like <p>, are implicitly closed? I need to
> iterate through the entire DOM, building up a DOM path, but the stdlib
> parsers aren't calling handle_endtag
Chris wrote:
> Can anyone recommend a good HTML/XHTML parser, similar to
> HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
> know that certain tags, like <p>, are implicitly closed? I need to
> iterate through the entire DOM, building up a DOM path, but the stdlib
> parsers aren't calling handle_endtag
Can anyone recommend a good HTML/XHTML parser, similar to
HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
know that certain tags, like <p>, are implicitly closed? I need to
iterate through the entire DOM, building up a DOM path, but the stdlib
parsers aren't calling handle_endtag
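A full solution to this is html5lib or lxml.html, which implement the real implied-end-tag rules. As a rough stdlib-only sketch (a toy rule of my own, nothing like the actual HTML5 algorithm), one can synthesize the missing end tag and track a DOM path like this:

```python
from html.parser import HTMLParser

# Toy rule: a new <p> or <li> implicitly closes an open sibling of the
# same tag. Real HTML has many more such rules.
IMPLICIT = {"p", "li"}

class PathParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.path = []    # currently open tags
        self.paths = []   # one "/a/b/c"-style path per start tag seen

    def handle_starttag(self, tag, attrs):
        if tag in IMPLICIT and self.path and self.path[-1] == tag:
            self.handle_endtag(tag)   # synthesize the missing close
        self.path.append(tag)
        self.paths.append("/" + "/".join(self.path))

    def handle_endtag(self, tag):
        # Pop back to (and including) the matching open tag, if any.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

p = PathParser()
p.feed("<ul><li>one<li>two</ul>")   # second <li> has no explicit close
```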
[EMAIL PROTECTED] wrote:
> Hi, I am looking for a HTML parser who can parse a given page into
> a DOM tree, and can reconstruct the exact original html sources.
> Strictly speaking, I should be allowed to retrieve the original
> sources at each internal nodes of the DOM tree.
>
On 2008-01-23, kliu <[EMAIL PROTECTED]> wrote:
> On Jan 23, 7:39 pm, "A.T.Hofkamp" <[EMAIL PROTECTED]> wrote:
>> On 2008-01-23, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>
>> > Hi, I am looking for a HTML parser who can parse a given page into
>> > a DOM tree, and can reconstruct the exact original html sources.
Hi,
kliu wrote:
> what I really need is the mapping between each DOM nodes and
> the corresponding original source segment.
I don't think that will be easy to achieve. You could get away with a parser
that provides access to the position of an element in the source, and then map
changes back into
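In Python 3 terms, even the stdlib parser can report source positions: HTMLParser.getpos() returns the (line, offset) of the construct currently being handled, which is the starting point for mapping changes back to the original text. A sketch (the class and names are mine):

```python
from html.parser import HTMLParser

class PositionRecorder(HTMLParser):
    """Record the (line, column) at which each start tag begins."""
    def __init__(self):
        super().__init__()
        self.positions = {}   # tag name -> list of (line, col), 1-based lines

    def handle_starttag(self, tag, attrs):
        # getpos() inside the callback reports where this tag starts.
        self.positions.setdefault(tag, []).append(self.getpos())

rec = PositionRecorder()
rec.feed("<html>\n<body>\n<p>hi</p>\n</body></html>")
```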
On 23 Jan, 14:20, kliu <[EMAIL PROTECTED]> wrote:
>
> Thank u for your reply. but what I really need is the mapping between
> each DOM nodes and the corresponding original source segment.
At the risk of promoting unfashionable DOM technologies, you can at
least serialise fragments of the DOM in li
On Jan 23, 7:39 pm, "A.T.Hofkamp" <[EMAIL PROTECTED]> wrote:
> On 2008-01-23, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am looking for a HTML parser who can parse a given page into
> > a DOM tree, and can reconstruct the exact original
On 2008-01-23, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi, I am looking for a HTML parser who can parse a given page into
> a DOM tree, and can reconstruct the exact original html sources.
Why not keep a copy of the original data instead?
That would be VERY MUCH SIMPLER
Hi, I am looking for an HTML parser that can parse a given page into
a DOM tree, and can reconstruct the exact original HTML source.
Strictly speaking, I should be able to retrieve the original
source at each internal node of the DOM tree.
I have tried Beautiful Soup, which is really
On Oct 17, 9:50 am, Carsten Haese <[EMAIL PROTECTED]> wrote:
> Recent releases of BeautifulSoup need Python 2.3+, so they won't work on
> current Jython, but BeautifulSoup 1.x will work.
Thank you.
> Does anybody know of a decent HTML parser for Jython? I have to do
> some screen scraping, and would rather use a tested module instead of
> rolling my own.
GIYF[0][1]
There are the batteries-included HTMLParser[2] and htmllib[3]
modules, and the ever-popular (and more developer
On Wed, 2007-10-17 at 17:36 +0200, Stefan Behnel wrote:
> Falcolas wrote:
> > Does anybody know of a decent HTML parser for Jython? I have to do
> > some screen scraping, and would rather use a tested module instead of
> > rolling my own.
>
> Not sure if it works, but have you tried BeautifulSoup? Or maybe an older
> version of it?
Falcolas wrote:
> Does anybody know of a decent HTML parser for Jython? I have to do
> some screen scraping, and would rather use a tested module instead of
> rolling my own.
Not sure if it works, but have you tried BeautifulSoup? Or maybe an older
version of it?
Stefan
Does anybody know of a decent HTML parser for Jython? I have to do
some screen scraping, and would rather use a tested module instead of
rolling my own.
Thanks!
GP
In article <[EMAIL PROTECTED]>,
Stephen R Laniel <[EMAIL PROTECTED]> wrote:
> On Fri, Jun 15, 2007 at 07:11:56AM -0700, HMS Surprise wrote:
> > Could you recommend an html parser that works with python (jython
> > 2.2)?
>
> I'm new here, but I believe BeautifulSoup is the canonical answer:
> http://www.crummy.com/software/BeautifulSoup/
On Jun 15, 7:11 am, HMS Surprise <[EMAIL PROTECTED]> wrote:
> Could you recommend an html parser that works with python (jython
> 2.2)? HTMLParser does not seem to be in this library. To test some
> of our browser based (mainly php) code I search for field names and
> the values associated with them.
Thanks,
jh
On Fri, Jun 15, 2007 at 07:11:56AM -0700, HMS Surprise wrote:
> Could you recommend an html parser that works with python (jython
> 2.2)?
I'm new here, but I believe BeautifulSoup is the canonical
answer:
http://www.crummy.com/software/BeautifulSoup/
--
Stephen R. Laniel
[EMAIL PROTECTED]
Could you recommend an html parser that works with python (jython
2.2)? HTMLParser does not seem to be in this library. To test some
of our browser based (mainly php) code I search for field names and
the values associated with them.
Thanks,
jh
Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/
Works, well...beautifully.
On Apr 6, 1:05 pm, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Is there a HTML parser (not xml) in python?
> I need a html parser which has the ability to handle mal-format html
> pages.
>
> Thank you.
Yeah...it's called Beautiful Soup.
http://www.crummy.com/software/BeautifulSoup/
Hi,
Is there an HTML parser (not XML) in Python?
I need an HTML parser that can handle malformed HTML
pages.
Thank you.
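As the replies say, Beautiful Soup is built for exactly this. It is worth noting that in Python 3 the stdlib html.parser also never raises on malformed markup; it simply reports what it can recover. A small sketch of my own to illustrate:

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Log start/end tag events; tolerant of malformed input."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

t = TagLogger()
# Malformed on purpose: <b> and <i> are never closed. No exception is raised;
# the parser just reports the tags it actually saw.
t.feed("<p><b>unclosed <i>mess</p>")
```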
[EMAIL PROTECTED] wrote:
> I have created an example using libxml2 based in the code that appears
> in http://xmlsoft.org/python.html.
> My example processes an enough amount of html files to see that the
> memory consumption rises till the process ends (I check it with the
> 'top' command).
Try
Hi all,
I have created an example using libxml2, based on the code that appears
at http://xmlsoft.org/python.html.
My example processes enough HTML files to see that the memory
consumption rises until the process ends (I check it with the
'top' command).
I don't know if I am forgetting
Agreed that the web sites are probably broken. Try running the HTML
through HTMLTidy (http://tidy.sourceforge.net/). Doing that has allowed
me to parse where I had problems such as yours.
I have also had luck with BeautifulSoup, which also includes a tidy
function in it.
Just Another Victim of t
Fredrik Lundh wrote:
> > Except it appears to be buggy or, at least, not very robust. There are
> > websites for which it falsely terminates early in the parsing.
>
> which probably means that the sites are broken. the amount of broken
> HTML on the net is staggering, as is the amount of code in a typical web
> browser
> Except it appears to be buggy or, at least, not very robust. There are
> websites for which it falsely terminates early in the parsing.
which probably means that the sites are broken. the amount of broken
HTML on the net is staggering, as is the amount of code in a typical web
browser
"Just Another Victim of the Ambient Morality" <[EMAIL PROTECTED]> wrote
in message news:[EMAIL PROTECTED]
>
>Okay, I think I found what I'm looking for in HTMLParser in the
> HTMLParser module.
Except it appears to be buggy or, at least, not very robust. There are
websites for which it falsely terminates early in the parsing.
"Just Another Victim of the Ambient Morality" <[EMAIL PROTECTED]> wrote
in message news:[EMAIL PROTECTED]
>I'm trying to parse HTML in a very generic way.
>So far, I'm using SGMLParser in the sgmllib module. The problem is
> that it forces you to parse very specific tags through object methods
> like start_a(), start_p() and the like
I'm trying to parse HTML in a very generic way.
So far, I'm using SGMLParser in the sgmllib module. The problem is that
it forces you to parse very specific tags through object methods like
start_a(), start_p() and the like, forcing you to know exactly which tags
you want to handle. I
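In Python 3 terms, the stdlib html.parser already does the generic thing being asked for: every tag is routed through a single handle_starttag callback instead of per-tag start_a()/start_p() methods, so you need not know the tag names in advance. A sketch (the class is mine):

```python
from html.parser import HTMLParser

class GenericParser(HTMLParser):
    """Collect every tag generically, with its attributes as a dict."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # One callback for all tags; no start_a()/start_p() needed.
        self.tags.append((tag, dict(attrs)))

gp = GenericParser()
gp.feed('<p class="x">hi <a href="/y">link</a></p>')
```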
Oops!
You are totally right, guys. I did miss the closing '>' while thinking
about possible errors in the use of ' or ".
Jesus
Tim Roberts wrote:
>"Jesus Rivero - (Neurogeek)" <[EMAIL PROTECTED]> wrote:
>
>
>>hmmm, that's kind of different issue then.
>>
>>I can guess, from the error you pasted earlier
Thanks for the suggestions.
This is not happening frequently; actually, this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formatted HTML.
I guess the best way would be to check using a regular expression and
delete the unclosed tags.
"Jesus Rivero - (Neurogeek)" <[EMAIL PROTECTED]> wrote:
>
>hmmm, that's kind of different issue then.
>
>I can guess, from the error you pasted earlier, that the problem shown
>is due to the fact Python is interpreting a "<" as an expression and not
>as a char. review your code or try to figure out
hmmm, that's kind of different issue then.
I can guess, from the error you pasted earlier, that the problem shown
is due to the fact Python is interpreting a "<" as an expression and not
as a char. review your code or try to figure out the exact input
thanks for the reply
well probabbly I should explain more. this is part of an email . after
the mta delivers the email, it is stored in a local dir.
After that the email is being parsed by the parser inside an web based
imap client at display time.
I dont think I have the choice of rewriting the
Sakcee wrote:
> html =
> ' \r\n Foo foo , blah blah
> '
>
>
html =
"""
Foo foo , blah blah
"""
Try checking your HTML code. It looks really messy.
html =
' \r\n Foo foo , blah blah
'
>>> import htmllib
>>> import formatter
>>> parser = htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File
Hi,
I am parsing html documents using the html parser from libxml2, and if
the encoding is included in the document it works perfectly but if it
is not, I think it does not work well (probably because I am doing
something wrong).
As it is said in http://xmlsoft.org/encoding.html the parser
To extract links without the overhead of Beautiful Soup, one option is
to copy what Beautiful Soup does, and write a SGMLParser subclass that
only looks at 'a' tags. In general I think writing SGMLParser
subclasses is a big pain (which is why I wrote Beautiful Soup), but
since you only care about one tag
Thorsten Kampe wrote:
>* Christoph Söllner (2005-10-18 12:20 +0100)
>
>
>>right, that's what I was looking for. Thanks very much.
>>
>>
>
>For simple things like that "BeautifulSoup" might be overkill.
>
>import formatter, \
> htmllib, \
> urllib
>
>url = 'http://python.org'
Thorsten Kampe wrote:
> For simple things like that "BeautifulSoup" might be overkill.
[HTMLParser example]
I've used SGMLParser with some success before, although the SAX-style
processing is objectionable to many people. One alternative is to use
libxml2dom [1] and to parse documents as HTML:
i
* Christoph Söllner (2005-10-18 12:20 +0100)
> right, that's what I was looking for. Thanks very much.
For simple things like that "BeautifulSoup" might be overkill.
import formatter, \
htmllib, \
urllib
url = 'http://python.org'
htmlp = htmllib.HTMLParser(formatter.NullFormatter())
right, that's what I was looking for. Thanks very much.
Christoph Söllner wrote:
>Hi *,
>
>is there a html parser available, which could i.e. extract all links from a
>given text like that:
>"""
>BAR
>BAR2
>"""
>
>and return a set of dicts like that:
>"""
>{
> [
Hi *,
is there a html parser available, which could i.e. extract all links from a
given text like that:
"""
BAR
BAR2
"""
and return a set of dicts like that:
"""
{
['foo.php','BAR','param1','test'],
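A sketch of the link extraction being asked for, using Python 3's html.parser. The sample markup and the dict shape are my guesses at what the poster wants; adjust to taste:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect each <a> tag's href and its text content."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"href": self._href,
                               "text": "".join(self._text)})
            self._href = None

ex = LinkExtractor()
ex.feed('<a href="foo.php?param1=test">BAR</a> '
        '<a href="foo2.php?param1=test">BAR2</a>')
```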
http://www.crummy.com/software/BeautifulSoup/
On Monday 15 August 2005 03:33 pm, BRA_MIK wrote:
> I'm looking for a good and robust html parser that could parse complex
> html/xhtml document without crashing (possibly free)
>
> Could you help me please ?
>
> TIA
> MB
I'm looking for a good and robust HTML parser that can parse complex
html/xhtml documents without crashing (possibly free).
Could you help me please ?
TIA
MB
Hi,
I have an HTML page that displays some content, and a part of that
content is HTML changed into regular text. The encoding of the page
is UTF-8.
Here's the code that makes the change (the HTML in self.contents is
UTF-8 encoded):
file = cStringIO.StringIO()
parser = htmllib.HTMLParser(format
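Note that htmllib, formatter and cStringIO are all gone in Python 3. A rough modern equivalent of the HTML-to-text idiom sketched above, using html.parser (the class and function names are mine):

```python
from html.parser import HTMLParser
from io import StringIO

class TextDumper(HTMLParser):
    """Strip markup by collecting only the data (text) nodes."""
    def __init__(self):
        super().__init__()
        self.out = StringIO()

    def handle_data(self, data):
        self.out.write(data)

def html_to_text(html):
    d = TextDumper()
    d.feed(html)
    d.close()
    return d.out.getvalue()
```

Since Python 3 strings are Unicode throughout, the UTF-8 handling that complicates the original snippet only matters at the decode/encode boundaries.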