Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread A.T.Hofkamp
On 2008-01-23, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Hi, I am looking for a HTML parser who can parse a given page into
> a DOM tree, and can reconstruct the exact original html sources.

Why not keep a copy of the original data instead?

That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
to original source text.


sincerely,
Albert
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread kliu
On Jan 23, 7:39 pm, A.T.Hofkamp [EMAIL PROTECTED] wrote:
> On 2008-01-23, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

>> Hi, I am looking for a HTML parser who can parse a given page into
>> a DOM tree, and can reconstruct the exact original html sources.

> Why not keep a copy of the original data instead?

> That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
> to original source text.

> sincerely,
> Albert

Thank you for your reply, but what I really need is the mapping between
each DOM node and the corresponding original source segment.


Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread Paul Boddie
On 23 Jan, 14:20, kliu [EMAIL PROTECTED] wrote:

> Thank u for your reply. but what I really need is the mapping between
> each DOM nodes and the corresponding original source segment.

At the risk of promoting unfashionable DOM technologies, you can at
least serialise fragments of the DOM in libxml2dom [1]:

  import libxml2dom
  # Parse the page as HTML (html=1), then serialise the eighth <p> element.
  d = libxml2dom.parseURI("http://www.diveintopython.org/", html=1)
  print(d.xpath("//p")[7].toString())

Storage and retrieval of the original line and offset information may
be supported by libxml2, but such information isn't exposed by
libxml2dom.

Paul

[1] http://www.python.org/pypi/libxml2dom


Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread Stefan Behnel
Hi,

kliu wrote:
> what I really need is the mapping between each DOM nodes and
> the corresponding original source segment.

I don't think that will be easy to achieve. You could get away with a parser
that provides access to the position of an element in the source, and then map
changes back into the document. But that won't work well in the case where the
parser inserts or deletes content to fix up the structure.
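
As a rough illustration of a parser that exposes element positions, here is a
minimal sketch assuming Python 3's standard html.parser module, which is not
one of the tools discussed in this thread. The names PositionRecorder and
start_tag_spans are made up for the example. It records start tags only, and
because html.parser does not repair the structure, it says nothing about the
case where a fixing parser inserts or deletes content.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Record the source span of every start tag (hypothetical helper)."""

    def __init__(self, source):
        # convert_charrefs=False leaves character references such as &amp;
        # unconverted, so text is reported as it appears in the source.
        super().__init__(convert_charrefs=False)
        # Absolute offset at which each line begins. Assumption: the line
        # numbers reported by getpos() advance on "\n".
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.spans = []  # (tag, start, end); source[start:end] is the raw tag

    def handle_starttag(self, tag, attrs):
        lineno, column = self.getpos()  # 1-based line, 0-based column
        start = self.line_starts[lineno - 1] + column
        end = start + len(self.get_starttag_text())
        self.spans.append((tag, start, end))


def start_tag_spans(source):
    """Return (tag, start, end) for each start tag found in source."""
    parser = PositionRecorder(source)
    parser.feed(source)  # feed the whole document at once so offsets line up
    parser.close()
    return parser.spans
```

Slicing the original string with a recorded span gives back the tag exactly
as it was written, including its original case and attribute spelling.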

Anyway, the normal focus of broken HTML parsing is in fixing the source
document, not in writing out a broken document. Maybe we could help you better
if you explained what your actual intention is?

Stefan


Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread A.T.Hofkamp
On 2008-01-23, kliu [EMAIL PROTECTED] wrote:
> On Jan 23, 7:39 pm, A.T.Hofkamp [EMAIL PROTECTED] wrote:
>> On 2008-01-23, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

>>> Hi, I am looking for a HTML parser who can parse a given page into
>>> a DOM tree, and can reconstruct the exact original html sources.

>> Why not keep a copy of the original data instead?

>> That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
>> to original source text.

> Thank u for your reply. but what I really need is the mapping between
> each DOM nodes and
> the corresponding original source segment.

Why do you think there is a simple one-to-one relation between nodes in some
abstract DOM tree and pieces of source? For example, the outermost tag
<HTML>...</HTML> is not an explicit point in the tree. If it is, what piece
of source should be attached to it? Everything? Just the text before and after
it? If so, what about the source text of the second tag? Last but not least,
what do you intend to do with the source text before the <HTML> and after
the </HTML> tags?

In other words, you are going to have a huge problem deciding what
"corresponding original source segment" means for each tag. This is exactly the
reason why current tools do not do what you want.

If you really want this, you probably have to do it yourself mostly from
scratch (i.e. starting with a parsing framework and writing a custom parser
yourself). That usually boils down to attaching source text to tokens in the
lexical parsing phase. If you have a good understanding of the meaning of
"corresponding original source segment", AND you have perfect HTML, this is
doable, but not very nice.
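
Attaching source text to tokens in the lexical phase might look roughly like
this. It is a sketch under loud assumptions: the token grammar (TOKEN_RE) is a
deliberately crude, hypothetical one and not a real HTML lexer, and all names
are invented for the example. The only point it makes is that every character
of the input, including text before <HTML> and after </HTML>, ends up in
exactly one token, so joining the token spans gives back the original source.

```python
import re

# Hypothetical, crude grammar: a comment, something tag-like, a run of text,
# or a stray "<" that does not start anything tag-like.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^<>]*>|[^<]+|<", re.DOTALL)


def classify(text):
    """Give a token a rough kind based only on how its text starts and ends."""
    if text.startswith("<!--"):
        return "comment"
    if not (text.startswith("<") and text.endswith(">")):
        return "text"
    if text.startswith("</"):
        return "endtag"
    if text.startswith(("<!", "<?")):
        return "decl"
    return "starttag"


def tokenize(source):
    """Split source into (kind, start, end) tokens covering every character."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        tokens.append((classify(match.group()), match.start(), match.end()))
    return tokens


def reassemble(source, tokens):
    """Rebuild the document from token spans; should equal source exactly."""
    return "".join(source[start:end] for _, start, end in tokens)
```

Building a tree on top of such tokens, and deciding which spans belong to
which node, is the hard part described above and is not attempted here.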

There exist parsers that can do what you want IF YOU HAVE PERFECT HTML, but
using those tools implies a very steep learning curve of about 2-3 months,
assuming that you know functional languages (if you don't, add another 2-3
months or so of steep learning curve :) ).


If you don't have perfect HTML, you are probably more or less lost. Most tools
cannot deal with that situation, and those that can, do smart re-shuffling to
make things parsable, which means there is really no one-to-one mapping any
more (after re-shuffling).


In other words, I think you really don't want what you want, at least not in
the way that you consider now.


Please give us information about your goal, so we can think about alternative
approaches to solve your problem.

sincerely,
Albert



Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread Fuzzyman


[EMAIL PROTECTED] wrote:
> Hi, I am looking for a HTML parser who can parse a given page into
> a DOM tree, and can reconstruct the exact original html sources.
> Strictly speaking, I should be allowed to retrieve the original
> sources at each internal nodes of the DOM tree.
> I have tried Beautiful Soup who is really nice when dealing with
> those god damned ill-formed documents, but it's a pity for me to find
> that this guy cannot retrieve original sources due to its great tidy
> job.
> Since Beautiful Soup, like most of the other HTML parsers in
> python, is a subclass of sgmllib.SGMLParser to some extent, I have
> investigated the source code of sgmllib.SGMLParser, see if there is
> anything I can do to tell Beautiful Soup where he can find every tag
> segment from HTML source, but this will be a time-consuming job.
> so... any ideas?



A while ago I had a similar need, but my solution may not solve your
problem.

I wanted to rewrite URLs contained in links, images, etc., but not
modify any of the rest of the HTML. I created an HTML parser (based on
sgmllib) that fires callbacks as it encounters tags, attributes, etc.

It is easy to process a stream without 'damaging' the beautiful
original structure of crap HTML - but it doesn't provide a DOM.


http://www.voidspace.org.uk/python/recipebook.shtml#scraper
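
The recipe linked above is based on sgmllib. What follows is not that recipe
but a minimal sketch of the same stream-style idea, assuming Python 3's
standard html.parser instead. The names URL_ATTR_RE, UrlSpliceCollector and
rewrite_urls are hypothetical, and the rewrite callback receives the attribute
value exactly as written in the source (entities are not decoded). Only the
href/src values are spliced; every other character is copied through untouched.

```python
import re
from html.parser import HTMLParser

# Matches href/src attributes inside one raw start tag; group 3 is the value,
# either quoted or bare. A crude pattern for illustration only.
URL_ATTR_RE = re.compile(
    r"""(?<![\w-])(href|src)(\s*=\s*)("[^"]*"|'[^']*'|[^\s>"']+)""",
    re.IGNORECASE,
)


class UrlSpliceCollector(HTMLParser):
    """Collect (start, end, replacement) splices for URL attribute values."""

    def __init__(self, source, rewrite):
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite  # hypothetical callback: old URL -> new URL
        # Assumption: getpos() line numbers advance on "\n".
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.splices = []

    def handle_starttag(self, tag, attrs):
        lineno, column = self.getpos()
        tag_start = self.line_starts[lineno - 1] + column
        raw = self.get_starttag_text()  # the tag exactly as written
        for match in URL_ATTR_RE.finditer(raw):
            start, end = match.span(3)
            if raw[start] in "\"'":  # keep the original quotes in place
                start, end = start + 1, end - 1
            if start < end:
                new_value = self.rewrite(raw[start:end])
                self.splices.append((tag_start + start, tag_start + end, new_value))


def rewrite_urls(source, rewrite):
    """Return source with href/src values rewritten and nothing else changed."""
    parser = UrlSpliceCollector(source, rewrite)
    parser.feed(source)
    parser.close()
    pieces, last = [], 0
    for start, end, new_value in parser.splices:
        pieces.append(source[last:start])
        pieces.append(new_value)
        last = end
    pieces.append(source[last:])
    return "".join(pieces)
```

Like the original approach, this gives no DOM; it only preserves the
surrounding markup, broken or not, while the URLs change.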

All the best,

Michael Foord
http://www.manning.com/foord


> cheers
> kai liu


Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-22 Thread ioscas
Hi, I am looking for an HTML parser that can parse a given page into
a DOM tree and can reconstruct the exact original HTML source.
Strictly speaking, I should be able to retrieve the original
source at each internal node of the DOM tree.
I have tried Beautiful Soup, which is really nice when dealing with
those god damned ill-formed documents, but it's a pity that it
cannot retrieve the original source because of its great tidying
job.
Since Beautiful Soup, like most of the other HTML parsers in
Python, is a subclass of sgmllib.SGMLParser to some extent, I have
investigated the source code of sgmllib.SGMLParser to see if there is
anything I can do to tell Beautiful Soup where it can find every tag
segment in the HTML source, but this will be a time-consuming job.
So... any ideas?


cheers
kai liu