Re: HTML Parser which allows low-keyed local changes?

2010-02-01 Thread Stefan Behnel
Robert, 31.01.2010 20:57:
> I tried lxml, but after walking and making changes in the element tree,
> I'm forced to do a full serialization of the whole document
> (etree.tostring(tree)) - which destroys the human-edited format of the
> original HTML code. It makes it rather unreadable.

What do you mean? Could you give an example? lxml certainly does not
destroy anything it parsed, unless you tell it to do so.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Robert

Stefan Behnel wrote:
> Robert, 31.01.2010 20:57:
>> I tried lxml, but after walking and making changes in the element tree,
>> I'm forced to do a full serialization of the whole document
>> (etree.tostring(tree)) - which destroys the human edited format of the
>> original HTML code. makes it rather unreadable.
>
> What do you mean? Could you give an example? lxml certainly does not
> destroy anything it parsed, unless you tell it to do so.

Of course it does not destroy anything during parsing (?).

I mean: I want to walk through the parsed HTML tree with a Python 
script and modify things here and there (automatic alt attributes 
from a DB or similar, link corrections, text sections/translated 
sentences, ... as a result of HTML code and content checks).


Then I want to output the changed tree, but as close to the 
original format as possible. No changes to my whitespace 
indentation, etc. Only local changes, where tags were really 
changed.


That is similar to what a good HTML editor does: after you make 
small changes, it doesn't reformat and re-emit your whole code 
layout from the tree/attribute logic alone. You get local changes 
only.
But a simple HTML editor like the one in Mozilla SeaMonkey outputs 
completely new HTML, generated from the logical tree only (in its 
own (ugly) style). It destroys my whitespace layout and much more, 
forgetting everything about the original layout.


Such a good HTML editor must somehow track the original 
positions of the tags in the file. And with each logical change 
in the tree it must track the resulting file position 
changes/offsets. That capability seems to be missing in lxml and 
BeautifulSoup, which are what I have tried so far.


This is a frequent need I have. Nobody else's?

Seems I need to write my own or patch BS to do that extra tracking?


Robert


Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Robert

Basic feature(s) such a parser would perhaps need:

* Can it tell, for each tag object in the parsed tree, at what 
  original file position (start:end) it resided? Or, even more 
  basic: can it tell me the line number (e.g. for warning/analysis 
  reports)?

* (Optional) Do the tree objects automatically track whether they 
  were changed? (For convenience only; comparing against a copy of 
  the tree may serve this otherwise.)

Creating output with only local changes would be rather simple 
from that ...
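
The first feature in the list above can be approximated with the standard library tokenizer. The following is only a minimal sketch, assuming Python 3 and `html.parser`; the class name `TagLocator` and the sample markup are made up for illustration, and only start tags are located.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each start tag sits in the original source string."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Absolute offset of the first character of every line.
        # Only "\n" is counted, matching how HTMLParser counts lines.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.tags = []  # (tag name, line number, start offset, end offset)
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        line, column = self.getpos()  # 1-based line, 0-based column
        start = self.line_starts[line - 1] + column
        end = start + len(self.get_starttag_text())
        self.tags.append((tag, line, start, end))


source = '<ul>\n  <li><img src="a.png"></li>\n</ul>\n'
locator = TagLocator(source)
for tag, line, start, end in locator.tags:
    # source[start:end] is the untouched original text of the tag.
    print(tag, line, (start, end), repr(source[start:end]))
```

The recorded offsets index into the unmodified source string, so a report can quote the original text and line number of any tag without serializing anything.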



Robert


Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Stefan Behnel
Robert, 01.02.2010 14:36:
> Stefan Behnel wrote:
>> Robert, 31.01.2010 20:57:
>>> I tried lxml, but after walking and making changes in the element tree,
>>> I'm forced to do a full serialization of the whole document
>>> (etree.tostring(tree)) - which destroys the human-edited format of the
>>> original HTML code. It makes it rather unreadable.
>>
>> What do you mean? Could you give an example? lxml certainly does not
>> destroy anything it parsed, unless you tell it to do so.
>
> Of course it does not destroy anything during parsing (?).

I meant "parsed" in the sense of "has parsed and is now working on".


> I mean: I want to walk through the parsed HTML tree with a Python script
> and modify things here and there (automatic alt attributes from a DB or
> similar, link corrections, text sections/translated sentences, ... as a
> result of HTML code and content checks).

Sure, perfectly valid use case.


> Then I want to output the changed tree, but as close to the original
> format as possible. No changes to my whitespace indentation, etc.
> Only local changes, where tags were really changed.

That's up to you. If you only apply local changes that do not change any
surrounding whitespace, you'll be fine.


> That is similar to what a good HTML editor does: after you make small
> changes, it doesn't reformat and re-emit your whole code layout from
> the tree/attribute logic alone. You get local changes only.

HTML editors don't work that way. They always re-spit-out the whole code
when you click on save. They certainly don't track the original file
position of tags. What they preserve is the content, including whitespace
(or not, if they reformat the code, but that's usually an *option*).


> Such a good HTML editor must somehow track the original positions of
> the tags in the file. And with each logical change in the tree it must
> track the resulting file position changes/offsets.

Sorry, but that's nonsense. The file position of a tag is determined by
whitespace, i.e. line endings and indentation. lxml does not alter that,
unless you tell it to do so.

Since you keep claiming that it *does* alter it, please come up with a
reproducible example that shows a) what you do in your code, b) what your
input is and c) what unexpected output it creates. Do not forget to include
the version number of lxml and libxml2 that you are using, as well as a
comment on /how/ the output differs from what you expected.

My stab in the dark is that you forgot to copy the tail text of elements
that you replace by new content, and that you didn't properly indent new
content that you added. But that's just that, a stab in the dark. You
didn't provide enough information for even an educated guess.
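
To illustrate the tail-text guess: in the ElementTree data model that lxml follows, the text after an element's closing tag is stored on that element as `.tail`, so replacing the element drops that text unless it is copied over. The sketch below is an assumption-laden stand-in, not Robert's code: it uses the standard library's `xml.etree.ElementTree` on a small well-formed fragment instead of lxml, and the markup is invented.

```python
import xml.etree.ElementTree as ET

source = "<p>\n  <b>old</b> text after\n</p>"
root = ET.fromstring(source)

old = root[0]
new = ET.Element("i")
new.text = "new"
# The text " text after\n" belongs to <b> as its tail. Without this
# line it would disappear together with the replaced element.
new.tail = old.tail
root[0] = new

result = ET.tostring(root, encoding="unicode")
print(result)
```

With the tail copied, the serialized result differs from the input only in the replaced element; the surrounding line breaks and indentation are carried through unchanged.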

Stefan


Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Robert

Stefan Behnel wrote:

> Sorry, but that's nonsense. The file position of a tag is determined by
> whitespace, i.e. line endings and indentation. lxml does not alter that,
> unless you tell it to do so.
>
> Since you keep claiming that it *does* alter it, please come up with a
> reproducible example that shows a) what you do in your code, b) what your
> input is and c) what unexpected output it creates. Do not forget to include
> the version number of lxml and libxml2 that you are using, as well as a
> comment on /how/ the output differs from what you expected.
>
> My stab in the dark is that you forgot to copy the tail text of elements
> that you replace by new content, and that you didn't properly indent new
> content that you added. But that's just that, a stab in the dark. You
> didn't provide enough information for even an educated guess.



I think you confused the logical level of what I meant by file 
position:
Of course it's not (necessarily) about writing back to the same 
open file (OS level), but about the whole serialization string, 
wherever it is finally written to. I typically write the 
auto-converted HTML files to a second test folder first and want 
to use diff -u ... to see, in human-readable form, what changes 
happened. That again is only reasonable if the original layout is 
preserved as well as possible.


With lxml and BeautifulSoup, for example: load and parse an HTML 
file into a tree, immediately serialize the tree without changes, 
and you see big differences between the original and the 
serialized file for almost any file.


The main issue: those libs seem not to track any info about the 
original string/file positions of the objects they parse. They 
just forget the past. Thus, in principle, they cannot do what I 
want, it seems ...


Or does anybody see attributes of the tree objects which I 
overlooked? Or a lib which can do this source-back-connected 
editing, or at least enables it better?



Robert


Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread M.-A. Lemburg
Robert wrote:
> I think you confused the logical level of what I meant by file
> position:
> Of course it's not (necessarily) about writing back to the same open
> file (OS level), but about the whole serialization string, wherever it
> is finally written to. I typically write the auto-converted HTML files
> to a second test folder first and want to use diff -u ... to see, in
> human-readable form, what changes happened. That again is only
> reasonable if the original layout is preserved as well as possible.
>
> With lxml and BeautifulSoup, for example: load and parse an HTML file
> into a tree, immediately serialize the tree without changes, and you
> see big differences between the original and the serialized file for
> almost any file.
>
> The main issue: those libs seem not to track any info about the original
> string/file positions of the objects they parse. They just forget the
> past. Thus, in principle, they cannot do what I want, it seems ...
>
> Or does anybody see attributes of the tree objects which I overlooked?
> Or a lib which can do this source-back-connected editing, or at least
> enables it better?

You'd have to write your own parser (or extend the example HTML
one we include), but mxTextTools allows you to work on original
code quite easily: it tags parts of the input string with objects.

You can then have those objects manipulate the underlying text as
necessary and write back the text using the original formatting
plus your local changes.
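
The write-back step described above can be sketched independently of any particular tagging library. This is plain Python, not the mxTextTools API; the helper `apply_local_edits` and the sample markup are hypothetical. Each edit is a (start, end, replacement) span over the original string, and everything outside the spans is copied through byte for byte.

```python
def apply_local_edits(source, edits):
    """Return source with each (start, end, replacement) span swapped in.

    Text outside the edited spans is copied through unchanged.
    """
    result = []
    cursor = 0
    for start, end, replacement in sorted(edits):
        if start < cursor:
            raise ValueError("overlapping edits")
        result.append(source[cursor:start])  # untouched original text
        result.append(replacement)
        cursor = end
    result.append(source[cursor:])
    return "".join(result)


source = '<p>\n    <img src="a.png">\n</p>\n'
start = source.index("<img")
end = source.index(">", start) + 1
patched = apply_local_edits(source, [(start, end, '<img src="a.png" alt="Logo">')])
print(patched)
```

Because only the tagged span is rewritten, a later `diff -u` against the original shows just the one changed line.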

http://www.egenix.com/products/python/mxBase/mxTextTools/

-- 
Marc-Andre Lemburg
eGenix.com



Re: HTML Parser which allows low-keyed local changes?

2010-02-01 Thread Nobody
On Sun, 31 Jan 2010 20:57:31 +0100, Robert wrote:

> I tried lxml, but after walking and making changes in the element
> tree, I'm forced to do a full serialization of the whole document
> (etree.tostring(tree)) - which destroys the human-edited format
> of the original HTML code.
> It makes it rather unreadable.
>
> Is there an existing HTML parser which supports tracking/writing
> back particular changes in a cautious way by just making local
> changes? Or at least tracks the tag start/end positions in the file?

HTMLParser, sgmllib.SGMLParser and htmllib.HTMLParser all allow you to
retrieve the literal text of a start tag (but not an end tag).
Unfortunately, they're only tokenisers, not parsers, so you'll need to
handle minimisation yourself.
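
A minimal sketch of retrieving the literal start-tag text, assuming Python 3, where the `HTMLParser` module lives at `html.parser` (`sgmllib` and `htmllib` no longer exist there). The class name `StartTagEcho` and the sample markup are made up.

```python
from html.parser import HTMLParser


class StartTagEcho(HTMLParser):
    """Collect the literal source text of every start tag."""

    def __init__(self):
        super().__init__()
        self.literal_tags = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs arrive normalised (lowercased names, unquoted
        # values); get_starttag_text() returns the text as written.
        self.literal_tags.append((tag, attrs, self.get_starttag_text()))


parser = StartTagEcho()
parser.feed("<A  HREF='x.html'   Class=nav>link</A>")
parser.close()
for tag, attrs, literal in parser.literal_tags:
    print(tag, attrs, repr(literal))
```

The normalised view is what a tree would be built from, while the literal text keeps the author's casing, quoting and spacing.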



Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Tim Arnold

Robert no-s...@non-existing.invalid wrote in message 
news:hk729b$na...@news.albasani.net...
> Stefan Behnel wrote:
>> Robert, 01.02.2010 14:36:
>>> Stefan Behnel wrote:
>>>> Robert, 31.01.2010 20:57:
>>>>> I tried lxml, but after walking and making changes in the element tree,
>>>>> I'm forced to do a full serialization of the whole document
>>>>> (etree.tostring(tree)) - which destroys the human edited format of the
>>>>> original HTML code. makes it rather unreadable.
>>>> What do you mean? Could you give an example? lxml certainly does not
>>>> destroy anything it parsed, unless you tell it to do so.
>>> of course it does not destroy during parsing.(?)


I think I understand what you want, but I don't understand why yet. Do you 
want to view the differences in an IDE or something like that? If so, why 
not pretty-print both and compare that?
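
One way to read "pretty-print both and compare" is sketched below, using only the standard library. It is a crude normaliser rather than a real HTML pretty-printer: it ignores comments and doctypes, and the names `OneTokenPerLine` and `normalise` as well as the sample markup are invented for this example.

```python
import difflib
from html.parser import HTMLParser


class OneTokenPerLine(HTMLParser):
    """Crude normaliser: one tag or text chunk per line, whitespace collapsed."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            parts.append(name if value is None else f'{name}="{value}"')
        self.lines.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.lines.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:  # skip whitespace-only chunks between tags
            self.lines.append(text)


def normalise(html_text):
    parser = OneTokenPerLine()
    parser.feed(html_text)
    parser.close()
    return parser.lines


before = '<p>\n    <img src="a.png">\n    Hello\n</p>\n'
after = '<p><img src="a.png" alt="Logo"> Hello</p>'
diff = list(difflib.unified_diff(normalise(before), normalise(after),
                                 "before", "after", lineterm=""))
print("\n".join(diff))
```

This shows what changed logically even when the serializer rewrote the layout, but it does not preserve the hand-edited formatting of the output file itself.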
--Tim




HTML Parser which allows low-keyed local changes?

2010-01-31 Thread Robert
I tried lxml, but after walking and making changes in the element 
tree, I'm forced to do a full serialization of the whole document 
(etree.tostring(tree)) - which destroys the human-edited format 
of the original HTML code.

It makes it rather unreadable.

Is there an existing HTML parser which supports tracking/writing 
back particular changes in a cautious way by just making local 
changes? Or at least tracks the tag start/end positions in the file?



Robert