On Wed, Nov 2, 2011 at 9:17 PM, Terry Brown <terry_n_br...@yahoo.com> wrote:

> If it helps at all, the ElementTree model is that elements have
> both .text and .tail attributes.

[snip]

> Python's flagship XML library is lxml of course, http://lxml.de/, but it
> basically uses ElementTree for element representation.

I don't want to use specialized parsers for any language.  The problem
isn't parsing, it's code generation and (especially) checking.

Edward

P.S., here is the common tokenizer for the import code.  It's hard to
imagine anything simpler::

    def tokenize (self,s):
        '''Return a list of (kind,val,n) tuples: the token kind, its
        text, and the number of newlines that precede it in s.'''
        result,i,n = [],0,0
        while i < len(s):
            progress = j = i
            if s[i] == '\n':
                i,kind = i+1,'nl'
            elif s[i].isspace():
                i,kind = self.skipWs(s,i),'ws'
            elif self.startsComment(s,i):
                i,kind = self.skipComment(s,i),'comment'
            elif self.startsString(s,i):
                i,kind = self.skipString(s,i),'string'
            elif self.startsId(s,i):
                i,kind = self.skipId(s,i),'id'
            else:
                i,kind = i+1,'other'

            # Every branch must consume at least one character.
            assert progress < i and j == progress
            val = s[j:i]
            result.append((kind,val,n))
            n += val.count('\n')
            # g.trace('%3s %7s %s' % (n,kind,repr(val[:20])))

        return result

But as I write this, I see that there is something simpler. There
should be an isSpace method that replaces the test::

    s[i].isspace()

Put this test before the "raw" test for newline, and have the html
version of skipWs suck all contiguous whitespace into a single 'ws'
token whose val is always a single blank.
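
A minimal sketch of that idea, under stated assumptions: BaseScanner,
HtmlScanner and wsVal are hypothetical names invented for this
illustration, and only the 'ws', 'nl' and 'other' kinds are shown.
The line count uses the raw text, because the html token's val no
longer contains the newlines it swallowed:

```python
class BaseScanner:
    """Hypothetical stand-in for the common scanner base class."""

    def isSpace(self, s, i):
        # Base class: a newline is not ordinary whitespace, so it
        # still produces its own 'nl' token.
        return s[i].isspace() and s[i] != '\n'

    def skipWs(self, s, i):
        # Skip everything that isSpace accepts.
        while i < len(s) and self.isSpace(s, i):
            i += 1
        return i

    def wsVal(self, s, j, i):
        # Hypothetical hook: the val stored for a 'ws' token.
        return s[j:i]

    def tokenize(self, s):
        result, i, n = [], 0, 0
        while i < len(s):
            j = i
            if self.isSpace(s, i):  # Tested before the raw newline test.
                i, kind = self.skipWs(s, i), 'ws'
                val = self.wsVal(s, j, i)
            elif s[i] == '\n':
                i, kind = i + 1, 'nl'
                val = s[j:i]
            else:
                i, kind = i + 1, 'other'
                val = s[j:i]
            assert j < i
            result.append((kind, val, n))
            # Count newlines in the raw text, not in val.
            n += s[j:i].count('\n')
        return result


class HtmlScanner(BaseScanner):
    """Hypothetical html subclass: all whitespace is equivalent."""

    def isSpace(self, s, i):
        return s[i].isspace()  # Newlines count as whitespace here.

    def wsVal(self, s, j, i):
        return ' '  # Every whitespace run becomes a single blank.


base_tokens = BaseScanner().tokenize("a \n b")
html_tokens = HtmlScanner().tokenize("a \n  b")
```

With this arrangement the html checker compares one blank against
one blank, however the whitespace was laid out in the source.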

This does *not* change the code generators: it only simplifies the
checking logic, which is where the complications are.  For html, there
is still the problem of leading whitespace in comments.

The answer to *that* is to replace all the skipX methods with
skipXtoken methods.  They will return (i,kind,val) with the obvious
defaults in the base class that can be overridden in subclasses,
especially the html parser class.  This creates a general framework
for doing any kind of token munging as required by the verification
code.  It's good, and good enough.
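
Roughly, and only as a sketch: the class names, the html-style
comment delimiters in the base class, and the choice to strip leading
whitespace from each comment line are all assumptions made for this
illustration, not the actual import code:

```python
class BaseScanner:
    """Hypothetical base class whose skipXtoken methods return (i,kind,val)."""

    # Assumed delimiters, chosen only so the example is self-contained.
    commentStart, commentEnd = '<!--', '-->'

    def startsComment(self, s, i):
        return s.startswith(self.commentStart, i)

    def skipWsToken(self, s, i):
        j = i
        while i < len(s) and s[i] in ' \t':
            i += 1
        return i, 'ws', s[j:i]

    def skipCommentToken(self, s, i):
        # Default: val is the raw text of the comment.
        j = i
        k = s.find(self.commentEnd, i + len(self.commentStart))
        i = len(s) if k == -1 else k + len(self.commentEnd)
        return i, 'comment', s[j:i]

    def tokenize(self, s):
        result, i, n = [], 0, 0
        while i < len(s):
            j = i
            if s[i] == '\n':
                i, kind, val = i + 1, 'nl', '\n'
            elif s[i] in ' \t':
                i, kind, val = self.skipWsToken(s, i)
            elif self.startsComment(s, i):
                i, kind, val = self.skipCommentToken(s, i)
            else:
                i, kind, val = i + 1, 'other', s[i]
            assert j < i
            result.append((kind, val, n))
            # val may have been munged, so count newlines in the raw text.
            n += s[j:i].count('\n')
        return result


class HtmlScanner(BaseScanner):
    """Hypothetical html subclass that munges comment tokens."""

    def skipCommentToken(self, s, i):
        i, kind, val = super().skipCommentToken(s, i)
        # Drop leading whitespace on every line of the comment.
        val = '\n'.join(line.lstrip() for line in val.split('\n'))
        return i, kind, val


source = "<!-- a\n   b -->\nx"
base_tokens = BaseScanner().tokenize(source)
html_tokens = HtmlScanner().tokenize(source)
```

The tokenize loop never looks inside val, so a subclass can munge a
token however the checker needs without touching the loop itself.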

Edward
