Re: simple xml parsing within html

G.W. Haywood Thu, 9 Dec 1999 06:25:57 -0800
Hi all,

On Wed, 8 Dec 1999, Alex Menendez wrote in the mod_perl list:

> I currently have developed a dynamic content engine in mod_perl
> I initially tried to do this by subclassing HTML::Parser and over-riding
> the usual methods. However, this was painfully slow
> 
> any suggestions on making HTML::Parser work faster

Performance is a real issue in mod_perl systems, so I've put some work
into this.  Maybe it will spawn a thread.

Here are a couple of suggestions for speeding up HTML::Parser.

Apparently the author is looking at this at present too.  He will know
a lot more about what the code is doing than I do.

My background is in engineering, mostly real-time instrumentation and
control.  I have never used these methods in text processing but the
principles are the same.  I guess you have some leeway here to decide
if it's worth all the effort - it can be a lot of effort.  Controlling
machines, there often isn't a choice in the matter.  These techniques
need careful thought in their application and they also need to be
followed up by testing in `real' conditions.  Without testing, you may
find that they do serious damage rather than improve performance.

Please also note that you asked for _suggestions_ and that's what 
these are.  I haven't

(a) considered all the options and consequences,
(b) tried it in practice in this case and
(c) perhaps most importantly, heard what the mod_perl group in general
    and what Christiansen, Schwartz, Wall et al. in particular will
    say if they see this.  I am especially aware of my own limitations
    as an inexperienced Perl user.  Holds breath in anticipation.

I have attached a copy of HTML::Parser with which I have taken great
liberties.  For that I apologize unreservedly to the author.  I do not
mean to imply that there is any bad practice in its construction.  In
fact I haven't any idea what most of it does.  However, I can see some
similarities in the constructs which may possibly allow the code to be
made a little quicker.  This will be at the expense of some elegance,
a few extra bytes of compiled code and possibly some complication in
maintenance.

Tradeoffs abound.

Hopefully the formatting of the attachment will make clear what it is
I'm trying to do.  Of course Perl itself might do a better job of some
or all of this.  That's one reason for testing.  It depends how well
any optimizers do their bit, and crucially what assumptions they make.
For example, few optimizers make the kind of guesses that I am about
to make in [B] below.  Or if they do, I won't be happy using them.

To apply the methods below I rely to some extent on the fact that Perl
(or C, or Pascal, whatever) is to a first approximation a free format
language.  Even if it were not, you could do most of what follows and
simply trust your expertise in the language to avoid altering the
effective logic.

[A] [A1] Remove as many comments as feels comfortable.  In this case
         I was comfortable without any comments.  Single line Perl `#'
         comments _must_ be removed from within blocks at [A3] below,
         because on a joined line a `#' hides everything after it!
    [A2] Reformat the code by hand so that firstly if..elsif...else
         clauses are stacked as per the attachment and secondly loops
         and especially nested loops are clearly visible.
    [A3] As far as possible format entire blocks into one-liners, but
         be aware that blocks within blocks may need special attention.
         The one liners may be very long lines but that's unimportant.
    [A4] Make sure that the code in this format behaves identically to
         the old, if that's what you want.  Compiling both versions
         and comparing the results is a good check; a Perl to C
         translator might help here, for example.

    Depending on the complexity of the code, some of the above may not
    be necessary.  Simple inspection may do it for you.  I like to make
    the whole thing as dense as possible, and sometimes I print it out
    to get a clearer overall picture.

[B] Make some plausible assumptions about the input.  For example in
    this case I am assuming that the input is to some degree under the
    control of the person running the code, and that it is more likely
    that a snippet of HTML which looks something like this:

    <...variable="value"...>

    will usually be written exactly like that, or in this way:

    <...variable = "value"...>

    than with some arbitrary amount of embedded arbitrary white space.

[C] Add a few statements which will not change the results (!) but
    which will result in fewer (and possibly quicker) tests being
    made.  Where there are several different and complex tests which
    share some common and readily disprovable feature, a simple test
    for that feature first may entirely avoid the need for the others.
    In this case we can short-circuit some tests containing relatively
    complex REGULAR EXPRESSIONS in which the first character is the
    equals sign by making a couple of tests for simple two-character
    STRINGS which we think are the most likely to occur in the real
    input.  If we are wrong it all goes pear-shaped (==suboptimal),
    another reason for testing.  I've marked an example [2] in the
    attachment.  My confidence coefficient here is low for various
    reasons, but I doubt if Perl would spot this.

    BEWARE side-effects, and the order things get done - or not done.
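
A minimal sketch of the idea in [C], not taken from the attachment: the
helper name skip_equals and the two "likely" spellings are assumptions
made for illustration only.  Two cheap substr() comparisons are tried
first, and the general regular expression runs only if both miss.  The
fast paths are written to leave the buffer exactly as the regular
expression would.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): remove the '=' that
# separates an attribute name from its value at the front of $$bufref.
# Returns 1 if a separator was removed, 0 otherwise.
sub skip_equals {
    my ($bufref) = @_;

    # Fast path 1: the spelling assumed to be most common, '="'.
    # substr() compares literal characters; no regex is run.
    if (substr($$bufref, 0, 2) eq '="') {
        substr($$bufref, 0, 1, '');    # drop only the '='
        return 1;
    }

    # Fast path 2: ' = "' with exactly one space on each side.
    if (substr($$bufref, 0, 4) eq ' = "') {
        substr($$bufref, 0, 3, '');    # drop ' = ', keep the quote
        return 1;
    }

    # General case: arbitrary white space around '='.
    return $$bufref =~ s/^\s*=\s*// ? 1 : 0;
}
```

If the guesses about the input are wrong, every call pays for two
extra comparisons before reaching the regular expression, which is why
this needs measuring on real input.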

[D] Where there are loops within loops, optimizing the inner loops for
    speed may give huge improvements if the loops are executed often.
    In the example I have added a test labeled [1] which does both of
    the things mentioned in [C] and [D].  I think.
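
One possible shape for [D], as a sketch only: count_attrs and its
arguments are hypothetical and the matching is deliberately naive (it
will also match inside attribute values).  The pattern is built once
outside both loops, and a cheap index() test skips the inner loop for
tags that cannot match.  It assumes the list of names is not empty.

```perl
use strict;
use warnings;

# Hypothetical example: count how often any of @$names is followed by
# '=' in a list of tag strings.
sub count_attrs {
    my ($tags, $names) = @_;

    # Hoisted out of both loops: one pattern, compiled once, instead of
    # interpolating each name into a regex on every pass.
    my $alt = join '|', map { quotemeta($_) } @$names;
    my $re  = qr/\b(?:$alt)\s*=/;

    my $count = 0;
    for my $tag (@$tags) {
        # Cheap rejection: without an '=' there is nothing to find.
        next if index($tag, '=') < 0;

        # Inner loop: one iteration per match in this tag.
        $count++ while $tag =~ /$re/g;
    }
    return $count;
}
```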

[E] Look carefully at called subroutines and other modules, especially
    if they are called from within loops.  In some cases it is helpful
    to replace subroutines with inline code using macro substitutions,
    but in this case I doubt there could be much gain.  In other cases
    it might be appropriate to `drop down' into C or even assembler,
    where the performance gains can be astonishing.
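
To make the inlining in [E] concrete, here is a small sketch with
invented names (normalize, names_with_call, names_inline).  Both
versions return the same list; the second repeats the body of the
helper inside the loop so that no subroutine call is made per element.
Whether the saving matters is a question for measurement.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a tag name and strip a trailing '/'.
sub normalize {
    my ($name) = @_;
    $name = lc $name;
    $name =~ s{/\z}{};
    return $name;
}

# Version 1: one subroutine call per element.
sub names_with_call {
    my @out;
    push @out, normalize($_) for @_;
    return @out;
}

# Version 2: the same logic written inline, with no call in the loop.
sub names_inline {
    my @out;
    for (@_) {
        my $name = lc $_;
        $name =~ s{/\z}{};
        push @out, $name;
    }
    return @out;
}
```

The cost is the one named earlier: the logic now lives in two places
and both must be kept in step.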

[F] Make sure that the new code does what you want, or what it's
    supposed to do, whichever better pleases you.
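
A simple way to approach [F] is to run the old and new code over the
same inputs and list any disagreements.  The sketch below assumes both
versions can be wrapped as code references taking one string and
returning one string; find_mismatches and the two sample routines are
made up for the purpose.

```perl
use strict;
use warnings;

# Return the inputs on which the two implementations disagree.
sub find_mismatches {
    my ($old, $new, @inputs) = @_;
    return grep { $old->($_) ne $new->($_) } @inputs;
}

# Hypothetical pair: a general regex and a "fast path" rewrite of it.
my $old = sub { my $s = shift; $s =~ s/^\s*=\s*//; $s };
my $new = sub {
    my $s = shift;
    return substr($s, 1) if substr($s, 0, 2) eq '="';
    $s =~ s/^\s*=\s*//;
    return $s;
};

my @samples = ('="a"', ' = "b"', "=\t'c'", 'no separator', '');
my @bad = find_mismatches($old, $new, @samples);
print @bad ? "MISMATCH on: @bad\n" : "all samples agree\n";
```

Agreement on a handful of samples proves little; the sample set should
include the awkward cases the fast path was written to skip.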

[G] Measure the performance gain/loss.  You may find it advantageous
    to do the exercise in stages, testing different optimizations.
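
For [G], the standard Benchmark module is one way to get numbers.  The
two routines and the input below are placeholders, chosen to match the
"likely" input assumed in [B]; real measurements need real input.  No
result is shown here because it depends on the Perl version and the
machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Placeholder input: 1000 copies of the spelling assumed to be common.
my @input = ('="value"') x 1000;

# Candidate 1: always use the general regex.
sub with_regex {
    my $n = 0;
    for (@input) { $n++ if /^\s*=\s*"/ }
    return $n;
}

# Candidate 2: try a literal comparison first, then the regex.
sub with_substr {
    my $n = 0;
    for (@input) {
        $n++ if substr($_, 0, 2) eq '="' || /^\s*=\s*"/;
    }
    return $n;
}

# Run each candidate for at least 1 CPU second and print a table.
cmpthese(-1, {
    'regex_only'   => \&with_regex,
    'substr_first' => \&with_substr,
});
```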

[H] If all else fails, sprinkle the code with counters and throw some
    representative input at it.  Muse on the results over strong coffee
    with a few colleagues.
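
The counters in [H] can be as simple as a hash keyed by branch.  In
this sketch classify and its branch labels are invented; the point is
only to see which arm of a stacked if..elsif actually gets taken on
the input supplied.

```perl
use strict;
use warnings;

my %hits;    # branch label => number of times that branch was taken

sub classify {
    my ($s) = @_;
    if    (substr($s, 0, 2) eq '="') { $hits{fast_quote}++; return 'quoted' }
    elsif ($s =~ /^\s*=\s*"/)        { $hits{slow_quote}++; return 'quoted' }
    elsif ($s =~ /^\s*=/)            { $hits{bare}++;       return 'bare'   }
    else                             { $hits{none}++;       return 'none'   }
}

# Stand-in for representative input.
classify($_) for ('="a"', '="b"', ' = "c"', '=d', 'e');

# Report the counters, one branch per line.
printf "%-12s %d\n", $_, $hits{$_} for sort keys %hits;
```

If the first branch does not collect most of the hits on real input,
the ordering of the tests is wrong for that input.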

In this case I haven't gone through the module with a fine-toothed
comb and I haven't done [A4], [E], [F], [G] or [H] at all, as that
obviously is your problem...

These methods will be useful for improving speed in any code which
uses loops and branches (and in Perl, substitutions and matches) in
this way.  That might mean many scripts for parsing Markup Languages.

Hoping this is of some use to somebody,

73
Ged.