> It can be done, but it is a bit of a pain and requires rewriting far more
> than it should.  The problem is that the parser relies (at many levels) on
> the fact that a space separates tokens.

        I think we're confusing spaces in a page's body "text" with spaces
in the URI "links" found when that page is parsed.

> So to do it right is not a one-liner, and to do it ugly requires a
> pre-scan of every single document, which will slow the system down for
> little gain - unless you happen to be reading sites with that particular
> problem.

        It does? Can't you just extract all of the link tokens in a page
found in 'a' elements, walk through the value of each href attribute, and
convert any spaces found? That's what I do here, and it takes about .0001ms
for a page with roughly 2,048 separate URLs. Surely Python can do the same,
no?
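For what it's worth, a rough sketch of that kind of pass in Python (the function name is hypothetical, and for brevity it only handles double-quoted href values; a real pass would also cover single-quoted and unquoted attributes):

```python
import re

# Match only the value inside href="..." so body text is left untouched.
HREF_RE = re.compile(r'(href=")([^"]*)(")', re.IGNORECASE)

def escape_href_spaces(html):
    """Percent-encode raw spaces found inside href attribute values."""
    return HREF_RE.sub(
        lambda m: m.group(1) + m.group(2).replace(' ', '%20') + m.group(3),
        html)
```

So `<a href="/a b.html">x y</a>` becomes `<a href="/a%20b.html">x y</a>`, with the space in the link text "x y" left alone.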

> umm ... they're not unrelated to plucker.

        I was referring to things that contain completely unparsable
content, like incorrectly-nested tags, that god-awful pods:// and avantgo://
scheme, MS-HTML, and other obvious non-HTML things found in a page. How much
work do we as developers have to do, just to compensate for something that
isn't even our problem to fix? When bugs are reported for broken links and
the like, what do we do? Add a new patch or fix for each one?

        I'm more of the opinion that site-specific exclusions and fixes
should be inside a mini macro-language that Plucker can use, much like
SiteScooper, where you have a template that governs how the content will be
treated. For example, theonion.com uses that pods:// scheme from AvantGo,
because they assume AvantGo clients are the only ones to hit that page. A
template for theonion could easily translate that scheme into http://
instead. Hacking this into the parser itself, for a very small percentage of
users who would actually use it, doesn't make good use of development time
or effort.
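To make the template idea concrete, here is a minimal sketch of what a per-site rule table could look like (all names and the rule format are hypothetical, not anything Plucker or SiteScooper actually implements):

```python
import re

# Hypothetical per-site rules: each entry maps a site to a list of
# (pattern, replacement) rewrites applied to extracted URLs before fetching.
SITE_RULES = {
    'theonion.com': [
        # AvantGo-specific schemes translated back to plain http://
        (re.compile(r'^(?:pods|avantgo)://', re.IGNORECASE), 'http://'),
    ],
}

def rewrite_url(site, url):
    """Apply any site-specific URL rewrites; unknown sites pass through."""
    for pattern, replacement in SITE_RULES.get(site, []):
        url = pattern.sub(replacement, url)
    return url
```

The point being that the fix lives in a data-driven template, not in the parser code itself.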

        Look at bugs 100, 619, 651, 455, 618, 603, and dozens of others for
some examples of the kinds of bugs that get reported, which Plucker can't
easily compensate for.

> All the user can tell is that plucker fails on certain documents, in ways
> that are difficult to predict until you've already synced up, left the
> computer -- and found that your document isn't there after all.

        Indeed. Perhaps we need the ability to sense "bad html" (an upstream
issue) vs. "improperly parsed html" (a possible parser issue), and either
report those kinds of errors to the user or work around them.

        As I've mentioned before, I've been doing a ton of work to get all
of the garbage HTML I parse out of the way, corrected, and validated, before
I parse it into Plucker format... and I cover all of the things discussed
here already, but you can never compensate for everything. There is a point
of diminishing returns in spending 3 hours debugging a parser issue for the
less-than-1% of users who may run into it. But, like I've said, that's just
my time... anyone else is more than welcome to spend some of their own to
diagnose and fix the issue, as long as the fix doesn't degrade existing
functionality.


d.

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list