spaces in URLs and other nonstandard html

Jewett, Jim J Fri, 06 Jun 2003 03:35:15 -0700

MJ Ray wrote:
> Personally, I suspect it's not possible to write such a 
> "damage-free" fix, else one of the  supporters who 
> writes such huge essays on the topic would have 
> done it by now.


It can be done, but it is a bit of a pain and requires 
rewriting far more than it should.  The problem is
that the parser relies (at many levels) on the fact
that a space separates tokens.

It should not separate tokens _within_a_quoted_string_,
but keeping track of when you're in a quote gets a bit
hairy.  

(What's "quoted' here?" This") -> 
        what 
        's "quoted'
        here
        "This"

Because it is hairy, it is slow.
Because it is hairy and slow, it isn't quite supported by 
all the Python standard libraries.
Because it isn't quite supported, and isn't often used, it is buggy.
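
To make the "keeping track of when you're in a quote" part concrete, here
is a minimal sketch in Python. It is not Plucker's parser and not the
urllib patch; the name split_outside_quotes is made up for illustration,
and it assumes a quote is closed only by the same character that opened it.

```python
def split_outside_quotes(text):
    """Split on whitespace, except inside single- or double-quoted runs."""
    tokens = []
    current = []   # characters of the token being built
    quote = None   # the quote character we are inside, or None

    for ch in text:
        if quote:
            # Inside a quote: keep everything, including spaces.
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            # Opening quote: remember which character must close it.
            quote = ch
            current.append(ch)
        elif ch.isspace():
            # Unquoted whitespace ends the current token, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)

    # Flush the last token, even if a quote was never closed.
    if current:
        tokens.append("".join(current))
    return tokens


print(split_outside_quotes('a href="my page.html" class=x'))
# ['a', 'href="my page.html"', 'class=x']
```

Even this naive rule shows the trouble: a stray apostrophe, as in
What's, opens a "quote" that swallows everything up to the next
apostrophe or the end of the input.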

(I have submitted a patch to Python's urllib related to 
quoted-string parsing for attribute lists, but it isn't 
available in the version currently used by Plucker, and
it might not be used for strings that are expected to hold
only a single value.)

So to do it right is not a one-liner, and to do it ugly 
requires a pre-scan of every single document, which
will slow the system down for little gain - unless you
happen to be reading sites with that particular problem.
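
For what it's worth, the "ugly" pre-scan could look roughly like the
sketch below. This is an assumption about one way to do it, not code from
Plucker: the function name escape_spaces_in_urls is hypothetical, it only
looks at quoted href and src values, and it runs a regular expression over
the whole document, which is exactly the extra pass that costs time.

```python
import re

# Matches href="..." or src='...' with a quoted value (case-insensitive).
# Group 1: attribute name and '=', group 2: the quote character,
# group 3: the attribute value up to the matching quote.
_URL_ATTR = re.compile(r'''(\b(?:href|src)\s*=\s*)(["'])(.*?)\2''',
                       re.IGNORECASE)


def escape_spaces_in_urls(html):
    """Percent-encode spaces inside quoted href/src attribute values."""
    def fix(match):
        prefix, quote, url = match.groups()
        # Drop leading/trailing spaces, encode the ones in the middle.
        return prefix + quote + url.strip().replace(" ", "%20") + quote

    return _URL_ATTR.sub(fix, html)


print(escape_spaces_in_urls('<a href="my page.html">my page</a>'))
# <a href="my%20page.html">my page</a>
```

Text outside the attribute values is left alone, and unquoted attribute
values are not handled at all, which is part of why this is the ugly
route rather than the right one.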



Tony McNamara:
> This group more than any other I consort with often 
> slips  into zealotry about such things.

David A. Desrosiers:
>       Do you know why? Because we get more bugs 
> reported for things that are completely unrelated 
> to Plucker, but which "affect" Plucker to the user.

Umm ... they're not unrelated to Plucker.

Plucker is a document-reader which can read web pages.
The user doesn't control the web page.

All the user can tell is that Plucker fails on certain 
documents, in ways that are difficult to predict until
you've already synced up, left the computer -- and
found that your document isn't there after all.

-jJ