Laurens:
> # Start of fix for spaces
> r1 = re.compile(r' ', re.IGNORECASE)
> url = r1.sub('%20', str(url), 0)
> # End of fix for spaces
> url = string.strip(str(url))
Yes, that will replace spaces (including trailing spaces)
with a %20. The problem is in getting the right thing
passed to CleanUrl in the first place.
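For what it's worth, the same substitution (plus other unsafe characters) is already covered by the standard library; a minimal sketch in modern Python, assuming the goal is only percent-encoding (clean_url here is a hypothetical helper, not Plucker's CleanUrl):

```python
from urllib.parse import quote

def clean_url(url):
    # Strip surrounding whitespace, then percent-encode spaces and
    # other unsafe characters, leaving URL delimiters intact.
    return quote(str(url).strip(), safe="/:?&=#%")

print(clean_url("http://example.com/lost weekend.html"))
# → http://example.com/lost%20weekend.html
```

This still doesn't solve the real problem of getting the right string passed in to begin with.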
"David A. Desrosiers":
> > Note that the URL is not quoted. I'll agree that it
> > should be, but the standard doesn't require it, and
> > it often isn't.
> The standard actually _does_ require it:
> http://www.w3.org/TR/html401/intro/sgmltut.html
> "By default, SGML requires ... We recommend using
> quotation marks even when it is possible to eliminate them."
HTML is based on SGML, but does not follow precisely the
same rules. This is one of the times when HTML does not
require it (though it is recommended).
> > But that means this [<a href="lost weekend" ...] gets parsed as
> > attr1: href="lost
> > attr2: weekend.html" (no value)
> By what tool? I think the tool/lib you are using is
> flawed, if it does this.
It is. That was why I mentioned submitting a bugfix for string
parsing to the Python project. (Alas, my fix still won't fix this
particular problem ... to do that, you would need to subclass
the string class to treat quoted subsections specially when
splitting, and then you would need to call this newer/better/slower
split everywhere you tokenize, including places where the work is
now done by the standard libraries.)
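A quote-aware split does already exist in the standard library, though; a minimal sketch using shlex (the attribute string is a made-up example, not the actual tokenizer call site):

```python
import shlex

attrs = 'href="lost weekend.html" target="_blank"'

# A naive whitespace split breaks the quoted value in two:
print(attrs.split())
# → ['href="lost', 'weekend.html"', 'target="_blank"']

# shlex honors quoting, so the quoted value stays in one token
# (the quotes themselves are consumed):
print(shlex.split(attrs))
# → ['href=lost weekend.html', 'target=_blank']
```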
> http://www.gnu-designs.com/code/test in\g.html
[ about using %20 to really mean %20, and whether \ <=> / ]
On this topic ... does anyone know what the "..." as ..\.. support
is for? It seems to be in the Windows-only code, but Windows
2000 doesn't support it. (I keep forgetting to check when near
other Windows boxes.)
One thing I hoped to do once patches are being accepted again
is to simplify the parser by using the standard library as often
as possible - in many cases, the current Python library has
fewer bugs than the current PyPlucker, probably because
more people are using it.
>> [Degrade gracefully]
> I agree, but we still may fail on some things we've never
> encountered before, and fixing/updating the distillers/parsers
> to work with that shouldn't require patches to the parser
I agree that it should be possible to apply some processing rules.
At the moment, it is not, unless either they happen before the
channel starts (before the pages are fetched) or after it finishes
(when the pdb file is already written). I'm working on this, but
don't know how far I'll get before I need to start over with a new
parser.
> [what to do about bad tags]
The standard says to render the content, but not the tagname
or attributes. I think it would be better to give the user an
option, but that waits for the cleanup, if I do it.
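That behavior falls out naturally of a stream-based parser; a minimal sketch with the standard html.parser module (the tag whitelist is a made-up example, not PyPlucker's actual set):

```python
from html.parser import HTMLParser

KNOWN = {"p", "b", "i", "a", "br"}

class LenientParser(HTMLParser):
    """Keep the content of unknown tags, but drop the tag itself."""
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag in KNOWN:           # unknown tags are simply skipped
            self.out.append(f"<{tag}>")
    def handle_endtag(self, tag):
        if tag in KNOWN:
            self.out.append(f"</{tag}>")
    def handle_data(self, data):
        self.out.append(data)      # content is always rendered

p = LenientParser()
p.feed("<p>ok <blink>still shown</blink></p>")
print("".join(p.out))
# → <p>ok still shown</p>
```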
> > I do think it would be useful to say "x pages, y kilobytes,
> > z problems" and to pop up a warning if there are problems,
> > or if the size is very different from expected.
> Size of what?
Actually, any size. I was thinking of the pdb size, since that
is what I currently use as a check. Most channels are fairly
stable in size, so long as I don't change parameters. When
a 90K channel becomes 60K or 125K, it may mean something
went wrong. When it becomes 6K or 315K, it has *always*
meant that something went wrong.
I assume that similar logic would hold for pretty much anything
you measured. [number of links followed, size of response before
parsing, size of html before compression, etc...]
> > The python distiller can put out that information, but
> > the desktop doesn't display it or act on it. There is
> > nothing anywhere to pop up warnings that
> > the pluck should be checked before going home.
> You could just have an error.log created for that
> channel, and if the desktop component sees ...
For me personally, I now get the information.
It should be integrated into the parser and desktop so that
users won't have to understand the entire project before
they can validate individual channels.
> > I also haven't seen a good way to see what it plucked
> > before syncing, though I think there may be viewers
> > out there - just not included with the main package.
> I'm not sure what you mean here. You mean write
> the files to disk, before they are concatenated into the
> final .pdb file? Doesn't the Python parser still support
> the caching of files to disk?
Actually, I mean "read the .pdb file", because a fair number
of errors have come after everything is fetched correctly.
But no, the parser doesn't seem to support disk caching.
It can save the files to a cache instead of a PDB, but I
haven't found a way to turn the cache into a PDB; it just
refetches and you hope the next time works as well as the
first. (Also, the cache format isn't terribly convenient,
because of file renaming - but that is another issue.)
-jJ
_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list