> > So to do it right is not a one-liner, and to do it ugly 
> > requires a pre-scan of every single document, which 
> > will slow the system down for little gain - unless you
> > happen to be reading sites with that particular
> > problem.

>       It does? Can't you just extract all of the link tokens 
> in a page found in 'a' elements, and run through the 
> value in their href attribute and convert spaces found? 

Example1:
<a href=../images/smily.gif alt="Happy Face">See the Joy!</a>

Note that the URL is not quoted.  I'll agree that it should be,
but the standard doesn't require it, and it often isn't.

We need to extract the URL as "../images/smily.gif"
We do not want to extract it as "../images/smily.gif alt="

So we assume that a space ends the attribute value and 
anything afterwards is the name of the next attribute.

Example2:
<a href="lost weekend.html" title="pics from last weekend">pics!</a>

But that means this gets parsed as

attr1:  href="lost
attr2:  weekend.html" (no value)
attr3:  title="pics from last weekend"

To avoid this, we need to deal with quotes before we tokenize the anchor.
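
To make that concrete, here is a minimal sketch of quote-aware attribute
tokenizing, assuming we already have the text between "<a" and ">".
ATTR_RE and parse_attrs are names I made up for illustration; they are
not part of the current parser classes.

```python
import re

# Hypothetical helper, not part of the Plucker parser.
# One attribute: a name, optionally followed by "=" and a value that is
# double-quoted, single-quoted, or unquoted (runs to the next space or ">").
ATTR_RE = re.compile(
    r"""([^\s=<>"'/]+)            # attribute name
        (?:\s*=\s*
           (?:"([^"]*)"           # double-quoted value, spaces allowed
             |'([^']*)'           # single-quoted value, spaces allowed
             |([^\s>]+)           # unquoted value, ends at whitespace
           )
        )?""",
    re.VERBOSE,
)


def parse_attrs(tag_body):
    """Return a list of (name, value) pairs; value is None if absent."""
    attrs = []
    for m in ATTR_RE.finditer(tag_body):
        name = m.group(1).lower()
        # At most one of the three value groups matched.
        value = next((g for g in m.groups()[1:] if g is not None), None)
        attrs.append((name, value))
    return attrs


if __name__ == "__main__":
    print(parse_attrs(' href=../images/smily.gif alt="Happy Face"'))
    print(parse_attrs(' href="lost weekend.html" title="pics from last weekend"'))
```

Both examples come out with href intact. Once the value is isolated,
converting the spaces is just value.replace(" ", "%20").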

Unfortunately, the class-based programming model (and the current
classes) means that we'll have to catch this in lots of separate places 
at separate abstraction levels.  

Alternatively, we could do a prescan, before the parser has a chance to
tokenize things, and clean it up then.  (As you probably do in Perl.)
But is it really worth rereading every document on every site to 
guard against this one problem?  Not to me, since it doesn't occur
on the sites I read.
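
For comparison, the prescan could look roughly like the sketch below.
It assumes the whole page is in memory as one string, and HREF_RE and
escape_href_spaces are hypothetical names. It only touches quoted href
values, and it will also rewrite anything that merely looks like one
(inside a script or comment, say).

```python
import re

# Hypothetical prescan pass, run over the raw page before tokenizing.
# Matches href = "..." or href = '...' and captures the quoted URL.
HREF_RE = re.compile(r"""(href\s*=\s*)(["'])(.*?)\2""",
                     re.IGNORECASE | re.DOTALL)


def escape_href_spaces(html):
    """Replace spaces inside quoted href values with %20."""
    def fix(m):
        prefix, quote, url = m.groups()
        # Drop leading/trailing whitespace, escape the spaces that remain.
        return prefix + quote + url.strip().replace(" ", "%20") + quote
    return HREF_RE.sub(fix, html)


if __name__ == "__main__":
    page = '<a href="lost weekend.html" title="pics from last weekend">pics!</a>'
    print(escape_href_spaces(page))
```

The regex itself is cheap; the cost I'm objecting to is the extra pass
over every document, whether or not it has the problem.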

> > umm ... they're not unrelated to plucker.

>       I was referring to things that contain completely unparsable
> content, like incorrectly-nested tags, those god-awful pods:// 
> and avantgo:// schemes, MS-HTML, and other obvious
> non-HTML things found in a page.

We don't need to support it fully, but we should degrade
as gracefully as possible, so users don't lose the rest of 
the site.  

For the particular examples you give,

I haven't seen avantgo://.

pods was once just a tiny subset of javascript for things like 
home and back.  We could have handled it; I ignore it since
it doesn't add value for me, and ignoring it causes no problem.

pods is now a much fuller language, effectively a library call.
This would be harder to support, but I've never seen the full
functionality in use on the sites I visit.  (I think there was
one site that used the add-to-schedule functionality.)

MS extensions can generally be treated as unknown tags.
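
As a rough illustration of "treat as unknown tags", here is a sketch
using Python's standard html.parser rather than our own parser classes.
KNOWN_TAGS is a made-up, deliberately tiny list and TolerantParser is a
hypothetical name. The point is only that an unknown tag is skipped
while the text inside it is kept.

```python
from html.parser import HTMLParser

# Illustrative subset only; a real list would be much longer.
KNOWN_TAGS = {"a", "p", "b", "i", "br", "h1", "h2", "img"}


class TolerantParser(HTMLParser):
    """Keep all text; note unknown tags instead of failing on them."""

    def __init__(self):
        super().__init__()
        self.text = []      # text chunks, in document order
        self.unknown = []   # names of tags we did not recognize

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.unknown.append(tag)

    def handle_data(self, data):
        self.text.append(data)


if __name__ == "__main__":
    parser = TolerantParser()
    parser.feed("<p>Hello <o:p>big</o:p> <marquee>world</marquee></p>")
    parser.close()
    print("".join(parser.text))
    print(parser.unknown)
```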

Improper nesting is a pain.  Ideally, we should handle it 
better than we do, but I agree that this could be a time sink.

>       I'm more of the opinion that site-specific exclusions and fixes
> should be inside a mini macro-language that Plucker can use, much like
> SiteScooper, where you have a template that governs how the 
> content will be treated.

Laurens has a good start.

> For example, theonion.com uses that pods:// scheme 
> from AvantGo, because they assume AvantGo clients 
> are the only ones to hit that page. A template for 
> theonion could easily translate that scheme into http://
> instead. Hacking this into the parser itself, for a very 
> small percentage of users who would actually use it, 
> doesn't make good use of development time or effort.

I treat it as an unrecognized tag.  I can't go "back" or 
straight to the home page from a link, but I can still
use plucker's own navigation.  This seems like a
pretty general fallback to me.
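
The same fallback works for links. A minimal sketch, assuming the
spider only knows how to fetch the schemes listed in FETCHABLE (the
set and the function name are my own guesses, not the real list):

```python
from urllib.parse import urlsplit

# Assumed set of schemes the spider can fetch; "" means a relative URL.
FETCHABLE = {"", "http", "https", "ftp", "file"}


def is_fetchable(url):
    """True if the link should be followed, False if it should be ignored."""
    return urlsplit(url).scheme.lower() in FETCHABLE


if __name__ == "__main__":
    for url in ("pods://home", "http://www.theonion.com/", "../images/smily.gif"):
        print(url, is_fetchable(url))
```

Anything that fails the check gets rendered as plain text instead of a
link, and the rest of the page is unaffected.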

> > All the user can tell is that plucker fails on certain 
> > documents, in ways that are difficult to predict
> > until you've already synced up, left the computer
> > -- and found that your document isn't there after all.

>       Indeed. Perhaps the ability to sense "bad html" 
> (upstream issue) vs. "improperly parsed html" 

I'm not sure how to do that.

I do think it would be useful to say "x pages, y kilobytes, z problems"
and to pop up a warning if there are problems, or if the size
is very different from expected.
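
Something along these lines, as a sketch only. The function name, the
expected_bytes argument (say, the size of the previous pluck), and the
50% tolerance are all assumptions of mine, not anything the distiller
or desktop does today.

```python
def summarize(pages, total_bytes, problems, expected_bytes=None, tolerance=0.5):
    """Return (summary_line, warnings) for one pluck."""
    summary = "%d pages, %d kilobytes, %d problems" % (
        pages, total_bytes // 1024, problems)
    warnings = []
    if problems:
        warnings.append("%d problems reported during parsing" % problems)
    if expected_bytes:
        # Relative difference from the expected size, e.g. the last pluck.
        change = abs(total_bytes - expected_bytes) / expected_bytes
        if change > tolerance:
            warnings.append(
                "size differs from expected by %d%%" % round(change * 100))
    return summary, warnings


if __name__ == "__main__":
    line, warnings = summarize(3, 10240, 2, expected_bytes=204800)
    print(line)
    for w in warnings:
        print("WARNING:", w)
```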

The python distiller can put out that information, but the desktop
doesn't display it or act on it.  There is nothing anywhere to pop
up warnings that the pluck should be checked before going home.
I also haven't seen a good way to see what it plucked before syncing,
though I think there may be viewers out there - just not included 
with the main package.

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
