> > So to do it right is not a one-liner, and to do it ugly
> > requires a pre-scan of every single document, which
> > will slow the system down for little gain - unless you
> > happen to be reading sites with that particular
> > problem.

> It does? Can't you just extract all of the link tokens
> in a page found in 'a' elements, and run through the
> value in their href attribute and convert spaces found?

Example 1:

    <a href=../images/smily.gif alt="Happy Face">See the Joy!</a>

Note that the URL is not quoted. I'll agree that it should be, but
the standard doesn't require it, and it often isn't.

We need to extract the URL as "../images/smily.gif"
We do not want to extract it as "../images/smily.gif alt="

So we assume that a space ends the attribute value and anything
afterwards is the name of the next attribute.

Example 2:

    <a href="lost weekend.html" title="pics from last weekend">pics!</a>

But that means this gets parsed as

    attr1: href="lost
    attr2: weekend.html" (no value)
    attr3: title="pics from last weekend"

To avoid this, we need to deal with quotes before we tokenize the
anchor. Unfortunately, the class-based programming model (and the
current classes) means that we'll have to catch this in lots of
separate places at separate abstraction levels.

Alternatively, we could do a prescan, before the parser has a chance
to tokenize things, and clean it up then. (As you probably do in
Perl.) But is it really worth rereading every document on every site
to guard against this one problem? Not to me, since it doesn't occur
on the sites I read.

> > umm ... they're not unrelated to plucker.

> I was referring to things that contains completely unparsable
> content, like incorrectly-nested tags, that god-awful pods://
> and avantgo:// scheme, MS-HTML, and other obvious
> non-HTML things found in a page.

We don't need to support it fully, but we should degrade as
gracefully as possible, so users don't lose the rest of the site.

For the particular examples you give, I haven't seen avantgo://.
pods was once just a tiny subset of javascript for things like home
and back. We could have handled it; I ignore it since it doesn't add
value for me, and ignoring it causes no problem.
pods is now a much fuller language, effectively a library call. This
would be harder to support, but I've never seen the full
functionality in use on the sites I visit. (I think there was one
site that used the add-to-schedule functionality.)

MS extensions can generally be treated as unknown tags.

Improper nesting is a pain. Ideally, we should handle it better than
we do, but I agree that this could be a time sink.

> I'm more of the opinion that site-specific exclusions and fixes
> should be inside a mini macro-language that Plucker can use, much like
> SiteScooper, where you have a template that governs how the
> content will be treated.

Laurens has a good start.

> For example, theonion.com uses that pods:// scheme
> from AvantGo, because they assume AvantGo clients
> are the only ones to hit that page. A template for
> theonion could easily translate that scheme into http://
> instead. Hacking this into the parser itself, for a very
> small percentage of users who would actually use it,
> doesn't make good use of development time or effort.

I treat it as an unrecognized tag. I can't go "back" or straight to
the home page from a link, but I can still use plucker's own
navigation. This seems like a pretty general fallback to me.

> > All the user can tell is that plucker fails on certain
> > documents, in ways that are difficult to predict
> > until you've already synced up, left the computer
> > -- and found that your document isn't there after all.

> Indeed. Perhaps the ability to sense "bad html"
> (upstream issue) vs. "improperly parsed html"

I'm not sure how to do that. I do think it would be useful to say
"x pages, y kilobytes, z problems" and to pop up a warning if there
are problems, or if the size is very different from expected. The
python distiller can put out that information, but the desktop
doesn't display it or act on it. There is nothing anywhere to pop up
warnings that the pluck should be checked before going home.

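To make the "x pages, y kilobytes, z problems" idea concrete, here is
a rough sketch of the check the desktop could run after a pluck. All
of the names (PluckStats, summarize, needs_warning) and the 50% size
threshold are made up for illustration; this is not the distiller's
actual output format or any existing Plucker code.

```python
from dataclasses import dataclass


@dataclass
class PluckStats:
    """Hypothetical totals for one pluck; not the distiller's real format."""
    pages: int
    kilobytes: float
    problems: int


def summarize(stats):
    """Format the totals as 'x pages, y kilobytes, z problems'."""
    return (f"{stats.pages} pages, {stats.kilobytes:.0f} kilobytes, "
            f"{stats.problems} problems")


def needs_warning(stats, expected_kilobytes, tolerance=0.5):
    """True if there were problems or the size is far from what was expected.

    tolerance is an assumed threshold: 0.5 means 'off by more than 50%'.
    """
    if stats.problems > 0:
        return True
    if expected_kilobytes <= 0:
        return False  # no earlier size to compare against
    drift = abs(stats.kilobytes - expected_kilobytes) / expected_kilobytes
    return drift > tolerance
```

The expected size could simply be whatever the previous pluck of the
same channel produced; a pluck that comes back at a third of its
usual size would then be flagged even if no problems were counted.
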
I also haven't seen a good way to see what it plucked before syncing,
though I think there may be viewers out there - just not included
with the main package.
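
Coming back to the two href examples earlier in this message: the
sketch below shows what "deal with quotes before we tokenize" could
look like if it were done in one place. It is only an illustration
under my own assumptions, not how the current parser classes work;
ATTR_RE, parse_attrs and escape_spaces are hypothetical names. A
quoted value may contain spaces, while an unquoted value ends at the
first whitespace or '>'.

```python
import re

# One attribute: a name, optionally followed by = and a value that is
# double-quoted, single-quoted, or bare.
ATTR_RE = re.compile(
    r"""([^\s=<>/"']+)         # attribute name
        (?:\s*=\s*
           (?:"([^"]*)"        # double-quoted value: spaces allowed
             |'([^']*)'        # single-quoted value: spaces allowed
             |([^\s>]+)        # bare value: ends at whitespace or '>'
           )
        )?""",
    re.VERBOSE,
)


def parse_attrs(tag_body):
    """Split the text after the tag name into (name, value) pairs.

    value is None for an attribute that has no '=' part.
    """
    attrs = []
    for m in ATTR_RE.finditer(tag_body):
        name = m.group(1).lower()
        # Exactly one of the three value groups can be set; pick it.
        value = next((g for g in m.groups()[1:] if g is not None), None)
        attrs.append((name, value))
    return attrs


def escape_spaces(url):
    """Percent-encode literal spaces in an extracted URL."""
    return url.replace(" ", "%20")
```

Given the attribute text of Example 1, parse_attrs returns href as
"../images/smily.gif" and alt as "Happy Face". For Example 2 it
returns href as "lost weekend.html", which escape_spaces turns into
"lost%20weekend.html". It does nothing clever about an unterminated
quote; that case just falls through to the bare-value rule.
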

