On Mon, 2006-04-03 at 09:00 +0100, Upayavira wrote:
> David Crossley wrote:
> > Upayavira wrote:
> >> Sylvain Wallez wrote:
> >>> Carsten Ziegeler wrote:
> >>>> Sylvain Wallez wrote:
> >>>>> Hmm... the current CLI uses Cocoon's links view to crawl the website. So
> >>>>> although the new crawler can be based on servlets, it will assume these
> >>>>> servlets answer to a ?cocoon-view=links request :-)
> >>>>>     
> >>>> Hmm, I think we don't need the links view in this case anymore. A simple
> >>>> HTML crawler should be enough, as it will follow all links on the page.
> >>>> The view would only make sense in cases where you don't output HTML,
> >>>> where the usual crawler tools would not work.
> >>>>   
> >>> In the case of Forrest, you're probably right. Now the links view also
> >>> allows following links in pipelines that produce something other than
> >>> HTML, such as PDF, SVG, WML, etc.
> >>>
> >>> We have to decide if we want to lose this feature.
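
For readers who don't know it: the links view is declared in the sitemap
roughly like this (a from-memory sketch; check the real sitemap for the
exact declaration):

  <map:views>
    <map:view name="links" from-position="last">
      <!-- serializes only the links found in the pipeline output -->
      <map:serialize type="links"/>
    </map:view>
  </map:views>

The CLI then requests a page with ?cocoon-view=links and gets back just
the links found in the pipeline instead of the rendered page, which is
what lets it follow links in PDF, SVG, etc.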
> > 
> > I am not sure if we use this in Forrest. If not,
> > then we probably should be.
> > 
> >> In my view, the whole idea of crawling (i.e. gathering links from pages)
> >> is suboptimal anyway. For example, some sites don't directly link to all
> >> pages (e.g. they are accessed via javascript, or whatever), so pages get
> >> missed.
> >>
> >> Were I to code a new CLI, whilst I would support crawling, I would mainly
> >> configure the CLI to get the list of pages to visit by calling one or
> >> more URLs. Those URLs would specify the pages to generate.
> >>
> >> Thus, Forrest would transform its site.xml file into this list of pages,
> >> and drive the CLI via that.
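
Just to make this concrete, here is a rough and untested sketch of such a
transformation of site.xml into a flat page of links. The stylesheet and
the output form are made up, and it ignores the fact that real site.xml
hrefs are relative to their ancestor elements:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html"/>
    <!-- flatten every element that carries an href into one page of links
         that the CLI/crawler can use as its list of pages to generate -->
    <xsl:template match="/">
      <html>
        <body>
          <xsl:for-each select="//*[@href]">
            <a href="{@href}"><xsl:value-of select="@label"/></a>
          </xsl:for-each>
        </body>
      </html>
    </xsl:template>
  </xsl:stylesheet>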
> > 
> > This is exactly what we already do. We have a property
> > "start-uri=linkmap.html"
> > http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
> > (we actually use the corresponding XML, of course).
> > 
> > We define a few extra URIs in the Cocoon cli.xconf.
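
The relevant part of cli.xconf looks roughly like the following. This is
from memory, so element and attribute names may be slightly off and the
paths are placeholders; check the cli.xconf shipped with Cocoon for the
exact format:

  <cocoon verbose="true" follow-links="true">
    <context-dir>.</context-dir>
    <dest-dir>build/site</dest-dir>
    <!-- the start URI whose links seed the crawl, plus extra entry
         points that are not reachable from the navigation -->
    <uris>
      <uri src="linkmap.html"/>
      <uri src="somedir/index.html"/>
    </uris>
  </cocoon>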
> > 
> > There are issues, of course. Sometimes we want to
> > include directories of files that are not referenced
> > in the site.xml navigation. For my sites I just use a
> > DirectoryGenerator to build an index page which feeds
> > the crawler. Sometimes that technique is not sufficient.
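
A minimal sketch of that DirectoryGenerator trick, with made-up match
pattern, directory and stylesheet names; the point is only that the
generated index page gives the crawler something to follow:

  <map:match pattern="extra-index.html">
    <!-- list the directory and turn the listing into a page of links -->
    <map:generate type="directory" src="content/extra/"/>
    <map:transform src="resources/stylesheets/directory-to-links.xsl"/>
    <map:serialize type="html"/>
  </map:match>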
> > 
> > We also gather links from text files (e.g. CSS)
> > using Chaperon. This works nicely but introduces
> > some overhead.
> 
> This more or less confirms my suggested approach - allow crawling at the
> 'end-point' HTML, but more importantly, use a page/URL to identify the
> pages to be crawled. The interesting thing from what you say is that
> this page could itself be nothing more than HTML.

Well, yes and not really, since e.g. Chaperon works on plain text with no
markup. You need a lex-writer to generate links for the crawler.

Forrest is actually *not* aimed at HTML-only support, and one can imagine
a situation where you want your site to be txt only (a kind of book).
There you need to crawl the outcome of the lex-rewriter and follow the
links.

The current limitations of Forrest regarding the crawler are IMO not
caused by the crawler design but rather by our (as in Forrest) usage of
it.

regards
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)
