Sylvain Wallez wrote:
> Carsten Ziegeler wrote:
>> Sylvain Wallez wrote:
>>   
> 
>>> Hmm... the current CLI uses Cocoon's links view to crawl the website.
>>> So although the new crawler can be based on servlets, it will assume
>>> that these servlets answer to a ?cocoon-view=links :-)
>>>     
>> Hmm, I think we don't need the links view in this case anymore. A
>> simple HTML crawler should be enough, as it will follow all links on
>> the page. The view would only make sense when the output is not HTML,
>> where the usual crawler tools would not work.
>>   
> 
> In the case of Forrest, you're probably right. Now the links view also
> allows following links in pipelines that produce something other than
> HTML, such as PDF, SVG, or WML.
> 
> We have to decide whether we want to lose this feature.
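
As an aside, the "simple HTML crawler" Carsten describes could look
roughly like the sketch below. This is just an illustration of the idea,
not existing Cocoon code: the class name and the deliberately naive
regex-based link extraction are my own assumptions.

    import java.io.*;
    import java.net.*;
    import java.util.*;
    import java.util.regex.*;

    // Naive HTML crawler: start from one page, pull href attributes out
    // with a regex, and visit every link that stays on the same host.
    public class SimpleHtmlCrawler {
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"([^\"#]+)\"", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws IOException {
            URL start = new URL(args[0]);
            Deque<URL> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            queue.add(start);
            seen.add(start.toString());

            while (!queue.isEmpty()) {
                URL page = queue.poll();
                StringBuilder html = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(page.openStream(), "UTF-8"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        html.append(line).append('\n');
                    }
                }
                System.out.println("crawled " + page);

                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    URL link;
                    try {
                        link = new URL(page, m.group(1)); // resolve relative links
                    } catch (MalformedURLException e) {
                        continue;                         // skip unusable hrefs
                    }
                    // Stay on the starting host; enqueue each link once.
                    if (start.getHost().equals(link.getHost())
                            && seen.add(link.toString())) {
                        queue.add(link);
                    }
                }
            }
        }
    }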

In my view, the whole idea of crawling (i.e. gathering links from pages)
is suboptimal anyway. For example, some sites don't directly link to all
of their pages (e.g. some are only reachable via javascript), so those
pages get missed.

Were I to code a new CLI, I would still support crawling, but I would
mainly configure the CLI to get its list of pages to visit by calling
one or more URLs. Those URLs would specify the pages to generate.

Thus, Forrest would transform its site.xml file into this list of pages,
and drive the CLI via that.
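
To make that concrete, here is a minimal sketch of such a page-list-driven
pass. Everything in it is an assumption of mine: the class name, the
arguments, and the list format (one site-relative page URI per line, as
Forrest might generate from site.xml).

    import java.io.*;
    import java.net.URL;
    import java.nio.file.*;
    import java.util.*;

    // Page-list-driven CLI pass: fetch a URL whose response is a
    // plain-text list of page URIs, then fetch and save each page.
    public class PageListCli {
        public static void main(String[] args) throws IOException {
            URL listUrl = new URL(args[0]);     // URL returning the page list
            Path destDir = Paths.get(args[1]);  // where to write the site

            List<String> pages = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(listUrl.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().length() > 0) pages.add(line.trim());
                }
            }

            // No link extraction is needed, so non-HTML output (PDF, SVG,
            // WML, ...) is handled exactly like HTML.
            for (String page : pages) {
                URL pageUrl = new URL(listUrl, page);
                Path out = destDir.resolve(page.replaceFirst("^/", ""));
                if (out.getParent() != null) {
                    Files.createDirectories(out.getParent());
                }
                try (InputStream body = pageUrl.openStream()) {
                    Files.copy(body, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }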

Whilst gathering links from within pipelines is clever, it has always
struck me as awkward.

Regards, Upayavira
