Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-04 Thread Thorsten Scherler
On Mon, 2006-04-03 at 12:34 +0100, Upayavira wrote:
 Thorsten Scherler wrote:
  On Mon, 2006-04-03 at 09:00 +0100, Upayavira wrote:
  David Crossley wrote:
  Upayavira wrote:
  Sylvain Wallez wrote:
  Carsten Ziegeler wrote:
  Sylvain Wallez wrote:
  Hmm... the current CLI uses Cocoon's links view to crawl the website. So
  although the new crawler can be based on servlets, it will assume these
  servlets to answer to a ?cocoon-view=links :-)
  
  Hmm, I think we don't need the links view in this case anymore. A simple
  HTML crawler should be enough as it will follow all links on the page.
  The view would only make sense in the case where you don't output html
  where the usual crawler tools would not work.

  In the case of Forrest, you're probably right. Now the links view also
  allows following links in pipelines producing something that's not HTML,
  such as PDF, SVG, WML, etc.
 
  We have to decide if we want to lose this feature.
  I am not sure if we use this in Forrest. If not,
  then we probably should be using it.
 
  In my view, the whole idea of crawling (i.e. gathering links from pages)
  is suboptimal anyway. For example, some sites don't directly link to all
  pages (e.g. they are accessed via javascript, or whatever) so you get
  pages missed.
 
  Were I to code a new CLI, whilst I would support crawling I would mainly
  configure the CLI to get the list of pages to visit by calling one or
  more URLs. Those URLs would specify the pages to generate.
 
  Thus, Forrest would transform its site.xml file into this list of pages,
  and drive the CLI via that.
  This is what we do do. We have a property
  start-uri=linkmap.html
  http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
  (we actually use corresponding xml of course).
 
  We define a few extra URIs in the Cocoon cli.xconf
 
  There are issues of course. Sometimes we want to
  include directories of files that are not referenced
  in site.xml navigation. For my sites I just use a
  DirectoryGenerator to build an index page which feeds
  the crawler. Sometimes that technique is not sufficient.
 
  We also gather links from text files (e.g. CSS)
  using Chaperon. This works nicely but introduces
  some overhead.
  This more or less confirms my suggested approach - allow crawling at the
  'end-point' HTML, but more importantly, use a page/URL to identify the
  pages to be crawled. The interesting thing from what you say is that
  this page could itself be nothing more than HTML.
  
  Well, yes and not really, since e.g. Chaperon is text-based with no
  markup. You need a lex-writer to generate links for the crawler.
 
 Yes. You misunderstand me I think.

Yes, sorry, I did misunderstand you.

  Even if you use Chaperon etc to parse
 markup, there'd be no difficulty expressing the links that you found as
 an HTML page - one intended to be consumed by the CLI, not to be
 publicly viewed.

Well, in the case of CSS you want them publicly viewable as well, but I
got your point. ;)

  In fact, if it were written to disc, forrest would
 probably delete it afterwards.
 
  Forrest is actually *not* aimed at HTML-only support, and one can think
  of a situation where you want your site to be only txt (a kind of
  book). Here you need to crawl the lex-rewriter output and follow the
  links.
 
 Hopefully I've shown that I had understood that already :-)

yeah ;)

 
  The current limitations of Forrest regarding the crawler are IMO not
  caused by the crawler design but rather by our (as in Forrest) usage of
  it.
 
 Yep, fair enough. But if the CLI is going to survive the shift that is
 happening in Cocoon trunk, something big needs to be done by someone. It
 cannot survive in its current form as the code it uses is changing
 almost beyond recognition.
 
 Heh, perhaps the Cocoon CLI should just be a Maven plugin.

...or a Forrest plugin. ;) This would make it possible for Cocoon, Lenya
and Forrest committers to help.

Kind of http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

salu2
-- 
thorsten

Together we stand, divided we fall! 
Hey you (Pink Floyd)



Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-04 Thread Upayavira
Thorsten Scherler wrote:
 On Mon, 2006-04-03 at 12:34 +0100, Upayavira wrote:
 Thorsten Scherler wrote:
 On Mon, 2006-04-03 at 09:00 +0100, Upayavira wrote:
 David Crossley wrote:
 Upayavira wrote:
 Sylvain Wallez wrote:
 Carsten Ziegeler wrote:
 Sylvain Wallez wrote:
 Hmm... the current CLI uses Cocoon's links view to crawl the website. So
 although the new crawler can be based on servlets, it will assume these
 servlets to answer to a ?cocoon-view=links :-)
 
 Hmm, I think we don't need the links view in this case anymore. A simple
 HTML crawler should be enough as it will follow all links on the page.
 The view would only make sense in the case where you don't output html
 where the usual crawler tools would not work.
   
 In the case of Forrest, you're probably right. Now the links view also
 allows following links in pipelines producing something that's not HTML,
 such as PDF, SVG, WML, etc.

 We have to decide if we want to lose this feature.
 I am not sure if we use this in Forrest. If not,
 then we probably should be using it.

 In my view, the whole idea of crawling (i.e. gathering links from pages)
 is suboptimal anyway. For example, some sites don't directly link to all
 pages (e.g. they are accessed via javascript, or whatever) so you get
 pages missed.

 Were I to code a new CLI, whilst I would support crawling I would mainly
 configure the CLI to get the list of pages to visit by calling one or
 more URLs. Those URLs would specify the pages to generate.

 Thus, Forrest would transform its site.xml file into this list of pages,
 and drive the CLI via that.
 This is what we do do. We have a property
 start-uri=linkmap.html
 http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
 (we actually use corresponding xml of course).

 We define a few extra URIs in the Cocoon cli.xconf

 There are issues of course. Sometimes we want to
 include directories of files that are not referenced
 in site.xml navigation. For my sites I just use a
 DirectoryGenerator to build an index page which feeds
 the crawler. Sometimes that technique is not sufficient.

 We also gather links from text files (e.g. CSS)
 using Chaperon. This works nicely but introduces
 some overhead.
 This more or less confirms my suggested approach - allow crawling at the
 'end-point' HTML, but more importantly, use a page/URL to identify the
 pages to be crawled. The interesting thing from what you say is that
 this page could itself be nothing more than HTML.
 Well, yes and not really, since e.g. Chaperon is text-based with no
 markup. You need a lex-writer to generate links for the crawler.
 Yes. You misunderstand me I think.
 
 Yes, sorry, I did misunderstand you.
 
  Even if you use Chaperon etc to parse
 markup, there'd be no difficulty expressing the links that you found as
 an HTML page - one intended to be consumed by the CLI, not to be
 publicly viewed.
 
 Well, in the case of CSS you want them publicly viewable as well, but I
 got your point. ;)
 
  In fact, if it were written to disc, forrest would
 probably delete it afterwards.

 Forrest is actually *not* aimed at HTML-only support, and one can think
 of a situation where you want your site to be only txt (a kind of
 book). Here you need to crawl the lex-rewriter output and follow the
 links.
 Hopefully I've shown that I had understood that already :-)
 
 yeah ;)
 
 The current limitations of Forrest regarding the crawler are IMO not
 caused by the crawler design but rather by our (as in Forrest) usage of
 it.
 Yep, fair enough. But if the CLI is going to survive the shift that is
 happening in Cocoon trunk, something big needs to be done by someone. It
 cannot survive in its current form as the code it uses is changing
 almost beyond recognition.

 Heh, perhaps the Cocoon CLI should just be a Maven plugin.
 
 ...or a Forrest plugin. ;) This would make it possible for Cocoon, Lenya
 and Forrest committers to help.
 
 Kind of http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

Well, in the end, it is whoever implements it that decides.

Upayavira


Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-03 Thread Upayavira
David Crossley wrote:
 Upayavira wrote:
 Sylvain Wallez wrote:
 Carsten Ziegeler wrote:
 Sylvain Wallez wrote:
 Hmm... the current CLI uses Cocoon's links view to crawl the website. So
 although the new crawler can be based on servlets, it will assume these
 servlets to answer to a ?cocoon-view=links :-)
 
 Hmm, I think we don't need the links view in this case anymore. A simple
  HTML crawler should be enough as it will follow all links on the page.
 The view would only make sense in the case where you don't output html
 where the usual crawler tools would not work.
   
 In the case of Forrest, you're probably right. Now the links view also
 allows following links in pipelines producing something that's not HTML,
 such as PDF, SVG, WML, etc.

 We have to decide if we want to lose this feature.
 
 I am not sure if we use this in Forrest. If not,
 then we probably should be using it.
 
 In my view, the whole idea of crawling (i.e. gathering links from pages)
 is suboptimal anyway. For example, some sites don't directly link to all
 pages (e.g. they are accessed via javascript, or whatever) so you get
 pages missed.

 Were I to code a new CLI, whilst I would support crawling I would mainly
 configure the CLI to get the list of pages to visit by calling one or
 more URLs. Those URLs would specify the pages to generate.

 Thus, Forrest would transform its site.xml file into this list of pages,
 and drive the CLI via that.
 
 This is what we do do. We have a property
 start-uri=linkmap.html
 http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
 (we actually use corresponding xml of course).
 
 We define a few extra URIs in the Cocoon cli.xconf
 
 There are issues of course. Sometimes we want to
 include directories of files that are not referenced
 in site.xml navigation. For my sites I just use a
 DirectoryGenerator to build an index page which feeds
 the crawler. Sometimes that technique is not sufficient.
 
 We also gather links from text files (e.g. CSS)
 using Chaperon. This works nicely but introduces
 some overhead.

This more or less confirms my suggested approach - allow crawling at the
'end-point' HTML, but more importantly, use a page/URL to identify the
pages to be crawled. The interesting thing from what you say is that
this page could itself be nothing more than HTML.

Regards, Upayavira


Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-03 Thread Thorsten Scherler
On Mon, 2006-04-03 at 09:00 +0100, Upayavira wrote:
 David Crossley wrote:
  Upayavira wrote:
  Sylvain Wallez wrote:
  Carsten Ziegeler wrote:
  Sylvain Wallez wrote:
  Hmm... the current CLI uses Cocoon's links view to crawl the website. So
  although the new crawler can be based on servlets, it will assume these
  servlets to answer to a ?cocoon-view=links :-)
  
  Hmm, I think we don't need the links view in this case anymore. A simple
   HTML crawler should be enough as it will follow all links on the page.
  The view would only make sense in the case where you don't output html
  where the usual crawler tools would not work.

  In the case of Forrest, you're probably right. Now the links view also
  allows following links in pipelines producing something that's not HTML,
  such as PDF, SVG, WML, etc.
 
  We have to decide if we want to lose this feature.
  
  I am not sure if we use this in Forrest. If not,
  then we probably should be using it.
  
  In my view, the whole idea of crawling (i.e. gathering links from pages)
  is suboptimal anyway. For example, some sites don't directly link to all
  pages (e.g. they are accessed via javascript, or whatever) so you get
  pages missed.
 
  Were I to code a new CLI, whilst I would support crawling I would mainly
  configure the CLI to get the list of pages to visit by calling one or
  more URLs. Those URLs would specify the pages to generate.
 
  Thus, Forrest would transform its site.xml file into this list of pages,
  and drive the CLI via that.
  
  This is what we do do. We have a property
  start-uri=linkmap.html
  http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
  (we actually use corresponding xml of course).
  
  We define a few extra URIs in the Cocoon cli.xconf
  
  There are issues of course. Sometimes we want to
  include directories of files that are not referenced
  in site.xml navigation. For my sites I just use a
  DirectoryGenerator to build an index page which feeds
  the crawler. Sometimes that technique is not sufficient.
  
  We also gather links from text files (e.g. CSS)
  using Chaperon. This works nicely but introduces
  some overhead.
 
 This more or less confirms my suggested approach - allow crawling at the
 'end-point' HTML, but more importantly, use a page/URL to identify the
 pages to be crawled. The interesting thing from what you say is that
 this page could itself be nothing more than HTML.

Well, yes and not really, since e.g. Chaperon is text-based with no
markup. You need a lex-writer to generate links for the crawler.

Forrest is actually *not* aimed at HTML-only support, and one can think
of a situation where you want your site to be only txt (a kind of
book). Here you need to crawl the lex-rewriter output and follow the
links.

The current limitations of Forrest regarding the crawler are IMO not
caused by the crawler design but rather by our (as in Forrest) usage of
it.

salu2
-- 
thorsten

Together we stand, divided we fall! 
Hey you (Pink Floyd)



Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-03 Thread Upayavira
Thorsten Scherler wrote:
 On Mon, 2006-04-03 at 09:00 +0100, Upayavira wrote:
 David Crossley wrote:
 Upayavira wrote:
 Sylvain Wallez wrote:
 Carsten Ziegeler wrote:
 Sylvain Wallez wrote:
 Hmm... the current CLI uses Cocoon's links view to crawl the website. So
 although the new crawler can be based on servlets, it will assume these
 servlets to answer to a ?cocoon-view=links :-)
 
 Hmm, I think we don't need the links view in this case anymore. A simple
  HTML crawler should be enough as it will follow all links on the page.
 The view would only make sense in the case where you don't output html
 where the usual crawler tools would not work.
   
 In the case of Forrest, you're probably right. Now the links view also
 allows following links in pipelines producing something that's not HTML,
 such as PDF, SVG, WML, etc.

 We have to decide if we want to lose this feature.
 I am not sure if we use this in Forrest. If not,
 then we probably should be using it.

 In my view, the whole idea of crawling (i.e. gathering links from pages)
 is suboptimal anyway. For example, some sites don't directly link to all
 pages (e.g. they are accessed via javascript, or whatever) so you get
 pages missed.

 Were I to code a new CLI, whilst I would support crawling I would mainly
 configure the CLI to get the list of pages to visit by calling one or
 more URLs. Those URLs would specify the pages to generate.

 Thus, Forrest would transform its site.xml file into this list of pages,
 and drive the CLI via that.
 This is what we do do. We have a property
 start-uri=linkmap.html
 http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
 (we actually use corresponding xml of course).

 We define a few extra URIs in the Cocoon cli.xconf

 There are issues of course. Sometimes we want to
 include directories of files that are not referenced
 in site.xml navigation. For my sites I just use a
 DirectoryGenerator to build an index page which feeds
 the crawler. Sometimes that technique is not sufficient.

 We also gather links from text files (e.g. CSS)
 using Chaperon. This works nicely but introduces
 some overhead.
 This more or less confirms my suggested approach - allow crawling at the
 'end-point' HTML, but more importantly, use a page/URL to identify the
 pages to be crawled. The interesting thing from what you say is that
 this page could itself be nothing more than HTML.
 
 Well, yes and not really, since e.g. Chaperon is text-based with no
 markup. You need a lex-writer to generate links for the crawler.

Yes. You misunderstand me, I think. Even if you use Chaperon etc. to parse
markup, there'd be no difficulty expressing the links that you found as
an HTML page - one intended to be consumed by the CLI, not to be
publicly viewed. In fact, if it were written to disc, Forrest would
probably delete it afterwards.
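To make that concrete, a minimal sketch of writing whatever links were found
out as a throwaway HTML page that only the CLI would read. The class and
method names here are invented for illustration; this is not Forrest code.

  import java.io.PrintWriter;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.Collection;

  // Hypothetical helper: turn links found by any means (Chaperon, regexes,
  // whatever) into a plain HTML page the CLI can crawl; the caller deletes
  // the file once the run is over.
  public class LinkPageWriter {
      public static Path write(Collection<String> links) throws Exception {
          Path page = Files.createTempFile("cli-links-", ".html");
          try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(page))) {
              out.println("<html><body>");
              for (String link : links) {
                  out.println("<a href=\"" + link + "\">" + link + "</a>");
              }
              out.println("</body></html>");
          }
          return page;
      }
  }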

 Forrest is actually *not* aimed at HTML-only support, and one can think
 of a situation where you want your site to be only txt (a kind of
 book). Here you need to crawl the lex-rewriter output and follow the
 links.

Hopefully I've shown that I had understood that already :-)

 The current limitations of Forrest regarding the crawler are IMO not
 caused by the crawler design but rather by our (as in Forrest) usage of
 it.

Yep, fair enough. But if the CLI is going to survive the shift that is
happening in Cocoon trunk, something big needs to be done by someone. It
cannot survive in its current form as the code it uses is changing
almost beyond recognition.

Heh, perhaps the Cocoon CLI should just be a Maven plugin.

Upayavira


Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-02 Thread Sylvain Wallez
Upayavira wrote:

 Ah, I wasn't getting that subtle. I was simply saying that I can agree
 with using the servlet API for _all_ environments. The CLI becomes
 nothing more than a custom servlet container that uses a servlet to
 generate its pages.

 In fact, having said that, it becomes yet another tool that is actually
 independent of Cocoon - it could be used to crawl pages generated by
 _any_ servlet, not just the Cocoon one.
   

Hmm... the current CLI uses Cocoon's links view to crawl the website. So
although the new crawler can be based on servlets, it will assume these
servlets to answer to a ?cocoon-view=links :-)
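For reference, a minimal sketch of what consuming that view looks like from
the crawler's side. It assumes the links view serializes one URI per line as
plain text, which is how the existing CLI reads it; the class below is
illustrative only, not the actual CLI code.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.ArrayList;
  import java.util.List;

  // Illustrative client for the links view: request the page with
  // ?cocoon-view=links and collect the URIs it reports, one per line.
  public class LinksViewClient {
      public static List<String> linksOf(String pageUrl) throws Exception {
          URL url = new URL(pageUrl + "?cocoon-view=links");
          List<String> links = new ArrayList<String>();
          try (BufferedReader in = new BufferedReader(
                  new InputStreamReader(url.openStream(), "UTF-8"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  if (line.trim().length() > 0) {
                      links.add(line.trim());
                  }
              }
          }
          return links;
      }
  }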

Sylvain

-- 
Sylvain Wallez
http://bluxte.net
Apache Software Foundation Member



Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-02 Thread Carsten Ziegeler
Sylvain Wallez wrote:
 Upayavira wrote:
 
 Ah, I wasn't getting that subtle. I was simply saying that I can agree
 with using the servlet API for _all_ environments. The CLI becomes
 nothing more than a custom servlet container that uses a servlet to
 generate its pages.

 In fact, having said that, it becomes yet another tool that is actually
 independent of Cocoon - it could be used to crawl pages generated by
 _any_ servlet, not just the Cocoon one.
   
 
 Hmm... the current CLI uses Cocoon's links view to crawl the website. So
 although the new crawler can be based on servlets, it will assume these
 servlets to answer to a ?cocoon-view=links :-)
 
Hmm, I think we don't need the links view in this case anymore. A simple
 HTML crawler should be enough as it will follow all links on the page.
The view would only make sense in the case where you don't output html
where the usual crawler tools would not work.
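As a rough illustration of what such a plain HTML crawler amounts to (a
sketch only: it pulls href attributes out with a regex rather than a real
HTML parser, and only follows links under the start URL's base):

  import java.io.InputStream;
  import java.net.URI;
  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.HashSet;
  import java.util.Scanner;
  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Sketch of a plain HTML crawler: start from one page, follow every href
  // it can see under the same base, and never visit a page twice.
  public class SimpleHtmlCrawler {
      private static final Pattern HREF = Pattern.compile("href=[\"']([^\"'#]+)[\"']");

      public static Set<String> crawl(String startUrl) throws Exception {
          String base = startUrl.substring(0, startUrl.lastIndexOf('/') + 1);
          Set<String> visited = new HashSet<String>();
          Deque<String> queue = new ArrayDeque<String>();
          queue.add(startUrl);
          while (!queue.isEmpty()) {
              String page = queue.poll();
              if (!visited.add(page)) {
                  continue;
              }
              Matcher m = HREF.matcher(fetch(page));
              while (m.find()) {
                  // Resolve relative links against the current page.
                  String link = URI.create(page).resolve(m.group(1)).toString();
                  if (link.startsWith(base) && !visited.contains(link)) {
                      queue.add(link);
                  }
              }
          }
          return visited;
      }

      private static String fetch(String url) throws Exception {
          try (InputStream in = new java.net.URL(url).openStream();
               Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
              return s.hasNext() ? s.next() : "";
          }
      }
  }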

Carsten

-- 
Carsten Ziegeler - Open Source Group, SN AG
http://www.s-und-n.de
http://www.osoco.org/weblogs/rael/


Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-02 Thread Sylvain Wallez
Carsten Ziegeler wrote:
 Sylvain Wallez wrote:
   

 Hmm... the current CLI uses Cocoon's links view to crawl the website. So
 although the new crawler can be based on servlets, it will assume these
 servlets to answer to a ?cocoon-view=links :-)
 
 Hmm, I think we don't need the links view in this case anymore. A simple
  HTML crawler should be enough as it will follow all links on the page.
 The view would only make sense in the case where you don't output html
 where the usual crawler tools would not work.
   

In the case of Forrest, you're probably right. Now the links view also
 allows following links in pipelines producing something that's not HTML,
such as PDF, SVG, WML, etc.

We have to decide if we want to lose this feature.

Sylvain

-- 
Sylvain Wallez
http://bluxte.net
Apache Software Foundation Member



Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-02 Thread Carsten Ziegeler
Sylvain Wallez wrote:
 Carsten Ziegeler wrote:
 In the case of Forrest, you're probably right. Now the links view also
 allows following links in pipelines producing something that's not HTML,
 such as PDF, SVG, WML, etc.
Yepp.

 
 We have to decide if we want to lose this feature.
Right. So the question is, if someone is using this feature :)

Carsten
-- 
Carsten Ziegeler - Open Source Group, SN AG
http://www.s-und-n.de
http://www.osoco.org/weblogs/rael/


Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-02 Thread Upayavira
Sylvain Wallez wrote:
 Carsten Ziegeler wrote:
 Sylvain Wallez wrote:
   
 
 Hmm... the current CLI uses Cocoon's links view to crawl the website. So
 although the new crawler can be based on servlets, it will assume these
 servlets to answer to a ?cocoon-view=links :-)
 
 Hmm, I think we don't need the links view in this case anymore. A simple
  HTML crawler should be enough as it will follow all links on the page.
 The view would only make sense in the case where you don't output html
 where the usual crawler tools would not work.
   
 
 In the case of Forrest, you're probably right. Now the links view also
 allows following links in pipelines producing something that's not HTML,
 such as PDF, SVG, WML, etc.
 
 We have to decide if we want to lose this feature.

In my view, the whole idea of crawling (i.e. gathering links from pages)
is suboptimal anyway. For example, some sites don't directly link to all
pages (e.g. they are accessed via javascript, or whatever) so you get
pages missed.

Were I to code a new CLI, whilst I would support crawling I would mainly
configure the CLI to get the list of pages to visit by calling one or
more URLs. Those URLs would specify the pages to generate.

Thus, Forrest would transform its site.xml file into this list of pages,
and drive the CLI via that.
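A sketch of that configuration style, with invented names: the CLI first
fetches one or more "page list" URLs (e.g. something rendered from site.xml,
one URI per line) and then generates exactly those pages, with crawling left
as an optional extra rather than the main mechanism.

  import java.io.BufferedReader;
  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical list-driven CLI core: seed URLs name the pages to generate.
  public class PageListDriver {
      public static void generateAll(List<String> pageListUrls) throws Exception {
          List<String> pages = new ArrayList<String>();
          for (String listUrl : pageListUrls) {
              try (BufferedReader in = new BufferedReader(
                      new InputStreamReader(new URL(listUrl).openStream(), "UTF-8"))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      if (line.trim().length() > 0) {
                          pages.add(line.trim());
                      }
                  }
              }
          }
          for (String page : pages) {
              generate(page);
          }
      }

      private static void generate(String page) throws Exception {
          // A real CLI would write the response into the destination directory;
          // here we only fetch it to show the flow.
          try (InputStream in = new URL(page).openStream()) {
              long total = 0;
              byte[] buf = new byte[8192];
              int n;
              while ((n = in.read(buf)) != -1) {
                  total += n;
              }
              System.out.println(page + " -> " + total + " bytes");
          }
      }
  }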

Whilst gathering links from within pipelines is clever, it always struck
me as awkward at the same time.

Regards, Upayavira



Re: A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-02 Thread David Crossley
Upayavira wrote:
 Sylvain Wallez wrote:
  Carsten Ziegeler wrote:
  Sylvain Wallez wrote:
  
  Hmm... the current CLI uses Cocoon's links view to crawl the website. So
  although the new crawler can be based on servlets, it will assume these
  servlets to answer to a ?cocoon-view=links :-)
  
  Hmm, I think we don't need the links view in this case anymore. A simple
   HTML crawler should be enough as it will follow all links on the page.
  The view would only make sense in the case where you don't output html
  where the usual crawler tools would not work.

  
  In the case of Forrest, you're probably right. Now the links view also
  allows following links in pipelines producing something that's not HTML,
  such as PDF, SVG, WML, etc.
  
  We have to decide if we want to lose this feature.

I am not sure if we use this in Forrest. If not,
then we probably should be using it.

 In my view, the whole idea of crawling (i.e. gathering links from pages)
 is suboptimal anyway. For example, some sites don't directly link to all
 pages (e.g. they are accessed via javascript, or whatever) so you get
 pages missed.
 
 Were I to code a new CLI, whilst I would support crawling I would mainly
 configure the CLI to get the list of pages to visit by calling one or
 more URLs. Those URLs would specify the pages to generate.
 
 Thus, Forrest would transform its site.xml file into this list of pages,
 and drive the CLI via that.

This is what we do do. We have a property
start-uri=linkmap.html
http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
(we actually use corresponding xml of course).

We define a few extra URIs in the Cocoon cli.xconf

There are issues of course. Sometimes we want to
include directories of files that are not referenced
in site.xml navigation. For my sites I just use a
DirectoryGenerator to build an index page which feeds
the crawler. Sometimes that technique is not sufficient.

We also gather links from text files (e.g. CSS)
using Chaperon. This works nicely but introduces
some overhead.
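For illustration only: the kind of extraction that step performs can be
approximated without Chaperon by pulling url(...) references straight out of
the CSS text. The real setup uses a Chaperon grammar, so the regex-based
class below is just a stand-in to show what gets fed to the crawler.

  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Regex-based stand-in for the Chaperon step: collect url(...) references
  // from a CSS file so the images/fonts it points at also reach the crawler.
  public class CssLinkExtractor {
      private static final Pattern URL_REF =
              Pattern.compile("url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)");

      public static List<String> extract(String cssFile) throws Exception {
          String css = new String(Files.readAllBytes(Paths.get(cssFile)), "UTF-8");
          List<String> links = new ArrayList<String>();
          Matcher m = URL_REF.matcher(css);
          while (m.find()) {
              links.add(m.group(1));
          }
          return links;
      }
  }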

-David

 Whilst gathering links from within pipelines is clever, it always struck
 me as awkward at the same time.
 
 Regards, Upayavira


A new CLI (was Re: [RT] The environment abstraction, part II)

2006-04-01 Thread Upayavira
Carsten Ziegeler wrote:
 Upayavira wrote:
 David Crossley wrote:
 Carsten Ziegeler wrote:
 I can't speak for Daniel, but my idea/suggestion was to forget about the
 different environments and let Cocoon always run in a servlet container.
 The CLI would then be kind of an HTTP client which starts up Jetty and
 then generates the site using HTTP requests. This would simplify some
 things in Cocoon; the question is whether this would make the life of
 Forrest too hard?
 Thanks to you all for the followup. I don't have a
 ready answer yet. Will make sure that the other
 Forrest people are aware.
 In the end, it doesn't really matter that much, and will be up to
 whoever volunteers to implement the new CLI.
 
 It depends a little bit on how we see things. My opinion :) is to remove
 the environment abstraction completely and simply use the servlet
 environment, while others might think that we should only base our
 environment abstraction on the servlet API but allow Cocoon to run in a
 different environment which provides *some* features of the servlet
 environment but not all. The difference might be subtle, but it's not the same.

Ah, I wasn't getting that subtle. I was simply saying that I can agree
with using the servlet API for _all_ environments. The CLI becomes
nothing more than a custom servlet container that uses a servlet to
generate its pages.

In fact, having said that, it becomes yet another tool that is actually
independent of Cocoon - it could be used to crawl pages generated by
_any_ servlet, not just the Cocoon one.
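A sketch of that idea using a later embedded Jetty API (Jetty 9-style,
org.eclipse.jetty with javax.servlet, so not the exact libraries available in
2006), with a placeholder servlet standing in for Cocoon: start a tiny
container, mount any servlet, and generate the site by issuing HTTP requests
against it.

  import java.io.IOException;
  import javax.servlet.http.HttpServlet;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;
  import org.eclipse.jetty.server.Server;
  import org.eclipse.jetty.servlet.ServletContextHandler;

  // Sketch of the "CLI as a small servlet container" idea. The servlet below
  // is a stand-in; any servlet, Cocoon's included, could be mounted instead.
  public class ServletCli {

      public static class DemoServlet extends HttpServlet {
          @Override
          protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                  throws IOException {
              resp.setContentType("text/html");
              resp.getWriter().println(
                      "<html><body><a href=\"other.html\">other</a></body></html>");
          }
      }

      public static void main(String[] args) throws Exception {
          Server server = new Server(8888);
          ServletContextHandler context = new ServletContextHandler();
          context.setContextPath("/");
          context.addServlet(DemoServlet.class, "/*");
          server.setHandler(context);
          server.start();
          // ... crawl http://localhost:8888/ here and write the responses to disk ...
          server.stop();
      }
  }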

Regards, Upayavira