Re: protocol-foo: How to tell nutch about more URLs to fetch?

Sebastian Nagel Wed, 27 Sep 2017 02:44:09 -0700

Hi,

> It would be something like a directory listing, or no directory listing but 
> content.

Have a look at the protocol-file plugin: it wraps a directory listing into an
HTML page, similar to what the Apache web server does if there is no
index.html in a directory.
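
To make the idea concrete, here is a minimal sketch of how a protocol plugin
could turn a listing into an HTML page whose links an HTML parser can then pick
up as outlinks. This is not the actual protocol-file code: the class name
DirectoryListingSketch, the methods toHtml and escapeHtml, and the foo:// URLs
are made up for illustration, and it assumes the child names are already
URL-safe and that the base URL ends with a slash. The resulting page would
presumably be returned with the content type text/html, while real documents
are passed through with their own content type.

```java
import java.util.List;

public class DirectoryListingSketch {

    // Escape the characters that are special in HTML text and attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Render the children of a "directory" as a minimal HTML page.
    // Assumption: baseUrl ends with '/', and a child name ending in '/'
    // denotes a sub-directory (hypothetical convention for this sketch).
    static String toHtml(String baseUrl, List<String> children) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head><title>Index of ")
          .append(escapeHtml(baseUrl))
          .append("</title></head><body><ul>\n");
        for (String child : children) {
            // One link per entry; these links are what a crawler would follow next.
            sb.append("<li><a href=\"")
              .append(escapeHtml(baseUrl + child))
              .append("\">")
              .append(escapeHtml(child))
              .append("</a></li>\n");
        }
        sb.append("</ul></body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical listing for foo://something/
        System.out.println(toHtml("foo://something/", List.of("docs/", "report.pdf")));
    }
}
```

With a page like this, the protocol plugin itself never has to decide "are
there more URLs?": a listing simply becomes a page full of links, and anything
else is plain content.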

> It is possible that a protocol-plugin cannot do much without a parser-plugin?

No. Nutch is a web crawler, and crawling file systems or file servers is only
done by "emulating" web pages.

> And if I were to implement such a parser-plugin

There is already parse-html and parse-tika...

In general, if it's only about indexing a file system, it may be
easier to send the documents directly to Solr (or another indexer).
But often you have a mix of content providers (file system/server,
web site, wiki, etc.) and usually many of them already provide
a web frontend to browse (or crawl) the content.

Best,
Sebastian

On 09/27/2017 06:57 AM, Hiran CHAUDHURI wrote:
> Hi there.
>
> While I am trying to create the protocol-foo, an implementation for the
> example protocol with URLs like foo://something I see difficulty in
> distinguishing when to tell nutch to search for more URLs and when not to.
> It would be something like a directory listing, or no directory listing but
> content.
>
> It is possible that a protocol-plugin cannot do much without a
> parser-plugin? And if I were to implement such a parser-plugin, would I then
> have to implement the directory listing plus all the content parsing like
> Tika?
>
> Hiran
>
> Hiran Chaudhuri
> Principal Support Engineer
>
> Service Reliability Engineering - Custom
>
> Amadeus Data Processing GmbH
> Berghamer Strasse 6
> 85435 Erding
> T: +49-8122-43x3662
> hiran.chaudhuri@amadeus.com
> http://amadeus.com
