Hello Frank,

Answers are inline.

Frank van Lingen said:
> I recently started working with solr and find it easy to setup and
> tinker with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the following (I was not able to find
> documentation on this on the solr site):
>
> -Can I send solr an xml document with a url (html, pdf, word, ppt,
> etc..) and solr indexes it after analyzing (can it analyze pdf and other
> documents?). Solr would use some generic basic fields like
> header and content when analyzing the files.

Yes you can! Solr integrates with Tika [1], yet another Apache Lucene project, which can parse many different formats. Please see the Solr Cell wiki page for more information [2].

> -Can I send solr a site url and it indexes the whole site?

No, Solr itself can't crawl. But there is yet another fine Apache Lucene project called Nutch [3]. It offers a very convenient API and is very flexible. Since version 1.0, Nutch can feed its crawl results directly into a separate Solr index, and together with Tika you can index almost anything you want with the greatest of ease. You can find information on running Nutch with Solr on the wiki [4], and our friends at Lucid Imagination have written a very decent article on this subject [5]. You will find what you're looking for there.

> If the answer to the above is yes; are there some examples? If the
> answer is no; is there a simple (basic) extractor for html, pdf, word,
> etc.. files that would translate these into a basic xml document (e.g.
> with field names url, header and content) that solr can ingest, or
> preferably an application that does this for a whole site?
>
> The idea is to configure solr for generic indexing and search of a
> website.
>
> Frank.

Cheers

[1]: http://lucene.apache.org/tika/index.html
[2]: http://wiki.apache.org/solr/ExtractingRequestHandler
[3]: http://lucene.apache.org/nutch/
[4]: http://wiki.apache.org/nutch/RunningNutchAndSolr
[5]: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
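To make the two answers above concrete, here is a rough sketch of what each workflow looks like on the command line. These commands assume a Solr instance running on localhost:8983 with Solr Cell enabled, and a Nutch 1.0 checkout with a seed URL directory; the file name, document id, field mapping, crawl depth, and directory names are all illustrative placeholders, not something you can copy verbatim.

```shell
# 1) Single documents via Solr Cell: POST a rich document (here a PDF)
#    to the ExtractingRequestHandler, which uses Tika to extract text.
#    literal.id sets the unique key; fmap.content maps Tika's extracted
#    body text into your "content" field (assumed to exist in schema.xml).
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=content&commit=true" \
  -F "myfile=@report.pdf"

# 2) Whole sites via Nutch: crawl from seed URLs listed in urls/, then
#    push the crawled segments into Solr with the solrindex command
#    (available since Nutch 1.0). Depth and topN control crawl breadth.
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
bin/nutch solrindex http://localhost:8983/solr/ \
  crawl/crawldb crawl/linkdb crawl/segments/*
```

The first approach suits a setup where you already know the document URLs and can fetch them yourself; the second hands both fetching and link discovery to Nutch and uses Solr purely as the search backend.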