Hello Frank,

Answers are inline.

Frank van Lingen said:
> I recently started working with solr and find it easy to setup and
> tinker with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the following (I was not able to find
> documentation on this on the solr site):
>
> -Can I send solr an xml document with a url (html, pdf, word, ppt,
> etc..) and solr indexes it after analyzing (can it analyze pdf and other
> documents?). Solr would use some generic basic fields like
> header and content when analyzing the files.

Yes you can! Solr integrates with Tika [1], yet another Apache Lucene project, which can parse many different formats. Please see the Solr Cell wiki page for more information [2].

> -Can I send solr a site url and it indexes the whole site?

No, Solr itself can't crawl. But there is yet another fine Apache Lucene project called Nutch [3]. It offers a very convenient API and is very flexible. Since version 1.0, Nutch can feed its crawl results directly into a separate Solr index, and together with Tika you can index almost anything you want with the greatest of ease. You can find information on running Nutch with Solr on the wiki [4], and our friends at Lucid Imagination have written a very decent article on this subject [5]. You will find what you're looking for there.

> If the answer to the above is yes; are there some examples? If the
> answer is no; is there a simple (basic) extractor for html, pdf, word,
> etc.. files that would translate these into a basic xml document (e.g.
> with field names url, header and content) that solr can ingest, or
> preferably an application that does this for a whole site?
>
> The idea is to configure solr for generic indexing and search of a
> website.
>
> Frank.

Cheers

[1]: http://lucene.apache.org/tika/index.html
[2]: http://wiki.apache.org/solr/ExtractingRequestHandler
[3]: http://lucene.apache.org/nutch/
[4]: http://wiki.apache.org/nutch/RunningNutchAndSolr
[5]: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
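To make the two answers above concrete, here is a rough sketch of what each workflow looks like on the command line. These commands assume a Solr instance running on localhost:8983 with Solr Cell enabled, and a Nutch 1.0 checkout with a seed URL directory; the file name, document id, field mapping, crawl depth, and directory names are all illustrative placeholders, not something you can copy verbatim.

```shell
# 1) Single documents via Solr Cell: POST a rich document (here a PDF)
#    to the ExtractingRequestHandler, which uses Tika to extract text.
#    literal.id sets the unique key; fmap.content maps Tika's extracted
#    body text into your "content" field (assumed to exist in schema.xml).
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=content&commit=true" \
  -F "myfile=@report.pdf"

# 2) Whole sites via Nutch: crawl from seed URLs listed in urls/, then
#    push the crawled segments into Solr with the solrindex command
#    (available since Nutch 1.0). Depth and topN control crawl breadth.
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
bin/nutch solrindex http://localhost:8983/solr/ \
  crawl/crawldb crawl/linkdb crawl/segments/*
```

The first approach suits a setup where you already know the document URLs and can fetch them yourself; the second hands both fetching and link discovery to Nutch and uses Solr purely as the search backend.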