On Aug 30, 2006, at 2:42 AM, Bruno wrote:
Browsing through the message thread, I tried to find a trail addressing file system crawls. I want to implement an enterprise search over a networked filesystem, crawling all sorts of documents, such as HTML, DOC, PPT and PDF.
Nutch provides plugins enabling it to read proprietary formats.
Is there support for the same functionality in Solr?

No. Solr is strictly a search server that takes plain text for the fields of documents added to it. The client is responsible for parsing the text out of these types of documents. You could borrow the document parsing pieces from Lucene's contrib and Nutch and glue them together into a client that speaks to Solr. Or perhaps Solr isn't the right approach for your needs? It is certainly possible to add these capabilities to Solr, but it would be awkward to stream binary data inside XML documents so that Solr could parse them on the server side.
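
A rough sketch of what such a client could look like, in Python. Everything here is illustrative: the field names "id" and "text" assume a schema that defines them, the update URL is a guess at a local install, and extract_text is a hypothetical stand-in that only handles plain text and HTML. Parsers for doc, ppt and pdf would have to be plugged in where noted.

```python
import os
import urllib.request
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class _TextOnly(HTMLParser):
    """Collects character data and drops the tags.
    Script and style bodies are not filtered out in this sketch."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(path):
    """Hypothetical extractor: returns plain text, or None if unsupported.
    Only .txt and .html/.htm are handled; doc, ppt and pdf parsers
    (for example ones borrowed from other projects) would go here."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".txt", ".html", ".htm"):
        return None
    with open(path, encoding="utf-8", errors="replace") as fh:
        raw = fh.read()
    if ext == ".txt":
        return raw
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(" ".join(parser.chunks).split())


def crawl(root):
    """Walk the tree under root and yield (path, text) for supported files."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            text = extract_text(path)
            if text is not None:
                yield path, text


def build_add_xml(docs):
    """Render (id, text) pairs as a Solr <add> update message.
    The field names "id" and "text" are assumptions about the schema."""
    parts = ["<add>"]
    for doc_id, text in docs:
        parts.append(
            '<doc><field name="id">%s</field><field name="text">%s</field></doc>'
            % (escape(doc_id), escape(text))
        )
    parts.append("</add>")
    return "".join(parts)


def post_to_solr(xml, url="http://localhost:8983/solr/update"):
    """POST an update message to Solr. The default URL is an assumption."""
    req = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

With a running server, usage would be along the lines of post_to_solr(build_add_xml(crawl("/some/share"))) followed by post_to_solr("<commit/>") to make the documents searchable. Only text crosses the wire, so no binary data has to be embedded in the XML.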

        Erik

