: In that respect I agree with the original posting that Solr lacks
: some desired functionality. One can argue that more or less random
: data should be structured by the user writing a decent application.
: However, an easier-to-use and configurable plugin architecture for
: different kinds of filtering and document parsing could make Solr
: more attractive. I think that many potential users would welcome
: such additions.

i don't think you'll get any argument about the benefits of supporting
more plugins to handle updates - both in terms of how the data is
expressed, and in terms of how the data is fetched. in fact, you'll find
some rather involved discussions on that very topic going on on the
solr-dev list right now.

the thread you cite was specifically asking about:
  a) crawling a filesystem
  b) detecting document types and indexing their text portions accordingly.

I honestly can't imagine either of those things being supported out of the
box by Solr -- there's just no reason for Solr to duplicate what Nutch
already does very well.

What i see as far more likely are:

1) more documentation (and possibly some locking configuration options) on
how you can use Solr to access an index generated by the Nutch crawler (i
think Thorsten has already done this), or by Compass, or any other system
that builds a Lucene index.
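As a rough illustration of what item 1 implies: Solr reads its index from whatever directory solrconfig.xml points at, so sharing an index built by another system is largely a matter of pointing `<dataDir>` at it (the path below is a made-up example), plus whatever lock settings keep the two processes from stepping on each other.

```xml
<!-- in solrconfig.xml: point Solr at index data built by another
     process (the path here is a hypothetical example). Solr typically
     looks for the actual Lucene index under <dataDir>/index. -->
<dataDir>/var/nutch/crawl/solr-data</dataDir>
```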

2) "contrib" code that runs as it's own process to crawl documents and
send them to a Solr server. (mybe it parses them, or maybe it relies on
the next item...)
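A minimal sketch of what such a standalone "contrib" process might look like: walk a directory tree and post each file to Solr's XML update handler. The field names ("id", "text") and the server URL are assumptions here -- they'd have to match your actual schema.xml and deployment.

```python
# Hypothetical standalone crawler that feeds documents to a Solr server.
import os
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed location

def doc_to_update_xml(doc_id, text):
    """Render one document as a Solr <add><doc> update message."""
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="text">{escape(text)}</field>'
        "</doc></add>"
    )

def post_to_solr(xml):
    """POST an update message to Solr (assumes a running server)."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    return urllib.request.urlopen(req).read()

def crawl_and_index(root):
    """Walk a filesystem tree and index each file as plain text."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                post_to_solr(doc_to_update_xml(path, f.read()))
```

In practice such a process would delegate the parsing step to per-format plugins rather than treating everything as plain text.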

3) Stock "update" plugins that can each read a raw inputstreams of a some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them them what fields in the schema each
part of their document type should go in.
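To make item 3 concrete, here's a hedged sketch of what such a configurable plugin dispatch could look like: each parser turns a raw stream into named parts, and a per-type mapping says which schema field each part goes in. All the names here (the parser, the mapping keys, the field names) are hypothetical illustrations, not an existing Solr API.

```python
# Hypothetical update-plugin dispatch: content type -> (parser, field map).

def parse_html(data):
    # A real plugin would use an actual HTML parser; for illustration we
    # just treat the whole stream as the body text.
    return {"title": "", "body": data.decode("utf-8", "replace")}

# configuration: document type -> (parser, {document part -> schema field})
PLUGINS = {
    "text/html": (parse_html, {"title": "title_t", "body": "text"}),
}

def to_solr_fields(content_type, data):
    """Run the configured parser, then remap its parts to schema fields."""
    parser, field_map = PLUGINS[content_type]
    parts = parser(data)
    return {field_map[part]: value
            for part, value in parts.items()
            if part in field_map}
```

The point of the field map is that the same stock parser can serve very different schemas just by changing configuration, without touching plugin code.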

4) easy hooks for people to write their own update plugins for file
formats that aren't widely used.


-Hoss
