Can this be achieved? (Was: document support for file system crawling)

2007-01-16 Thread Eivind Hasle Amundsen
First: Please pardon the cross-post to solr-user for reference. I hope
to continue this thread in solr-dev, so please reply there.



1) more documentation (and possibly some locking configuration options) on
how you can use Solr to access an index generated by the Nutch crawler (I
think Thorsten has already done this), or by Compass, or any other system
that builds a Lucene index.


Thorsten Scherler? Is this code available anywhere? It sounds very 
interesting to me. Maybe someone could elaborate on the differences 
between the indexes created by Nutch/Solr/Compass/etc., or point me in 
the direction of an answer?



2) contrib code that runs as its own process to crawl documents and
send them to a Solr server. (Maybe it parses them, or maybe it relies on
the next item...)


Do you know FAST? It uses a step-by-step (pipeline) approach in which 
all of these tasks are done, and much of it is configured in an 
easy-to-use web tool.


The point I'm trying to make is that contrib code is nice, but a 
complete package with these capabilities could broaden Solr's appeal 
somewhat.



3) Stock update plugins that can each read a raw input stream of some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them which fields in the schema each
part of their document type should go in.


Exactly, this sounds more like it. But if similar input streams can be 
handled by Nutch, what's the point in using Solr at all? The HTTP APIs? 
In other words, both Nutch and Solr seem to have functionality that 
enterprises would want, but neither gives you the total solution.


Don't get me wrong, I don't want to bloat the products, even though it 
would be nice to have a crossover solution that is easy to set up.


The architecture could look something like this:

Connector -> Parser -> DocProc -> (via schema) -> Index

Possible connectors: JDBC, filesystem, crawler, manual feed
Possible parsers: PDF, whatever

Connectors, parsers AND document processors would all be plugins. 
The DocProcs would typically be adjusted to each enterprise's needs, so 
that they fit with its schema.xml.
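
To make this a bit more concrete, here is a minimal sketch of what those
plugin interfaces might look like in Java. Every type name here is a
hypothetical illustration, not an existing Solr or Nutch API:

import java.io.InputStream;
import java.util.Iterator;
import java.util.Map;

// Pulls raw documents out of a source: JDBC, filesystem, crawler, manual feed.
interface Connector {
    Iterator<RawDocument> fetch();
}

// One raw document as delivered by a Connector.
class RawDocument {
    InputStream data;
    Map<String, String> metadata;   // e.g. path, URL, content type
}

// Turns a raw byte stream into text plus fields; one Parser per format (PDF, ...).
interface Parser {
    ParsedDocument parse(RawDocument raw) throws Exception;
}

// Intermediate form handed from a Parser to a DocProc.
class ParsedDocument {
    String text;
    Map<String, String> fields;
}

// Site-specific step that maps a parsed document onto the fields of schema.xml.
interface DocProc {
    Map<String, String> process(ParsedDocument doc);
}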


The problem is, I haven't worked enough with Solr, Nutch, Lucene etc. to 
really know all the possibilities and limitations. But I do believe that 
the outlined architecture would be flexible and answer many needs. So the 
question is:


What is Solr missing? Could parts of Nutch be used in Solr to achieve 
this? How? Have I misunderstood completely? :)


Eivind


Re: document support for file system crawling

2007-01-15 Thread Chris Hostetter

: In that respect I agree with the original posting that Solr lacks some
: of the desired functionality. One can argue that more or less random
: data should be structured by the user writing a decent application.
: However, an easier-to-use and configurable plugin architecture for
: filtering and document parsing could make Solr more attractive. I think
: that many potential users would welcome such additions.

I don't think you'll get any argument about the benefits of supporting
more plugins to handle updates -- both in terms of how the data is
expressed and how the data is fetched. In fact, you'll find some rather
involved discussions on that very topic going on on the solr-dev list
right now.

The thread you cite was specifically asking about:
  a) crawling a filesystem
  b) detecting document types and indexing text portions accordingly.

I honestly can't imagine either of those things being supported out of the
box by Solr -- there's just no reason for Solr to duplicate what Nutch
already does very well.

What I see as far more likely are:

1) more documentation (and possibly some locking configuration options) on
how you can use Solr to access an index generated by the Nutch crawler (I
think Thorsten has already done this), or by Compass, or any other system
that builds a Lucene index.

2) contrib code that runs as its own process to crawl documents and
send them to a Solr server. (Maybe it parses them, or maybe it relies on
the next item... a rough sketch follows after this list.)

3) Stock update plugins that can each read a raw input stream of some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them which fields in the schema each
part of their document type should go in.

4) easy hooks for people to write their own update plugins for less
widely used file formats.
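
As a thought experiment for item 2, here is a minimal sketch of what such a
standalone crawler process might look like. Nothing like this exists in
contrib today; the update URL and the "id"/"text" field names are
assumptions that would have to match your Solr setup and schema.xml:

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Walks a directory tree and posts every plain-text file to Solr's
// XML update interface as an <add> document.
public class FsCrawler {

    static final String SOLR_UPDATE = "http://localhost:8983/solr/update";

    public static void main(String[] args) throws Exception {
        crawl(new File(args[0]));
        post("<commit/>");  // make the new documents searchable
    }

    static void crawl(File f) throws Exception {
        File[] children = f.listFiles();
        if (children != null) {                      // it's a directory
            for (File child : children) crawl(child);
        } else if (f.getName().endsWith(".txt")) {   // item 3 would widen this
            String body = new String(Files.readAllBytes(f.toPath()),
                                     StandardCharsets.UTF_8);
            post("<add><doc>"
               + "<field name=\"id\">" + escape(f.getPath()) + "</field>"
               + "<field name=\"text\">" + escape(body) + "</field>"
               + "</doc></add>");
        }
    }

    // POST one XML message to the Solr update handler.
    static void post(String xml) throws Exception {
        HttpURLConnection con =
            (HttpURLConnection) new URL(SOLR_UPDATE).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        if (con.getResponseCode() != 200)
            throw new RuntimeException("Solr returned " + con.getResponseCode());
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

The type detection and real parsing of item 3 would slot in where the
".txt" check is.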


-Hoss



document support for file system crawling

2006-08-30 Thread Bruno

Hi there,

Browsing through the message threads I tried to find a trail addressing
file system crawling. I want to implement an enterprise search over a
networked filesystem, crawling all sorts of documents, such as HTML, DOC,
PPT and PDF. Nutch provides plugins enabling it to read proprietary
formats. Is there support for the same functionality in Solr?

Bruno



Re: document support for file system crawling

2006-08-30 Thread Erik Hatcher


On Aug 30, 2006, at 2:42 AM, Bruno wrote:
Browsing through the message threads I tried to find a trail addressing
file system crawling. I want to implement an enterprise search over a
networked filesystem, crawling all sorts of documents, such as HTML,
DOC, PPT and PDF.

Nutch provides plugins enabling it to read proprietary formats.
Is there support for the same functionality in Solr?


No. Solr is strictly a search server that takes plain text for the
fields of documents added to it. The client is responsible for parsing
the text out of these types of documents. You could borrow the document
parsing pieces from Lucene's contrib area and from Nutch and glue them
together into your own client that speaks to Solr -- or perhaps Solr
isn't the right approach for your needs? It certainly is possible to add
these capabilities to Solr, but it would be awkward to have to stream
binary data into XML documents such that Solr could parse them on the
server side.
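
To make that concrete, here is a minimal sketch of such glue code, assuming
the open-source PDFBox library for the extraction step and a Solr server on
localhost:8983; the "id" and "text" field names are examples that would
have to match your schema.xml:

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Client-side glue: parse a PDF locally, then hand Solr plain text
// over its XML update interface.
public class PdfToSolr {

    public static void main(String[] args) throws Exception {
        File pdf = new File(args[0]);

        // 1. Extract plain text with PDFBox -- the parsing stays in the client.
        String text;
        try (PDDocument doc = PDDocument.load(pdf)) {
            text = new PDFTextStripper().getText(doc);
        }

        // 2. Send the extracted text to Solr as an ordinary <add> document.
        post("<add><doc>"
           + "<field name=\"id\">" + escape(pdf.getName()) + "</field>"
           + "<field name=\"text\">" + escape(text) + "</field>"
           + "</doc></add>");
        post("<commit/>");
    }

    static void post(String xml) throws Exception {
        HttpURLConnection con = (HttpURLConnection)
            new URL("http://localhost:8983/solr/update").openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        con.getInputStream().close();   // forces the request; errors throw here
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

The same pattern works for HTML, DOC, etc.: swap a different extractor into
step 1 and leave the Solr side untouched.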


Erik