Hi,
(I mentioned this on solr-user, but it got little response.)
There was a claim that Solr is probably not the right tool for indexing
large numbers of files (e.g. PDF files) across file systems, and that
Nutch would be more appropriate. Does everyone agree with this?
Solr aims to answer "enterprise needs" by indexing structured data for
different applications. However, I think many enterprises would like to
be able to structure the information themselves. The only requirement,
as I see it, is that documents be compatible with a defined schema.xml.
So why not extend the functionality to meet the closed-source
competition? It would be nice to index all of the following:
1) structured data
2) semi-structured data
3) unstructured data
As it stands, Solr meets demand (1) and somewhat meets demand (2), but
provides no easy or built-in way to meet demand (3). It is therefore
currently up to the application developer to build this functionality.
That is appropriate in many cases, but I would love to see Solr's appeal
increase by adding the following:
- A "pre-XML" step for parsing and extracting content into a conforming
XML file
- It should be easy to configure/program
- Support for plug-in parsers (not necessarily a la Nutch, but maybe?)
- Support for plug-in connectors (DB, filesystem, manual feeding, etc.)
In other words: A standard way of doing what many people already do.
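To make the "pre-XML" idea concrete, here is a minimal sketch of what
such a step might emit: plain text (extracted from, say, a PDF by some
plug-in parser) wrapped in Solr's XML update format. The field names
(id, title, text) are assumptions for illustration; in practice they
would have to match whatever the target schema.xml defines.

```python
# Sketch of a "pre-XML" step: wrap extracted plain text in a Solr
# <add><doc> update message. Field names are hypothetical and must
# match the fields declared in the target schema.xml.
import xml.etree.ElementTree as ET

def to_solr_add_xml(doc_id, title, body):
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("text", body)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

# Example: content handed over by a hypothetical PDF parser plug-in
print(to_solr_add_xml("doc-001", "Quarterly report",
                      "Plain text extracted from the PDF..."))
```

A plug-in connector would then only need to POST this document to the
Solr update handler; the parser and the connector stay decoupled.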
I am open to all kinds of feedback. Please let me know what you think of
this. Is it worthwhile? Is Nutch really the alternative? Shouldn't an
enterprise search platform really offer this? (Well, you certainly know
what I think.) :)
Regards,
Eivind