Hi,

(I mentioned this on solr-user, but it didn't get much of a response.)

There was a claim there that Solr is probably not the right tool for indexing lots of different kinds of files (e.g. PDF files) across file systems, and that Nutch would be more appropriate. Does everyone agree with that opinion?

Solr aims to be an answer to "enterprise needs" by indexing structured data for different applications. However, I think that many enterprises would also like to be able to structure that information themselves.

The only requirement, as I see it, is that documents be compatible with the defined schema.xml. So why not extend the functionality to meet the closed-source competition? It would be nice to be able to index all of the following:
1) structured data
2) semi-structured data
3) unstructured data

As it stands, Solr meets demand (1) and, to some extent, demand (2), but provides no easy or built-in way to meet demand (3). It is therefore currently up to the application developer to build this functionality. That is perfectly appropriate in many cases, but I would love to see Solr's appeal increase by including the following:

- A "pre-XML" step for parsing and extracting content into a confirming XML file
- It should be easy to configure/program
- Support for plug-in parsers (not necessarily a la Nutch, but maybe?)
- Support for plug-in connectors (DB, filesystem, manual feeding, etc.)
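
To make the first point a bit more concrete, here is a rough, hypothetical sketch (in Java) of what such a pre-XML step could look like for a single PDF file. It assumes Apache PDFBox for the text extraction (the exact package names depend on the PDFBox version), and the field names "id", "title" and "text" are just placeholders that would have to match whatever the target schema.xml defines:

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfToSolrXml {

        // Turn one PDF into a Solr <add> document (returned as a string).
        public static String toSolrAddXml(File pdf) throws Exception {
            PDDocument doc = PDDocument.load(pdf);
            try {
                String body = new PDFTextStripper().getText(doc);
                StringBuilder xml = new StringBuilder();
                xml.append("<add>\n  <doc>\n");
                xml.append(field("id", pdf.getAbsolutePath()));
                xml.append(field("title", pdf.getName()));
                xml.append(field("text", body));
                xml.append("  </doc>\n</add>\n");
                return xml.toString();
            } finally {
                doc.close();
            }
        }

        private static String field(String name, String value) {
            return "    <field name=\"" + name + "\">" + escape(value) + "</field>\n";
        }

        // Escape the characters that are special in XML text content.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        public static void main(String[] args) throws Exception {
            System.out.println(toSolrAddXml(new File(args[0])));
        }
    }

The output is just an ordinary <add><doc> document that could be POSTed to the /update handler like any hand-written one. The interesting part for Solr would be standardising the parser and connector interfaces around something like this, not the PDF handling itself.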

In other words: A standard way of doing what many people already do.

I am open to all kinds of feedback. Please let me know what you think of this. Is it worthwhile? Is Nutch really the alternative? Shouldn't an enterprise search platform really offer this? (Well, you certainly know what I think.) :)

Regards,

Eivind
