Hi,
(I mentioned this on solr-user, but it got little response.)
There was a claim that Solr is probably not the right tool for indexing
large numbers of files (e.g. PDF files) across file systems, and that
Nutch would be more appropriate. Does everyone agree with this?
Solr aims to answer "enterprise needs" by indexing structured data for
different applications. However, I think many enterprises would like to
be able to structure the information themselves. The only requirement,
as I see it, is that documents be compatible with a defined schema.xml.
So why not extend the functionality to meet the closed-source
competition? It would be nice to index all of the following:
1) structured data
2) semi-structured data
3) unstructured data
As it stands, Solr meets demand (1) and somewhat meets demand (2), but
provides no easy or built-in way to meet demand (3). It is therefore
currently up to the application developer to build this functionality.
That is appropriate in many cases, but I would love to see Solr's appeal
increase by adding the following:
- A "pre-XML" step for parsing and extracting content into a conforming
XML file
- It should be easy to configure/program
- Support for plug-in parsers (not necessarily a la Nutch, but maybe?)
- Support for plug-in connectors (DB, filesystem, manual feeding, etc.)
In other words: A standard way of doing what many people already do.
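To make the "pre-XML" idea concrete, here is a minimal sketch of what
such a step might emit: plain text (extracted from, say, a PDF by some
plug-in parser) wrapped in Solr's XML update format. The field names
(id, title, text) are assumptions for illustration; in practice they
would have to match whatever the target schema.xml defines.

```python
# Sketch of a "pre-XML" step: wrap extracted plain text in a Solr
# <add><doc> update message. Field names are hypothetical and must
# match the fields declared in the target schema.xml.
import xml.etree.ElementTree as ET

def to_solr_add_xml(doc_id, title, body):
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("text", body)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

# Example: content handed over by a hypothetical PDF parser plug-in
print(to_solr_add_xml("doc-001", "Quarterly report",
                      "Plain text extracted from the PDF..."))
```

A plug-in connector would then only need to POST this document to the
Solr update handler; the parser and the connector stay decoupled.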
I am open to all kinds of feedback. Please let me know what you think of
this. Is it worthwhile? Is Nutch really the alternative? Shouldn't an
enterprise search platform really offer this? (Well, you certainly know
what I think.) :)
Regards,
Eivind