Please bear with me here; I'm pretty new to Solr, with most of my DB experience being of the relational variety. I'm planning a new project which I believe Solr (and Nutch) will solve well. Although I've installed Solr 5.2 and Nutch 1.10 (on CentOS) and tinkered about a bit, I'd be grateful for advice and tips regarding my plan.

I'm looking to build a vertical search engine to cover a very specific and narrow dataset. Sources will number in the hundreds and will mostly be managed by hand; they will be a mixture of forums and product-based e-commerce sites. For some of these I was hoping to use the Solr DataImportHandler with their RSS feeds, primarily for the ease of acquiring clean, reasonably sanitised and well-structured data. For the rest, I'm going to fall back to crawling them with Nutch, with the URLs heavily restricted via regex filters. So to sum up: a Solr index populated in a couple of different ways, then searched via some custom user-facing PHP web pages. Finally, a cron job script would delete any docs older than X weeks, to keep on top of data retention.
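To make the retention step concrete, here is a minimal sketch of what such a cron job might run. It is only an illustration under assumptions: the core name `vertical` and the date field `indexed_at` are hypothetical (any date field recording when a doc was added would do), and the age is given in days because Solr date math has no week unit as far as I know. The request itself is an ordinary JSON delete-by-query posted to the core's /update handler.

```python
import json
import urllib.request


def build_delete_request(solr_url, core, date_field, max_age_days):
    """Return (url, body) for a delete-by-query that removes docs
    whose date_field is older than max_age_days.

    core and date_field are placeholders; they must match the real
    core name and a date field defined in the schema.
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    url = f"{solr_url.rstrip('/')}/{core}/update?commit=true"
    # Solr date math: everything up to "now minus N days".
    query = f"{date_field}:[* TO NOW-{max_age_days}DAYS]"
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    return url, body


def purge_old_docs(solr_url, core, date_field, max_age_days):
    """Send the delete-by-query to Solr and return the HTTP status."""
    url, body = build_delete_request(solr_url, core, date_field, max_age_days)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status
```

A weekly or nightly cron entry could then call something like purge_old_docs("http://localhost:8983/solr", "vertical", "indexed_at", 42) to keep roughly six weeks of data.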

Does that sound sensible at all?

Regarding RSS feeds:-
Many only provide a limited number of recent items, but I'd like to retain items for many weeks. I've already discovered the clean=false param on DataImport, after wondering why old RSS items vanished!
Question 1) Is there an easy way to filter the items to import in the URLDataSource entity? Or is it best to go down the route of XSLT preprocessing?
Question 2) Multiple URLDataSources: should I reference them all in one DataImport handler, or have multiple DataImport handlers?

What's the best approach to supplement imported data with additional static fields/keywords associated with the source feed or crawled site? E.g. all docs from sites A, B & C are of subcategory Foo. I'm guessing that with RSS feeds this would be straightforward via the XSLT preprocessor. But for docs submitted by Nutch, I've no idea.

Scheduling imports: do people just cron a curl POST request (or a shell invocation of the Nutch crawl script)? Or is there a more elegant solution available?
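For reference, the cron-driven approach I have in mind amounts to something like the sketch below. It assumes the DataImportHandler is registered at /dataimport in solrconfig.xml and that the core is called `vertical`; both names are placeholders. The `handler` argument is there so the same script could hit several handlers if the feeds end up split across more than one. It passes clean=false so that older RSS items are not wiped on each run.

```python
import urllib.parse
import urllib.request


def build_import_url(solr_url, core, handler="/dataimport", clean=False):
    """Build the URL that triggers a DataImportHandler full-import.

    core and handler are placeholders; handler must match the
    requestHandler name configured in solrconfig.xml.
    """
    params = urllib.parse.urlencode({
        "command": "full-import",
        # clean=false keeps previously imported docs in the index.
        "clean": str(clean).lower(),
        "commit": "true",
    })
    return f"{solr_url.rstrip('/')}/{core}{handler}?{params}"


def trigger_import(url):
    """Fire the import request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.status
```

A crontab entry would then run a small wrapper that calls trigger_import(build_import_url("http://localhost:8983/solr", "vertical")) at whatever interval suits the feeds.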

Any other more general tips and advice on the above would be greatly appreciated.

--
Arthur Yarwood
