Architectural advice & questions on using Solr XML DataImport Handlers (and Nutch) for a Vertical Search engine.

Arthur Yarwood Mon, 29 Jun 2015 07:22:42 -0700

Please bear with me here, I'm pretty new to Solr, with most of my DB experience being of the relational variety. I'm planning a new project, which I believe Solr (and Nutch) will solve well. Although I've installed Solr 5.2 and Nutch 1.10 (on CentOS) and tinkered about a bit, I'd be grateful for advice and tips regarding my plan.

I'm looking to build a vertical search engine to cover a very specific and narrow dataset. Sources will number in the hundreds and will mostly be managed by hand; these will be a mixture of forums and product-based e-commerce sites. For some of these I was hoping to leverage the Solr DataImportHandler system with their RSS feeds, primarily for the ease of acquiring clean, reasonably sanitised and well-structured data. For the rest, I'm going to fall back to Nutch crawling them, with some heavy regulation of URLs via regex. So to sum up: a Solr DB populated in a couple of different ways, then searched via some custom user-facing PHP web pages. Finally, a cronjob script would delete any docs older than X weeks, to keep on top of data retention.
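For the retention cronjob, I imagine something along these lines. This is only a rough sketch: the date field name `indexed_at`, the core name and the base URL are all placeholders of mine, and it assumes every doc carries a date field recording when it was indexed. It builds a JSON delete-by-query body using Solr date math (expressed in days, as I'm not aware of a week unit) and would POST it to the core's update handler.

```python
import json
from urllib.request import Request, urlopen


def build_retention_delete(weeks, date_field="indexed_at"):
    """Return a JSON delete-by-query body for docs older than `weeks` weeks.

    `date_field` is a hypothetical field name; it must be a date field
    that is populated on every document at index time.
    """
    days = weeks * 7
    query = "%s:[* TO NOW-%dDAYS]" % (date_field, days)
    return json.dumps({"delete": {"query": query}})


def send_delete(base_url, core, body):
    """POST the delete body to the core's update handler and commit."""
    url = "%s/%s/update?commit=true" % (base_url.rstrip("/"), core)
    req = Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder values: adjust the base URL, core name and retention period.
    print(build_retention_delete(6))
```

The cronjob would then call `send_delete("http://localhost:8983/solr", "mycore", build_retention_delete(6))`, with the core name swapped for the real one.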


Does that sound sensible at all?

Regarding RSS feeds:-

Many only provide a limited number of recent items; however, I'd like to retain items for many weeks. I've already discovered the clean=false param on DataImport, after wondering why old RSS items vanished!

Question 1) Is there an easy way to filter items to import in the URLDataSource entity? Or is it best to go down the route of XSLT preprocessing?

Question 2) Multiple URLDataSources: reference all in one DataImport handler? Or have multiple DataImport handlers?

What's the best approach to supplement imported data with additional static fields/keywords associated with the source feed or crawled site? e.g. all docs from sites A, B & C are of subcategory Foo. I'm guessing with RSS feeds this would be straightforward via the XSLT preprocessor. But for Nutch-submitted docs, I've no idea.

Scheduling import: Do people just cron up a curl POST request (or a shell execute of the Nutch crawl script)? Or is there a more elegant solution available?
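For reference, the cron approach I have in mind would be roughly the following. It's a minimal sketch, assuming the handler is registered at `/dataimport` in solrconfig.xml and using a placeholder core name and base URL; it just builds the full-import request with clean=false (so older items are kept) and fetches it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_import_url(base_url, core, handler="dataimport", clean=False):
    """Build a DataImportHandler full-import URL.

    `handler` is whatever path the handler is registered under in
    solrconfig.xml; "dataimport" is only an assumed default here.
    """
    params = urlencode({
        "command": "full-import",
        "clean": "true" if clean else "false",  # false keeps existing docs
        "commit": "true",
    })
    return "%s/%s/%s?%s" % (base_url.rstrip("/"), core, handler, params)


def trigger_import(url):
    """Fire the import request and return Solr's raw response text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder base URL and core name; a cron entry would run this script.
    print(build_import_url("http://localhost:8983/solr", "mycore"))
```

A crontab line would then simply run this script at whatever interval suits the feeds, calling `trigger_import` on the built URL.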


Any other more general tips and advice on the above would be greatly appreciated.

--
Arthur Yarwood
