Hi, I am starting up a somewhat complicated project. We need to crawl different sites (using Nutch), but before we push the crawled content to Solr we will refine some of it. This is necessary to provide a really good Solr search experience to the end user.
The refinement will be driven by a set of manually edited rules that can be:

1) Per single URL:
- some specific URLs will be marked in our database as more significant,
- some URLs should be ignored,
- some URLs should be tagged with certain keywords, etc.

2) Patterns in URL, title or content, e.g.:
- pages whose URLs match certain regular expression patterns will be auto-tagged with certain keywords, etc.

My plan is to set this up using Nutch 2 with MySQL storage and run a set of stored database procedures between the crawl and Solr indexing. Using stored procedures within a relational database seems more straightforward to me than writing code to manipulate a NoSQL store. However, I have read that the MySQL backend is buggy, so I am not sure this is the best way forward.

Does anybody on the list have:
- suggestions for the easiest way to do this?
- any experience with similar projects?
- and can anybody confirm or deny that the MySQL backend for Nutch 2 is buggy?

We are talking around 100K pages, so I think performance is not crucial.

Cheers,
Bjørn Axelsen
Independent web consultant, Copenhagen, Denmark
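For illustration, here is a minimal Python sketch of what I mean by the rule pass between crawl and indexing. Everything in it is hypothetical (the table layout and field names like `boost` and `tags` are made up for the example); in my actual plan the rule tables would live in MySQL and the pass would run as stored procedures:

```python
import re

# Hypothetical per-URL rules; in the real setup this would be a MySQL table.
URL_RULES = {
    "http://example.com/important": {"boost": 2.0, "tags": ["featured"]},
    "http://example.com/internal": {"ignore": True},
}

# Hypothetical pattern rules: (compiled regex on the URL, keywords to add).
PATTERN_RULES = [
    (re.compile(r"/products/"), ["product"]),
    (re.compile(r"/blog/\d{4}/"), ["blog", "archive"]),
]

def refine(doc):
    """Apply per-URL and pattern rules to a crawled doc (a dict with a
    'url' key). Returns the refined doc, or None if it should be skipped
    before Solr indexing."""
    url = doc["url"]
    rule = URL_RULES.get(url, {})
    if rule.get("ignore"):
        return None  # drop this page entirely
    doc.setdefault("tags", [])
    doc["tags"] += rule.get("tags", [])
    doc["boost"] = rule.get("boost", 1.0)
    # Auto-tag by regex match on the URL (could equally match title/content).
    for pattern, keywords in PATTERN_RULES:
        if pattern.search(url):
            doc["tags"] += [k for k in keywords if k not in doc["tags"]]
    return doc
```

The same logic translates fairly directly into SQL joins against a rules table, which is why stored procedures looked attractive to me.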

