Hi Bjørn,

-----Original message-----
> From: Bjørn Axelsen <[email protected]>
> Sent: Tuesday 8th October 2013 0:24
> To: [email protected]
> Subject: Nutch crawl, custom refinement, Solr indexing
>
> Hi,
>
> I am starting up a somewhat complicated project. We need to crawl different
> sites (using Nutch) but before we push the crawled content to Solr we will
> refine some of the content. It is necessary in order to provide a really
> good Solr search experience to the end user.
>
> This will be by a set of manually edited rules that can be:
>
> 1) Per single URL:
> - some specific urls will be marked in our database as more significant,

By which criterion?

> - some urls should be ignored,

(Regex) URL filters?

> - some urls should be tagged with certain keywords
> etc.

You can use the subcollection plugin for that. It adds a value to a field (default: subcollection) based on the URL prefix. It does not support wildcards or regex (yet).

>
> 2) Pattern in URL, title or content, i.e.:
> - pages with certain regular expression patterns in URL will be auto-tagged
> with certain keywords etc.

Well yes, see the subcollection plugin. You could easily modify the plugin to support regular expressions.

>
> My plan is to set this up using Nutch 2 with a MySQL storage and run a set
> of stored database procedures between crawl and Solr indexing. Using stored
> procedures within a relational database seems somewhat more straightforward
> to me than writing a piece of code to manipulate NoSQL.
>
> However, I read that the MySQL backend is buggy so I am not sure if this is
> the best way forward.
>
> Does anybody on the list have:
> - suggestions to what the easiest way to do this would be?
> - any experiences with similar projects?
> - and is anybody able to confirm or reject that the MySQL backend for Nutch
> 2 is buggy?
>
> We are talking around 100K pages so I think performance is not crucial.

You should indeed not use the MySQL backend. The easiest to start with is Nutch 1.7, which is also the fastest and most stable. I don't think there's anything difficult in your project.

Cheers,
Markus

>
> Cheers,
> Bjørn Axelsen
> Independent web consultant, Copenhagen, Denmark
>
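To make the tagging rules discussed above a bit more concrete, here is a minimal sketch in Python of how prefix-based and regex-based URL rules could assign keywords. This is not Nutch code and not the subcollection plugin's API; the `Rule` class, the `tag_url` function, the example URLs, and the keywords are all hypothetical names made up for illustration. It only shows the matching logic one might port into a modified plugin or an indexing filter.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One manually edited tagging rule (hypothetical structure)."""
    pattern: str            # URL prefix, or a regular expression if is_regex is True
    keywords: list          # keywords to attach when the rule matches
    is_regex: bool = False

    def matches(self, url: str) -> bool:
        if self.is_regex:
            # Regex rules match anywhere in the URL.
            return re.search(self.pattern, url) is not None
        # Plain rules match on URL prefix only.
        return url.startswith(self.pattern)


def tag_url(url: str, rules: list) -> list:
    """Collect keywords from every matching rule, in rule order, without duplicates."""
    tags = []
    for rule in rules:
        if rule.matches(url):
            for keyword in rule.keywords:
                if keyword not in tags:
                    tags.append(keyword)
    return tags


# Example rules; the URLs and keywords are assumptions for illustration only.
RULES = [
    Rule("http://example.com/products/", ["product"]),
    Rule(r"/news/\d{4}/", ["news", "archive"], is_regex=True),
]

if __name__ == "__main__":
    for url in (
        "http://example.com/products/widget",
        "http://example.com/news/2013/10/item",
        "http://example.com/about",
    ):
        print(url, tag_url(url, RULES))
```

Keeping the rules as plain data means they could live in a file or a small database table and be edited by hand, independently of which storage backend the crawler itself uses.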

