Hi Bjørn,

-----Original message-----
> From:Bjørn Axelsen <[email protected]>
> Sent: Tuesday 8th October 2013 0:24
> To: [email protected]
> Subject: Nutch crawl, custom refinement, Solr indexing
> 
> Hi,
> 
> I am starting up a somewhat complicated project. We need to crawl different
> sites (using Nutch) but before we push the crawled content to Solr we will
> refine some of the content. It is necessary in order to provide a really
> good Solr search experience to the end user.
> 
> This will be by a set of manually edited rules that can be:
> 
> 1) Per single URL:
> - some specific urls will be marked in our database as more significant,

By which criterion?

> - some urls should be ignored,

(Regex) URL filters?
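
As a rough illustration of how regex URL filter rules behave: each rule is 
a '+' (accept) or '-' (reject) followed by a regular expression, the first 
matching rule decides, and a URL that matches no rule is ignored. The sketch 
below only mimics that behaviour in standalone Java. The class name 
UrlRuleFilter and the example.com rules are made up for illustration and are 
not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Nutch.
class UrlRuleFilter {
    // One rule: accept ('+') or reject ('-') plus a regular expression.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses lines such as "-^https?://example\.com/private/" or "+.".
    // Blank lines and lines starting with '#' are skipped.
    UrlRuleFilter(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides.
    // A URL that matches no rule is ignored (returns false).
    boolean accept(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("# skip the private area, keep the rest of the site");
        lines.add("-^https?://example\\.com/private/");
        lines.add("+^https?://example\\.com/");
        UrlRuleFilter filter = new UrlRuleFilter(lines);
        System.out.println(filter.accept("http://example.com/private/a.html"));
        System.out.println(filter.accept("http://example.com/public/a.html"));
    }
}
```

Because the first match wins, the more specific reject rules have to come 
before the general accept rule.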

> - some urls should be tagged with certain keywords
> etc.

You can use the subcollection plugin for that. It adds a value to a field 
(named subcollection by default) based on the URL prefix. It does not support 
wildcards or regular expressions (yet).

> 
> 2) Pattern in URL, title or content, i.e.:
> - pages with certain regular expression patterns in URL will be auto-tagged
> with certain keywords etc.

Well yes, see the subcollection plugin. You could easily modify the plugin to 
support regular expressions.
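
A minimal sketch of the matching logic such a modification might need, 
assuming the rules are simply pairs of a regular expression and a list of 
keywords. The class UrlKeywordTagger, its methods and the sample rules are 
hypothetical and do not exist in Nutch or in the subcollection plugin. In a 
real plugin the resulting keywords would presumably be added to a document 
field during indexing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Nutch.
class UrlKeywordTagger {
    // One rule: a regular expression and the keywords it assigns.
    private static final class Rule {
        final Pattern pattern;
        final List<String> keywords;

        Rule(Pattern pattern, List<String> keywords) {
            this.pattern = pattern;
            this.keywords = keywords;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Registers a rule: URLs in which the regex is found get these keywords.
    void addRule(String regex, String... keywords) {
        rules.add(new Rule(Pattern.compile(regex), Arrays.asList(keywords)));
    }

    // Collects the keywords of every matching rule, in rule order,
    // without duplicates. Returns an empty set if nothing matches.
    Set<String> keywordsFor(String url) {
        Set<String> result = new LinkedHashSet<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                result.addAll(rule.keywords);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        UrlKeywordTagger tagger = new UrlKeywordTagger();
        tagger.addRule("/products/", "product");
        tagger.addRule("\\.pdf$", "document", "pdf");
        System.out.println(tagger.keywordsFor("http://example.com/products/manual.pdf"));
    }
}
```

Unlike the URL filter case, all matching rules contribute here, so one page 
can receive keywords from several patterns.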

> 
> My plan is to set this up using Nutch 2 with a MySQL storage and run a set
> of stored database procedures between crawl and Solr indexing. Using stored
> procedures within a relational database seems somewhat more straightforward
> to me than writing a piece of code to manipulate NoSQL.
> 
> However, I read that the MySQL backend is buggy so I am not sure if this is
> the best way forward.
> 
> Does anybody on the list have:
> - suggestions as to what the easiest way to do this would be?
> - any experiences with similar projects?
> - and is anybody able to confirm or reject that the MySQL backend for Nutch
> 2 is buggy?
> 
> We are talking around 100K pages so I think performance is not crucial.

You should indeed not use the MySQL backend. The easiest to start with is 
Nutch 1.7, which is also the fastest and most stable. I don't think there's 
anything difficult in your project.

Cheers
Markus


> 
> Cheers,
> Bjørn Axelsen
> Independent web consultant, Copenhagen, Denmark
> 
