Hi,

I am starting up a somewhat complicated project. We need to crawl different
sites (using Nutch), but before we push the crawled content to Solr we will
refine some of the content. This is necessary in order to provide a really
good Solr search experience to the end user.

This will be done by a set of manually edited rules that can be:

1) Per single URL:
- some specific URLs will be marked in our database as more significant,
- some URLs should be ignored,
- some URLs should be tagged with certain keywords,
etc.
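To make it concrete, this is roughly the kind of per-URL rule lookup I have in mind (just a sketch; the rule table, field names and URLs are all made up for illustration):

```python
# Hypothetical per-URL rule table; in practice this would live in our database.
URL_RULES = {
    "https://example.org/about": {"boost": 2.0},          # more significant
    "https://example.org/tmp/scratch": {"ignore": True},  # skip entirely
    "https://example.org/faq": {"keywords": ["help", "support"]},
}

def refine(url, doc):
    """Apply per-URL rules to a crawled document before Solr indexing.

    Returns the (possibly modified) document, or None if the URL
    should be ignored.
    """
    rule = URL_RULES.get(url)
    if rule is None:
        return doc                      # no rule: pass through unchanged
    if rule.get("ignore"):
        return None                     # drop from the index
    doc = dict(doc)                     # don't mutate the caller's copy
    if "boost" in rule:
        doc["boost"] = rule["boost"]
    if "keywords" in rule:
        doc.setdefault("keywords", []).extend(rule["keywords"])
    return doc
```

The question is really where this kind of logic should live: in stored procedures, or in code between Nutch and Solr.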

2) Per pattern in URL, title or content, e.g.:
- pages whose URL matches certain regular expressions will be auto-tagged
with certain keywords, etc.
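The pattern rules would work along these lines (again just a sketch; the example patterns and keywords are invented):

```python
import re

# Hypothetical pattern rules: (field to match, compiled regex, keywords to add).
PATTERN_RULES = [
    ("url",     re.compile(r"/blog/\d{4}/"),      ["blog"]),
    ("title",   re.compile(r"(?i)annual report"), ["report"]),
    ("content", re.compile(r"(?i)open source"),   ["oss"]),
]

def auto_tag(doc):
    """Add keywords to a crawled document when any pattern rule matches."""
    doc = dict(doc)
    for field, pattern, keywords in PATTERN_RULES:
        if pattern.search(doc.get(field, "")):
            for kw in keywords:
                if kw not in doc.setdefault("keywords", []):
                    doc["keywords"].append(kw)
    return doc
```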

My plan is to set this up using Nutch 2 with MySQL storage and run a set
of stored database procedures between the crawl and the Solr indexing. Using
stored procedures within a relational database seems somewhat more
straightforward to me than writing custom code to manipulate a NoSQL store.

However, I have read that the MySQL backend is buggy, so I am not sure if
this is the best way forward.

Does anybody on the list have:
- suggestions as to what the easiest way to do this would be?
- any experiences with similar projects?
- and is anybody able to confirm or deny that the MySQL backend for Nutch
2 is buggy?

We are talking about around 100K pages, so I think performance is not
crucial.

Cheers,
Bjørn Axelsen
Independent web consultant, Copenhagen, Denmark
