Crawling (better: indexing) only certain URLS

Andrea Gazzarini Fri, 15 Apr 2016 06:18:09 -0700

Hi guys,
just playing as a Nutch newbie in a simple (at least I think) use case:

I have a website (e.g. http://www.xyz.com) that allows searching for books. Here, as on any plain search website, I have two kinds of pages:


 * a page that shows search results (depending on the user entered
   search terms)
 * a details page about a given book. Each details page is a permalink
   which follows a given naming convention (e.g.
   http://www.xyz.com/book/{book id})

The details page has something like a "more like this" section that contains permalinks to other (similar) books.

Now, my requirement is to index in Solr *all* details pages of this website.

If Nutch is a suitable tool for doing that (and this is actually the first question), could you please give me some hints about how to configure it?


Specifically, I tried using a seed file with just one entry

http://www.xyx.com/book/1

and then I configured my regex-urlfilter.txt

+^http://www.xyx.com/book

But it indexes only the /1 page. I imagined that the "more like this" section of the /1 page would act as a set of outlinks for reaching further details pages (where in turn there are further MLT sections, and so on).
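
To make that expectation concrete, here is a minimal Python sketch (not Nutch code) of the behaviour described above. It assumes that filter rules are checked in order, that the first matching rule decides, and that a URL matching no rule is dropped. The link graph, the accepts() and crawl() helpers, and the idea of a fixed number of rounds are all hypothetical and only stand in for the "more like this" outlinks.

```python
import re

# Rules in the style of regex-urlfilter.txt: a sign ('+' or '-') and a regex.
# Assumption: the first matching rule decides; no match means the URL is dropped.
RULES = [("+", re.compile(r"^http://www.xyx.com/book"))]


def accepts(url, rules=RULES):
    """Return True if the first rule matching the URL is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


# Hypothetical link graph: each details page lists its "more like this" outlinks.
LINKS = {
    "http://www.xyx.com/book/1": [
        "http://www.xyx.com/book/2",
        "http://www.xyx.com/book/3",
        "http://www.xyx.com/search?q=foo",  # search results page, filtered out
    ],
    "http://www.xyx.com/book/2": ["http://www.xyx.com/book/4"],
    "http://www.xyx.com/book/3": ["http://www.xyx.com/book/1"],
    "http://www.xyx.com/book/4": [],
}


def crawl(seeds, rounds):
    """Fetch round by round; outlinks found in one round are fetched in the next."""
    fetched = []
    frontier = [u for u in seeds if accepts(u)]
    seen = set(frontier)
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            for out in LINKS.get(url, []):
                if out not in seen and accepts(out):
                    seen.add(out)
                    next_frontier.append(out)
        frontier = next_frontier
    return fetched


if __name__ == "__main__":
    print(crawl(["http://www.xyx.com/book/1"], rounds=1))
    print(crawl(["http://www.xyx.com/book/1"], rounds=3))
```

In this toy model a single round fetches only the seed page, while further rounds pick up the pages reached through the outlinks. Whether that is what happens in my actual setup is exactly what I am unsure about.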


Best,
Andrea
