Hi guys,
I'm a Nutch newbie trying out what is (at least I think) a simple use case:

I have a website (e.g. http://www.xyz.com) that allows searching for books. As on any typical search website, there are two kinds of pages:

 * a page that shows search results (depending on the user entered
   search terms)
 * a details page about a given book. Each details page is a permalink
   which follows a given naming convention (e.g.
   http://www.xyz.com/book/{book id})

The details page has something like a "more like this" section that contains permalinks to other (similar) books.
Now, my requirement is to index in Solr *all* details pages of this website.

If Nutch is a suitable tool for doing that (and this is actually the first question), could you please give me some hints on how to configure it?

Specifically, I tried using a seed file with just one entry

http://www.xyz.com/book/1

and then I configured my regex-urlfilter.txt

+^http://www.xyz.com/book

But it indexes only the /1 page. I expected the "more like this" section of the /1 page to act as a set of outlinks leading to further details pages (which in turn have their own MLT sections, and so on).
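
To make the behaviour I expected concrete, here is a minimal Python sketch of it. This is not Nutch code: the link graph, the `crawl` function and the `rounds` parameter are all made up for illustration, and the pattern only mimics the idea of the "+" rule above. Each round fetches the current frontier, keeps the outlinks that pass the filter, and uses them as the frontier for the next round.

```python
import re

# Hypothetical link graph: each details page lists its "more like this"
# permalinks (plus, here, one link back to a search page).
LINKS = {
    "http://www.xyz.com/book/1": [
        "http://www.xyz.com/book/2",
        "http://www.xyz.com/search?q=foo",
    ],
    "http://www.xyz.com/book/2": [
        "http://www.xyz.com/book/3",
        "http://www.xyz.com/book/1",
    ],
    "http://www.xyz.com/book/3": [],
}

# Assumption: accept only details pages, in the spirit of the "+" filter rule.
ACCEPT = re.compile(r"^http://www\.xyz\.com/book")


def crawl(seeds, rounds):
    """Return the set of URLs fetched after the given number of rounds."""
    fetched = set()
    frontier = [url for url in seeds if ACCEPT.search(url)]
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue
            fetched.add(url)
            # Outlinks that pass the filter become candidates for the next round.
            for out in LINKS.get(url, []):
                if ACCEPT.search(out) and out not in fetched:
                    next_frontier.append(out)
        frontier = next_frontier
    return fetched
```

In this toy model, `crawl(["http://www.xyz.com/book/1"], rounds=1)` fetches only /book/1, which matches what I see, while `rounds=3` also reaches /book/2 and /book/3. So I wonder whether the number of crawl rounds plays a role in my setup, but that is only a guess on my part.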

Best,
Andrea
