Hi guys,
I'm a Nutch newbie experimenting with what I think is a simple use case:
I have a website (e.g. http://www.xyz.com) that allows searching for
books. As on any typical search website, there are two kinds of pages:
* a page that shows search results (depending on the user entered
search terms)
* a details page about a given book. Each details page is a permalink
which follows a given naming convention (e.g.
http://www.xyz.com/book/{book id})
The details page has something like a "more like this" section that
contains permalinks to other (similar) books.
Now, my requirement is to index in Solr *all* details pages of that website.
If Nutch is a suitable tool for doing that (and this is actually the
first question), could you please give me some hints on how to
configure it?
Specifically, I tried using a seed file with just one entry
http://www.xyz.com/book/1
and then I configured my regex-urlfilter.txt with
+^http://www.xyz.com/book
But it indexes only the /1 page. I expected the "more like this"
section of the /1 page to act as a set of outlinks leading to
further details pages (which in turn have their own MLT sections,
and so on).
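
To make that expectation concrete, here is a rough Python sketch of the
behaviour I have in mind. It is not Nutch code: the SITE link graph, the
crawl() function and its "rounds" parameter are all made up for
illustration, and the pattern only roughly mirrors my accept rule.

```python
import re

# Hypothetical link graph standing in for the site:
# each page maps to the outlinks found on it (e.g. its "more like this" section).
SITE = {
    "http://www.xyz.com/book/1": [
        "http://www.xyz.com/book/2",
        "http://www.xyz.com/book/3",
        "http://www.xyz.com/search?q=java",  # a search results page, not wanted
    ],
    "http://www.xyz.com/book/2": [
        "http://www.xyz.com/book/1",
        "http://www.xyz.com/book/4",
    ],
    "http://www.xyz.com/book/3": [],
    "http://www.xyz.com/book/4": ["http://www.xyz.com/book/2"],
}

# Assumed equivalent of the accept rule: only details pages pass.
ACCEPT = re.compile(r"^http://www\.xyz\.com/book/")


def crawl(seeds, rounds):
    """Fetch pages round by round; each round follows the outlinks
    discovered in the previous one, keeping only URLs that pass ACCEPT."""
    fetched = set()
    frontier = [url for url in seeds if ACCEPT.match(url)]
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue
            fetched.add(url)
            for link in SITE.get(url, []):
                if ACCEPT.match(link) and link not in fetched:
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched


if __name__ == "__main__":
    seed = ["http://www.xyz.com/book/1"]
    for n in (1, 2, 3):
        print(n, sorted(crawl(seed, n)))
```

In this toy model a single round fetches only the seed page, and the
other details pages only show up when further rounds are run. Whether
that is what is happening in my Nutch setup is exactly what I am unsure
about.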
Best,
Andrea