Re: Crawling (better: indexing) only certain URLS

Furkan KAMACI Fri, 15 Apr 2016 09:22:28 -0700

Hi Andrea,

The Regex URL Filter works like this.

This rule accepts anything else (that is, any URL not matched by an earlier rule):

+.

Let's assume that you want to crawl Nutch's website. If you wished to limit
the crawl to the nutch.apache.org domain, then the definition should be:

+^http://([a-z0-9]*\.)*nutch.apache.org/
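
A quick way to sanity-check such a pattern outside Nutch is to try it against a few sample URLs. The sketch below is only an illustration: it uses Python's re module rather than the Java regex engine Nutch runs on (the two should agree for a pattern this simple), and the helper name in_domain and the sample URLs are made up for the example.

```python
import re

# The pattern from the rule above, without the leading "+"
# (the "+" only marks the line as an "accept" rule in regex-urlfilter.txt).
DOMAIN_PATTERN = re.compile(r"^http://([a-z0-9]*\.)*nutch.apache.org/")


def in_domain(url: str) -> bool:
    """Return True if the URL matches the domain-limiting pattern."""
    return DOMAIN_PATTERN.search(url) is not None


# Hypothetical sample URLs, chosen only to exercise the pattern.
samples = [
    "http://nutch.apache.org/",
    "http://wiki.nutch.apache.org/FAQ",
    "http://www.example.com/nutch.apache.org/",
    "https://nutch.apache.org/",
]

for url in samples:
    print(url, in_domain(url))
```

Note that the pattern as written only covers http:// URLs, so the https:// sample is not matched.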

So, if the links in your "more like this" section follow this pattern:

http://www.xyz.com/book/{book_id}

Then your definition should be:

+^http://www.xyz.com/book/([0-9]*\.)*
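
To see how a rule file like this decides which URLs are kept, here is a rough emulation in Python. It is not Nutch's code: it only assumes the behaviour described in the comments of the default regex-urlfilter.txt (the first matching rule decides, and a URL that matches no rule is ignored) and that patterns are applied as an unanchored search. The function names parse_rules and accepts, and the sample URLs, are hypothetical.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter.txt style lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


RULES = parse_rules(r"""
# accept book detail pages only
+^http://www.xyz.com/book/([0-9]*\.)*
""")

print(accepts(RULES, "http://www.xyz.com/book/42"))        # True
print(accepts(RULES, "http://www.xyz.com/search?q=solr"))  # False
```

Because there is no final "+." rule in this sketch, everything outside /book/ is dropped, which is the behaviour you want if only the details pages should be fetched.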

As for your first question, please tell us the approximate size of the data
you will crawl, and whether you have any other requirements.

Kind Regards,
Furkan KAMACI


On Fri, Apr 15, 2016 at 4:17 PM, Andrea Gazzarini <[email protected]> wrote:

> Hi guys,
> just playing as a Nutch newbie in a simple (at least I think) use case:
>
> I have a website (e.g. http://www.xyz.com) that allows searching for
> books. Here, as on any plain search website, I have two kinds of pages:
>
>  * a page that shows search results (depending on the user entered
>    search terms)
>  * a details page about a given book. Each details page is a permalink
>    which follows a given naming convention (e.g.
>    http://www.xyz.com/book/{book id})
>
> The details page has something like a "more like this" section that
> contains permalinks to other (similar) books.
> Now, my requirement is to index in Solr *all* details pages of this website.
>
> If Nutch is a suitable tool for doing that (and this is actually the first
> question), could you please give me some hint about how to configure it?
>
> Specifically, I tried using a seed file with just one entry
>
> http://www.xyx.com/book/1
>
> and then I configured my regex-urlfilter.txt
>
> +^http://www.xyx.com/book
>
> But it indexes only the /1 page. I imagined that the "more like this"
> section of the /1 page would act as a set of outlinks for getting further
> details pages (where in turn there are further MLT sections, and so on)
>
> Best,
> Andrea
>
>
