If there is no restriction on the number at the end of the URL, you can
simply use the rule below.
(Note that this rule must appear above the one that filters out URLs
containing a "?" character.)

+http://www.xyz.com/\?page=

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
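
To see why the order matters, here is a minimal Python sketch of how I
understand the regex URL filter to behave: every rule line starts with "+"
(accept) or "-" (reject), the rest of the line is a regex searched anywhere
in the URL, and the first rule that matches decides. It is not Nutch code;
parse_rules, accepts, RULES and STRICT_RULES are hypothetical names, the
final "+." catch-all is assumed from the stock file, and Python's re module
stands in for Java's regex engine. STRICT_RULES is my own variant that limits
the page number to 1 to 100.

```python
import re

# Rules in the order they would appear in conf/regex-urlfilter.txt.
# The trailing "+." catch-all is an assumption based on the stock file.
RULES = [
    "+http://www.xyz.com/\\?page=",
    "# skip URLs containing certain characters as probable queries, etc.",
    "-[?*!@=]",
    "+.",
]

# Hypothetical stricter variant: only page numbers 1 to 100 are accepted.
STRICT_RULES = [
    "+^http://www\\.xyz\\.com/\\?page=([1-9][0-9]?|100)$",
    "-[?*!@=]",
    "+.",
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # A line such as ^[0-9]{1,45}$ has no +/- prefix, so it is
            # rejected here before any URL is looked at.
            raise ValueError(f"Invalid first character: {line}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.xyz.com/?page=1", "http://www.xyz.com/?other=1"):
        print(url, accepts(rules, url))
```

With the rules in this order the ?page= URLs are accepted before the "-[?*!@=]"
rule is reached. If the two lines are swapped, the "?" in the URL matches the
reject rule first and the page is never fetched.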




On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
[email protected]> wrote:

> Hi all,
>
> I have been trying to fetch a query similar to:
>
> http://www.xyz.com/?page=1
>
> But where the number can vary from 1 to 100. Inside the first page
> there are links to the next ones. So I updated the
> conf/regex-urlfilter file and added:
>
> ^[0-9]{1,45}$
>
> When I do this, the generate job fails saying that it is "Invalid
> first character". I have tried generating with topN 5 and depth 5 and
> trying to fetch more urls but that does not work.
>
> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
> Thanks in advance!
>
>
> Renato M.
>
