Hi Adamantios,

On Sat, Jan 24, 2015 at 2:05 PM, <[email protected]> wrote:
>
> How to tell Apache Nutch 2.3 to go through all http://URL/?pg={X} pages,
> with {X} going from 1 to 348,

^([1-9]|[1-9][0-9]|[12][0-9][0-9]|3[0-3][0-9]|34[0-8])$

Please try the above, substituting your {X} variable with the proposed regex. I've not validated this in Nutch itself, so apologies if it is wrong. Your rules will need to go into regex-urlfilter.txt, assuming that this is the plugin you are using. Please make sure to comment out or remove the rule that disqualifies URLs containing a '?' as probable queries.

> collect all http://URL/view/{Y}/ links, with
> {Y} an arbitrary long number,

Similar to the above, right?

> and finally dump all these links into a
> single file?
>
>
You can use the readdb tool as follows:

bin/nutch readdb -dump /path/to/outputDir -regex url_regex_goes_here [-content] [-headers] [-links] [-text]

This should achieve what you require. Please see our documentation on command line tools:
http://wiki.apache.org/nutch/bin/nutch%20readdb

Thanks
Lewis
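
P.S. A page-range pattern like this is easy to get wrong, so here is a minimal Python sketch for checking one before it goes into regex-urlfilter.txt. The PAGE pattern and the helper name accepted_pages are my own assumptions for illustration (they are not part of Nutch), and http://URL/ is just the placeholder from your question. Python's re syntax is close to, but not identical to, the Java regex syntax Nutch uses, so treat this as a sanity check only.

```python
import re

# Assumed pattern: page numbers 1..348 with no leading zeros.
PAGE = r"([1-9]|[1-9][0-9]|[12][0-9][0-9]|3[0-3][0-9]|34[0-8])"

# "http://URL/" is the placeholder host from the question, not a real site.
RULE = re.compile(r"^http://URL/\?pg=" + PAGE + r"$")


def accepted_pages(limit=1000):
    """Return every page number in 0..limit whose URL matches RULE."""
    return [n for n in range(limit + 1) if RULE.match(f"http://URL/?pg={n}")]


if __name__ == "__main__":
    pages = accepted_pages()
    # Print the first accepted page, the last accepted page, and the count.
    print(pages[0], pages[-1], len(pages))
```

If the three numbers printed are 1, 348 and 348, the range is covered with nothing extra. In regex-urlfilter.txt the accept rule would then be the same pattern prefixed with '+'.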

