Re: How to crawl specific pages of a website

2015-02-16 Thread Phong Nguyen
org.apache.nutch.net.URLFilterChecker -allCombined
- redirected from a file
  % bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < myurls.txt

The +/- signs indicate whether a URL is accepted or rejected.

On 02/15/2015 06:05 AM, Phong Nguyen wrote:
Thanks for your help, I tried to run

Re: How to crawl specific pages of a website

2015-02-15 Thread Phong Nguyen
/2015/02/06/difficult-to-work-with-sometimes-2/

It's possible to test the URL filters via
  % bin/nutch org.apache.nutch.net.URLFilterChecker

Sebastian

On 02/08/2015 07:18 PM, Phong Nguyen wrote:
Hi all, I want to crawl all posts of a blog except home, category, tag page of https

How to crawl specific pages of a website

2015-02-08 Thread Phong Nguyen
Hi all,

I want to crawl all the posts of the blog at https://thinkarchitect.wordpress.com, excluding the home, category, and tag pages. For example:
https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/

So I added the following rule to regex-urlfilter.txt
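
The rule itself is not included in this excerpt. For anyone who wants to experiment with the idea outside of Nutch, here is a minimal Python sketch of how "+regex / -regex" rules of this kind are commonly evaluated. It assumes the first matching rule decides and that a URL matching no rule is rejected. The two rules and the `check` helper are hypothetical illustrations, not the poster's configuration and not Nutch code, and Python's regex syntax differs from Java's in places, so treat the result as a rough guide and confirm with URLFilterChecker.

```python
import re

# Hypothetical rules in the "+regex / -regex" style of regex-urlfilter.txt.
# These are NOT the poster's rules; they only illustrate the idea of
# accepting dated post URLs and rejecting everything else.
RULES = [
    r"+^https://thinkarchitect\.wordpress\.com/\d{4}/\d{2}/\d{2}/[^/]+/$",
    r"-.",
]


def check(url, rules=RULES):
    """Return '+' if the URL is accepted, '-' if rejected.

    Assumption: the first rule whose pattern is found in the URL decides,
    and a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign
    return "-"


if __name__ == "__main__":
    for url in [
        "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/",
        "https://thinkarchitect.wordpress.com/",
        "https://thinkarchitect.wordpress.com/category/example/",
    ]:
        # Print the sign followed by the URL, one per line.
        print(check(url) + url)
```

Under these assumed rules the dated post URL is accepted, while the home page and the category page fall through to the catch-all reject rule.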