Re: How to crawl specific pages of a website

2015-02-16 Thread Phong Nguyen
Hi Sebastian, Thanks for your support. My correct regex would be +^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)/$ However, I want to crawl all post pages
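For anyone who wants to sanity-check the corrected rule outside Nutch, here is a minimal sketch. It uses Python's `re` module as a stand-in for the filter's regex engine, which is an assumption (Nutch uses Java regexes, but this pattern only uses syntax common to both). Only the first sample URL comes from this thread; the home, category, tag, and day-archive URLs are assumed examples.

```python
import re

# Rule from the message above (no whitespace inside the pattern),
# minus the leading "+" that marks it as an accept rule.
RULE = re.compile(
    r"^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)/$"
)

# Only the first URL appears in the thread; the others are assumed
# examples of pages that should NOT be crawled.
samples = [
    "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/",
    "https://thinkarchitect.wordpress.com/",
    "https://thinkarchitect.wordpress.com/category/example/",
    "https://thinkarchitect.wordpress.com/tag/example/",
    "https://thinkarchitect.wordpress.com/2015/02/06/",
]

# search() looks for the pattern anywhere, but the ^ and $ anchors
# force it to cover the whole URL.
matched = [url for url in samples if RULE.search(url)]

for url in samples:
    print("+" if url in matched else "-", url)
```

Note that `(.*)` also spans slashes, so deeper paths under a date would match as well; `([^/]+)` would be a stricter alternative if that matters.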

Re: How to crawl specific pages of a website

2015-02-16 Thread Sebastian Nagel
Hi, the URLFilterChecker reads from stdin, so URLs have to be either
- entered on the command line
- passed by pipe
  % echo myurl | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
- redirected from a file
  % bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < myurls.txt
The
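If running the checker is inconvenient, the rule file can also be approximated in a few lines of Python. This is only a sketch of the behaviour as I understand it, not Nutch code: rules are `+regex` or `-regex` lines, the first rule that matches a URL decides, and a URL that matches no rule is rejected. The function names (`load_rules`, `check`, `run`) and the rule text are made up for illustration; the rule itself is the corrected one from this thread.

```python
import re

# Assumed rule file content, using the corrected rule from this thread.
RULES_TEXT = """
# accept dated post pages only
+^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)/$
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def check(url, rules):
    """Return True if the first rule that matches the URL is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treated as rejected


def run(lines, rules):
    """Return one '+url' or '-url' string per non-empty input line."""
    out = []
    for line in lines:
        url = line.strip()
        if url:
            out.append(("+" if check(url, rules) else "-") + url)
    return out


rules = load_rules(RULES_TEXT)
results = run(
    [
        "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/\n",
        "https://thinkarchitect.wordpress.com/2015/02/06/\n",
    ],
    rules,
)
print("\n".join(results))
```

To feed URLs by pipe or file redirect as with the real checker, pass `sys.stdin` as the `lines` argument of `run`.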

Re: How to crawl specific pages of a website

2015-02-15 Thread Phong Nguyen
Thanks for your help. I tried to run *bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined* to test my regex-urlfilter.txt, but it runs for a long time without any results. What should I do? Is there any way to test my regex in Nutch? On Wed, Feb 11, 2015 at 3:55 AM, Sebastian Nagel

Re: How to crawl specific pages of a website

2015-02-10 Thread Sebastian Nagel
Hi,
> So I add the following rule in regex-urlfilter.txt
> +^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/*/$
This regex allows https://thinkarchitect.wordpress.com/2015/02/06/ but does not allow
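The behaviour of that rule can be reproduced outside Nutch. The sketch below uses Python's `re.search` as a stand-in for the filter's matching, which is an assumption (Nutch uses Java regexes, but this pattern only uses syntax common to both). The point it illustrates: in `/*/$` the `*` applies to the slash before it, so it means "zero or more slashes" rather than "any path segment".

```python
import re

# Rule quoted above, without the leading "+" (the accept marker).
RULE = r"^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/*/$"

# Both URLs appear in this thread.
day_archive = "https://thinkarchitect.wordpress.com/2015/02/06/"
post = "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/"

for url in (day_archive, post):
    # After the date, "/*/" can only consume slashes, so any post slug
    # following the date prevents "$" from matching.
    verdict = "allowed" if re.search(RULE, url) else "not matched"
    print(f"{verdict}: {url}")
```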

How to crawl specific pages of a website

2015-02-08 Thread Phong Nguyen
Hi all, I want to crawl all posts of a blog, excluding the home, category, and tag pages of https://thinkarchitect.wordpress.com. For example: https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/ So I added the following rule in regex-urlfilter.txt