org.apache.nutch.net.URLFilterChecker
-allCombined
- redirected from a file
% bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < myurls.txt
The +/- signs indicate whether a URL is accepted or rejected.
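
To make the +/- output concrete, here is a minimal Python sketch of how rules in the style of regex-urlfilter.txt could be evaluated. It assumes first-match-wins semantics and that a URL matching no rule is rejected. The two patterns in RULES are hypothetical examples, not the rule from the original message, and the function names are made up for illustration. The real checker may format its output differently.

```python
import re

# Hypothetical rules in regex-urlfilter.txt style (assumption, not the
# poster's actual rule): a leading '+' accepts, a leading '-' rejects,
# and the rest of the line is a regular expression.
RULES = [
    # Accept dated post URLs such as /2015/02/06/some-post-title/
    r"+^https://thinkarchitect\.wordpress\.com/\d{4}/\d{2}/\d{2}/[^/]+/$",
    # Reject everything else (home, category, and tag pages included)
    r"-.",
]


def compile_rules(lines):
    """Parse rule lines into (accept, compiled_pattern) pairs."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def check(url, rules):
    """Return True if the first matching rule accepts the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


def report(urls, rules):
    """Prefix each URL with '+' (accepted) or '-' (rejected)."""
    return [("+" if check(url, rules) else "-") + url for url in urls]


if __name__ == "__main__":
    urls = [
        "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/",
        "https://thinkarchitect.wordpress.com/",
    ]
    for line in report(urls, compile_rules(RULES)):
        print(line)
```

Rule order matters in this sketch: the catch-all reject rule comes last, so a post URL is accepted by the first rule before the second one is ever consulted.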
On 02/15/2015 06:05 AM, Phong Nguyen wrote:
Thanks for your help,
I tried to run
/2015/02/06/difficult-to-work-with-sometimes-2/
It's possible to test the URL filters via
% bin/nutch org.apache.nutch.net.URLFilterChecker
Sebastian
On 02/08/2015 07:18 PM, Phong Nguyen wrote:
Hi all,
I want to crawl all posts of a blog except the home, category, and tag pages of
https
Hi all,
I want to crawl all posts of a blog except the home, category, and tag pages of
https://thinkarchitect.wordpress.com. For example:
https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/
So I added the following rule to regex-urlfilter.txt