Hi Sebastian,
Thanks for your support,
My correct regex would be
+^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)/$
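To see why the earlier rule ending in /*/$ lets the day archive through but not the post, while the rule ending in (.*)/$ does the opposite, here is a small sketch. It uses Python's re module as a stand-in for the Java regex engine behind regex-urlfilter.txt, on the assumption that these simple patterns behave the same in both. The helper name accepts is made up for illustration.

```python
import re

# The two rule patterns from this thread, without the leading '+'.
OLD_RULE = r"^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/*/$"
NEW_RULE = r"^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)/$"

DAY_ARCHIVE = "https://thinkarchitect.wordpress.com/2015/02/06/"
POST = "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/"


def accepts(pattern: str, url: str) -> bool:
    """Hypothetical helper: True if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


# '/*' means "zero or more slashes", so the old rule only matches URLs
# that end right after the day. '(.*)' matches the post slug instead.
for name, pattern in (("old", OLD_RULE), ("new", NEW_RULE)):
    for url in (DAY_ARCHIVE, POST):
        print(name, accepts(pattern, url), url)
```

In this sketch the old rule accepts only the day archive URL and the new rule accepts only the post URL, because the new pattern needs a second slash after the day.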
However, I want to crawl all post pages
Hi,
the URLFilterChecker reads from stdin so URLs have to be either
- entered in command line
- passed by pipe
% echo myurl | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
- redirected from a file
% bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < myurls.txt
The
Thanks for your help,
I tried to run *bin/nutch org.apache.nutch.net.URLFilterChecker
-allCombined* to test my regex-urlfilter.txt, but it runs for a long time
without producing any results.
What should I do? Is there any way to test my regex in Nutch?
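As a quick offline check that does not need a running Nutch, the rule file can be approximated in a few lines of Python. This is only a sketch under assumptions: that rules are applied top to bottom, that the first matching pattern decides, that a URL matching no rule is rejected, and that Python's re behaves like the Java regex engine for these patterns. The names parse_rules and check and the trailing "-." rule are made up for illustration and are not part of Nutch.

```python
import re
from typing import Iterable, List, Tuple


def parse_rules(lines: Iterable[str]) -> List[Tuple[bool, "re.Pattern[str]"]]:
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(rules: List[Tuple[bool, "re.Pattern[str]"]], url: str) -> str:
    """Return '+url' if accepted, '-url' if rejected (first match wins)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return ("+" if accept else "-") + url
    return "-" + url  # assumption: no matching rule means rejected


# Illustrative rule set: the post rule from this thread plus a catch-all reject.
RULES_TEXT = """
# accept dated post pages only
+^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/(.*)/$
# reject everything else (assumed for this example)
-.
"""

rules = parse_rules(RULES_TEXT.splitlines())
urls = [
    "https://thinkarchitect.wordpress.com/",
    "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/",
]
for url in urls:
    print(check(rules, url))
```

The URLs are kept in a list here on purpose. A version that reads them from sys.stdin would sit and wait for input if nothing is typed or piped in.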
On Wed, Feb 11, 2015 at 3:55 AM, Sebastian Nagel
Hi,
So I added the following rule to regex-urlfilter.txt:
+^https://thinkarchitect.wordpress.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/*/$
This regex allows
https://thinkarchitect.wordpress.com/2015/02/06/
but does not allow
Hi all,
I want to crawl all posts of the blog https://thinkarchitect.wordpress.com,
excluding its home, category, and tag pages. For example:
https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/
So I added the following rule to regex-urlfilter.txt