Hi,

>>> As far as i know all URL's are long resolved before ever being passed to any
>>> filter. The parser is responsible for resolving relative to absolute.
>
> Well, my rules with explicit pattern matches for absolute URLs including the 
> protocol
> and domain failed until I made the protocol and domain optional.

Markus is definitely right. URLs are
1. transformed into global/absolute URLs by the parser
2. normalized given the defined URLNormalizers
3. filtered by all active URLFilters
There is no other way. This happens in many places (inject, generate, parse, ...),
and it must always happen in exactly the same way.
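
As a rough sketch of step 1 only (an illustration using Python's stdlib, not
Nutch's actual parser code): a relative link is resolved against the page URL
before any normalizer or filter ever sees it.

```python
from urllib.parse import urljoin

# A relative link found in parsed content is resolved against the
# URL of the page it appears on -- filters only see the absolute form.
base = "http://nutch.apache.org/dir/page.html"
absolute = urljoin(base, "../index.html")
print(absolute)  # http://nutch.apache.org/index.html
```

So a filter pattern never needs to match a relative form like ../index.html.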

> Doesn't work...
> -^(http://[^/]+)/([\w\-]+)
>
What is intended by this pattern?
It rejects every URL
- with the http protocol,
- any host,
- and a path starting with a word character or hyphen.
That's probably most http URLs (except those with an empty path, i.e. just the root):
% cat conf/regex-urlfilter.txt
-^(http://[^/]+)/([\w\-]+)
+.
% cat urls.txt
http://nutch.apache.org/
http://nutch.apache.org/a
http://nutch.apache.org/index.html
http://nutch.apache.org/dir/
% cat urls.txt | nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
+http://nutch.apache.org/
-http://nutch.apache.org/a
-http://nutch.apache.org/index.html
-http://nutch.apache.org/dir/
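
The first-match-wins semantics can be mimicked in a few lines of Python (a
sketch for illustration, not Nutch's actual RegexURLFilter implementation; the
rule list is the one from the example above):

```python
import re

# (sign, pattern) pairs in file order -- the first matching rule decides
rules = [("-", r"^(http://[^/]+)/([\w\-]+)"),
         ("+", r".")]

def filter_url(url):
    for sign, pattern in rules:
        if re.search(pattern, url):  # unanchored match, like Java's find()
            return sign
    return "-"  # no pattern matches => the URL is ignored

for url in ["http://nutch.apache.org/",
            "http://nutch.apache.org/a",
            "http://nutch.apache.org/index.html",
            "http://nutch.apache.org/dir/"]:
    print(filter_url(url) + url)
```

Only the root URL survives, because it is the only one whose path has no word
character after the first slash.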


>>>>> I have the following directives in regex-urlfilter.txt:
>>>>>
>>>>> # Accept anything
>>>>> +.
>>>>>
>>>>> # Exclude URLs under these top level paths
>>>>> -.*/example/.*
>>>>>
>>>>> # Exclude pages located immediately under root
>>>>> -^(http://)([^/]+/)([a-z]+)$
>>>>>
>>>>> #Allow exception URL located under root
>>>>> +http://my.site.com/exception
>>>>>
The order of the rules is crucial:
First,
   +.
matches anything (except the empty string, which is not a valid URL).
Since it's the first matching line, all URLs are accepted.
Second, even after that problem is fixed, there is another pattern
which masks the ones following it:
   -^(http://)([^/]+/)([a-z]+)$
hides (a ^ is added for clarity)
   +^http://my.site.com/exception

In general, regex-urlfilter.txt should contain the more specific patterns
first and the more general ones last. The regex URL filter is powerful, but
even for experienced Nutch users it takes time to configure it properly:
- use the command-line tools to test URL filters
- prepare a test set which covers all your use cases.

Cheers,
Sebastian



On 12/20/2011 08:38 PM, Matt Poff wrote:
>>> As far as i know all URL's are long resolved before ever being passed to 
>>> any 
>>> filter. The parser is responsible for resolving relative to absolute.
> 
> Well, my rules with explicit pattern matches for absolute URLs including the 
> protocol and domain failed until I made the protocol and domain optional.
> 
> Doesn't work...
> -^(http://[^/]+)/([\w\-]+)
> 
> Works...
> -^(http://[^/]+)?/([\w\-]+)
> 
> 
> 
> 
> On 21/12/2011, at 8:04 AM, Markus Jelsma wrote:
> 
>>
>>> Thanks, I was aware of these precedence rules but strayed a bit from them
>>> as I tweaked to try and get the results I wanted.
>>>
>>> What really helped was realising that URLs are not resolved into absolute
>>> links before they are tested so patterns need to match however they appear
>>> in parsed content. The hadoop.log file only displays absolute URLs which
>>> can be misleading.
>>
>> As far as i know all URL's are long resolved before ever being passed to any 
>> filter. The parser is responsible for resolving relative to absolute.
>>
>>>
>>> Second, this command line test for URL filtering saves a load of time and
>>> effort when tuning rules.
>>>
>>> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
>>>
>>> Now enter a test URL and hit Enter. StdOut will show whether the URL passes
>>> or fails current checks by displaying a plus or minus.
>>>
>>>> # Each non-comment, non-blank line contains a regular expression
>>>> # prefixed by '+' or '-'. The first matching pattern in the file
>>>> # determines whether a URL is included or ignored. If no pattern
>>>> # matches, the URL is ignored.
>>>>
>>>>
>>>>
>>>> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm crawling a single web site and am going round in circles specifying
>>>>> the correct type and order of regex expressions in regex-urlfilter.txt
>>>>> to produce the following results:
>>>>>
>>>>> * Crawl no single level paths on the site other than the exceptions
>>>>> specified * Crawl two or more level paths other than those under top
>>>>> level paths I've excluded
>>>>>
>>>>>
>>>>> I have the following directives in regex-urlfilter.txt:
>>>>>
>>>>>
>>>>> # Accept anything
>>>>> +.
>>>>>
>>>>> # Exclude URLs under these top level paths
>>>>> -.*/example/.*
>>>>>
>>>>> # Exclude pages located immediately under root
>>>>> -^(http://)([^/]+/)([a-z]+)$
>>>>>
>>>>> #Allow exception URL located under root
>>>>> +http://my.site.com/exception
>>>>>
>>>>>
>>>>> I can't get it to work. Variations are either too restrictive or ignore
>>>>> the first level exclusion. I've tested the expressions elsewhere and
>>>>> they match as required. Can anyone point me in the right direction here
>>>>> please.
>>>>>
>>>>> Thanks,
>>>>> Matt
> 
> 
> .headfirst
> WEB DEVELOPERS .ENGAGING .USEFUL .WORKS
> web:www.headfirst.co.nz
> email:[email protected]
> phone:(04) 498 5737
> mobile:022 384 3874
> 
> 
> 
> 
