On Tue, Apr 4, 2017 at 4:07 PM, Markus Jelsma
<[email protected]> wrote:
> Hello - see inline.
> Markus
>
> -----Original message-----
>> From:Stephen R Guglielmo <[email protected]>
>> Sent: Tuesday 4th April 2017 21:46
>> To: [email protected]
>> Subject: Regex URL Filter Question
>>
>> Hi list,
>>
>> I'm working on configuring Nutch with ElasticSearch to provide a
>> website search functionality. I've been reading the Nutch
>> documentation and the NutchTutorial. In the NutchTutorial on the Wiki,
>> the section "Configure Regular Expression Filters" gives the example
>> of:
>>
>> +^http://([a-z0-9]*\.)*nutch.apache.org/
>>
>> However, I am a bit confused by this. Firstly, do / not need to be
>> escaped as usual in a regular expression? As in ^http:\/\/(a-z.....
>> instead of ^http://(a-z....
>
> No, forward slashes do not need escaping.
>
>>
>> Also, I notice the first period is escaped, but the two periods in
>> "nutch.apache.org" are not escaped. Periods are normally wildcards in
>> regular expressions, hence my confusion.
>
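> The dots in the hostname should ideally be escaped, but they are not. It is
> not a problem in practice, though: an unescaped dot matches any character,
> including a literal dot, so the rule is only slightly more permissive than
> intended.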
>
>>
>> Is this an error in the documentation? Are these regexes PCRE or POSIX?
>
> Error in documentation, yes. As far as I know, Java's Pattern is neither
> PCRE nor strictly POSIX; it is its own flavor, closest to Perl 5 syntax.
> Here are the docs of the implementation used in Nutch; they explain all you
> need to know.
>
> https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
>
>> Thank you!
>> Steve
>>


Thank you for the clarification and information, Markus!
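
For anyone who wants to check the two points above against java.util.regex
directly, here is a minimal sketch. The class name, the helper method, and the
sample URLs are made up for illustration, and it assumes the filter applies
each rule with find()-style partial matching; it is not taken from the Nutch
source. It compares the tutorial rule with a stricter variant that escapes the
hostname dots, and shows that an escaped slash behaves like a plain one.

```java
import java.util.regex.Pattern;

public class UrlFilterRegexDemo {
    // Rule as printed in the tutorial: hostname dots are left unescaped.
    static final Pattern TUTORIAL =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Stricter variant (an assumption about the intent): literal dots only.
    static final Pattern ESCAPED =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/");

    // Same rule with escaped slashes; "\/" is accepted and just means "/".
    static final Pattern ESCAPED_SLASHES =
            Pattern.compile("^http:\\/\\/([a-z0-9]*\\.)*nutch\\.apache\\.org\\/");

    // Hypothetical helper: partial match anchored only by the pattern itself.
    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/docs",
            "http://nutchXapacheYorg/"   // made-up host with no dots at all
        };
        for (String url : urls) {
            System.out.println(url
                    + " tutorial=" + accepts(TUTORIAL, url)
                    + " escaped=" + accepts(ESCAPED, url)
                    + " escapedSlashes=" + accepts(ESCAPED_SLASHES, url));
        }
    }
}
```

The first two URLs are accepted by all three patterns. Only the tutorial rule
accepts the third, because its unescaped dots match any character.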
