Hello - see inline.
Markus 
 
-----Original message-----
> From:Stephen R Guglielmo <[email protected]>
> Sent: Tuesday 4th April 2017 21:46
> To: [email protected]
> Subject: Regex URL Filter Question
> 
> Hi list,
> 
> I'm working on configuring Nutch with ElasticSearch to provide a
> website search functionality. I've been reading the Nutch
> documentation and the NutchTutorial. In the NutchTutorial on the Wiki,
> the section "Configure Regular Expression Filters" gives the example
> of:
> 
> +^http://([a-z0-9]*\.)*nutch.apache.org/
> 
> However, I am a bit confused by this. Firstly, do / not need to be
> escaped as usual in a regular expression? As in ^http:\/\/(a-z.....
> instead of ^http://(a-z....

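No, forward slashes do not need escaping. In a Java regular expression, / is 
an ordinary character; it only needs escaping in languages that use it as a 
regex delimiter.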
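
As a quick check of the slash question, here is a minimal Java sketch. It
assumes the filter hands the rule text (without the leading +) to
java.util.regex.Pattern and tests it with find(); the class and method names
are made up for illustration.

```java
import java.util.regex.Pattern;

public class SlashEscapeCheck {
    // Hypothetical helper: true if the regex is found in the URL.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://nutch.apache.org/";
        // Rule from the tutorial, slashes left unescaped.
        String plain = "^http://([a-z0-9]*\\.)*nutch.apache.org/";
        // Same rule with every slash escaped: legal in Java, but redundant.
        String escaped = "^http:\\/\\/([a-z0-9]*\\.)*nutch.apache.org\\/";
        System.out.println(accepts(plain, url));
        System.out.println(accepts(escaped, url));
    }
}
```

Both calls return true, so escaping the slashes changes nothing.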

> 
> Also, I notice the first period is escaped, but the two periods in
> "nutch.apache.org" are not escaped. Periods are normally wildcards in
> regular expressions, hence my confusion.

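The dots in the hostname should ideally be escaped, but they are not. It is 
not a real problem though: an unescaped dot matches any character, including 
a literal dot, so the rule still accepts the intended host. It is just 
slightly more permissive than it needs to be.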
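
To see the effect of the unescaped dots, here is a small Java sketch. The
class name, the helper and the "nutchxapache.org" host are invented for the
example, and it assumes rules are applied with Pattern's find().

```java
import java.util.regex.Pattern;

public class DotEscapeCheck {
    // Hypothetical helper: true if the regex is found in the URL.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // Rule as printed in the tutorial: hostname dots are wildcards.
        String loose = "^http://([a-z0-9]*\\.)*nutch.apache.org/";
        // Stricter variant: hostname dots escaped, so they match only ".".
        String strict = "^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/";

        String real = "http://wiki.nutch.apache.org/";
        String lookalike = "http://nutchxapache.org/"; // made-up host

        System.out.println(accepts(loose, real));       // true
        System.out.println(accepts(strict, real));      // true
        System.out.println(accepts(loose, lookalike));  // true
        System.out.println(accepts(strict, lookalike)); // false
    }
}
```

Both rules accept the real host; only the escaped variant rejects the
lookalike.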

> 
> Is this an error in the documentation? Are these regexes PCRE or POSIX?

Error in documentation, yes. Java's Pattern is neither PCRE nor POSIX; it is 
its own flavor, with a syntax close to Perl's. Here are the docs of the 
implementation used in Nutch, which explain all you need to know.

https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

> Thank you!
> Steve
> 
