On Tue, Apr 4, 2017 at 4:07 PM, Markus Jelsma <[email protected]> wrote:
> Hello - see inline.
> Markus
>
> -----Original message-----
>> From:Stephen R Guglielmo <[email protected]>
>> Sent: Tuesday 4th April 2017 21:46
>> To: [email protected]
>> Subject: Regex URL Filter Question
>>
>> Hi list,
>>
>> I'm working on configuring Nutch with ElasticSearch to provide a
>> website search functionality. I've been reading the Nutch
>> documentation and the NutchTutorial. In the NutchTutorial on the Wiki,
>> the section "Configure Regular Expression Filters" gives the example
>> of:
>>
>> +^http://([a-z0-9]*\.)*nutch.apache.org/
>>
>> However, I am a bit confused by this. Firstly, do / not need to be
>> escaped as usual in a regular expression? As in ^http:\/\/(a-z.....
>> instead of ^http://(a-z....
>
> No, forward slashes do not need escaping.
>
>>
>> Also, I notice the first period is escaped, but the two periods in
>> "nutch.apache.org" are not escaped. Periods are normally wildcards in
>> regular expressions, hence my confusion.
>
> The dots in the hostname should be escaped, ideally, but they are not. It is
> not a problem though, a wildcard also matches a dot.
>
>>
>> Is this an error in the documentation? Are these regexes PCRE or POSIX?
>
> Error in documentation, yes. I think Java's Pattern is POSIX but I am
> actually not sure. Here are the docs of the implementation used in Nutch, it
> explains all you need to know.
>
> https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
>> Thank you!
>> Steve
>>
Thank you for the clarification and information, Markus!
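
For anyone finding this thread later, here is a minimal sketch of the two points above, using plain java.util.regex. It is not Nutch's actual filter code: the class name, the helper method and the sample URLs are made up for illustration, and it assumes the filter does a partial match (find()) on the URL. The leading "+" in the tutorial line is the include marker of the filter file, not part of the regex, so it is left out. As far as I can tell from the Pattern docs, Java's flavor is Perl-like rather than strictly PCRE or POSIX.

```java
import java.util.regex.Pattern;

public class RegexUrlFilterDemo {
    // The tutorial pattern as written: dots in the hostname are NOT escaped.
    static final Pattern TUTORIAL =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Same pattern with the forward slashes escaped; Java treats "\/" as a literal "/".
    static final Pattern ESCAPED_SLASHES =
        Pattern.compile("^http:\\/\\/([a-z0-9]*\\.)*nutch.apache.org/");

    // Stricter variant with the hostname dots escaped.
    static final Pattern ESCAPED_DOTS =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/");

    // Assumption: the filter accepts a URL when the pattern is found anywhere in it.
    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/index.html",
            "http://nutchXapacheYorg/",   // hypothetical look-alike host
            "https://nutch.apache.org/"
        };
        for (String url : urls) {
            System.out.println(url
                + " tutorial=" + accepts(TUTORIAL, url)
                + " escapedSlashes=" + accepts(ESCAPED_SLASHES, url)
                + " escapedDots=" + accepts(ESCAPED_DOTS, url));
        }
    }
}
```

The look-alike host is accepted by the tutorial pattern but rejected once the dots are escaped, while escaping the slashes changes nothing. So the unescaped dots are harmless for the real hostname, just a bit more permissive than intended.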

