Hello - see inline.

Markus

-----Original message-----
> From: Stephen R Guglielmo <[email protected]>
> Sent: Tuesday 4th April 2017 21:46
> To: [email protected]
> Subject: Regex URL Filter Question
>
> Hi list,
>
> I'm working on configuring Nutch with ElasticSearch to provide a
> website search functionality. I've been reading the Nutch
> documentation and the NutchTutorial. In the NutchTutorial on the Wiki,
> the section "Configure Regular Expression Filters" gives the example
> of:
>
> +^http://([a-z0-9]*\.)*nutch.apache.org/
>
> However, I am a bit confused by this. Firstly, do / not need to be
> escaped as usual in a regular expression? As in ^http:\/\/(a-z.....
> instead of ^http://(a-z....

No, forward slashes do not need to be escaped. Java regexes are written
as plain strings without / delimiters, so / is an ordinary character.

>
> Also, I notice the first period is escaped, but the two periods in
> "nutch.apache.org" are not escaped. Periods are normally wildcards in
> regular expressions, hence my confusion.

Ideally the dots in the hostname should be escaped, but in the example
they are not. It still works, because an unescaped dot is a wildcard
that also matches a literal dot; the pattern is just slightly more
permissive than intended.

>
> Is this an error in the documentation? Are these regexes PCRE or POSIX?

Yes, it is an error in the documentation. As for the flavor, I am not
sure offhand: as far as I know Java's Pattern is its own implementation,
closest to Perl 5 syntax (so nearer to PCRE than to POSIX). Here are the
docs of the implementation used in Nutch; they explain all you need to
know.

https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

> Thank you!
> Steve
>
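
To make the two points above concrete, here is a minimal Java sketch using java.util.regex.Pattern directly. The class name, the constant names and the look-alike URL are made up for illustration, and the leading + of the filter rule is left out because it is Nutch's include marker, not part of the regex. Using find() rather than matches() is my assumption about how the filter applies a rule.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
final class UrlFilterRegexDemo {
    // Regex from the tutorial: the hostname dots are unescaped wildcards.
    static final Pattern LOOSE =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Same regex with the hostname dots escaped, so they match only '.'.
    static final Pattern STRICT =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/");

    // Same as STRICT but with escaped slashes; the escapes are legal but redundant.
    static final Pattern SLASHED =
        Pattern.compile("^http:\\/\\/([a-z0-9]*\\.)*nutch\\.apache\\.org\\/");

    // Assumption: a rule accepts a URL if the regex is found anywhere in it.
    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String real = "http://nutch.apache.org/downloads.html";
        String lookalike = "http://nutchXapacheYorg/"; // made-up URL

        System.out.println(accepts(LOOSE, real));        // true
        System.out.println(accepts(LOOSE, lookalike));   // true: '.' matched 'X' and 'Y'
        System.out.println(accepts(STRICT, real));       // true
        System.out.println(accepts(STRICT, lookalike));  // false
        System.out.println(accepts(SLASHED, real));      // true: same result as STRICT
    }
}
```

So the tutorial's pattern accepts everything the stricter one does, plus a few look-alike hosts that are unlikely to exist in practice.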

