We've dug a bit deeper...
We're actually upgrading from Nutch 1.0 to 1.4. It seems the regex
handling has moved away from the Perl5Substitution implementation, which
supported case-modification operators (such as \L) in the substitution
string, to the standard Java pattern matcher, which treats the
substitution string as literal text plus $n group references, with no
other operations possible.
This can be seen in RegexURLNormalizer (1.4), which uses the replaceAll
method of the Java Matcher class:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String)
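To illustrate what we think is happening, here is a minimal standalone
sketch (not Nutch code; the class and method names are made up for the
example) that applies the same pattern and substitution string through
Matcher.replaceAll. In a Java replacement string a backslash just makes
the next character literal, so \L and \E come out as plain "L" and "E".
The pattern (.*) also matches the empty string at the end of the input,
so the replacement is applied a second time there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubstitutionDemo {

    // Same pattern and substitution as in regex-normalize.xml.
    // In Java source the backslashes have to be doubled.
    private static final Pattern PATTERN = Pattern.compile("(.*)");
    private static final String SUBSTITUTION = "\\L$1\\E";

    // Hypothetical helper: applies the rule the way a plain
    // java.util.regex based normalizer would.
    static String apply(String url) {
        Matcher matcher = PATTERN.matcher(url);
        return matcher.replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // \L and \E are not case operators here, just literal L and E.
        System.out.println(apply("http://Example.COM/Page"));
        // prints "Lhttp://Example.COM/PageELE"
    }
}
```

The leading "L" and the "ELE" fragment are the same pieces we see in the
dumped URLs, and the case of the URL itself is left untouched.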
So I suspect we'll need to look at using some other plugin to resolve
this problem.
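As a rough idea of what such a plugin would have to do, here is a
sketch of lower-casing in code rather than in the substitution string.
This is plain Java under our own assumptions: the class and method
names are hypothetical, and wiring it into Nutch as a normalizer is not
shown. It uses the Matcher append loop so the matched group can be
transformed before it is written out.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowerCaseUrlSketch {

    // Same catch-all pattern as the rule in regex-normalize.xml.
    private static final Pattern WHOLE_URL = Pattern.compile("(.*)");

    // Hypothetical helper: lower-cases whatever group 1 captured.
    static String lowerCase(String url) {
        Matcher matcher = WHOLE_URL.matcher(url);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String lowered = matcher.group(1).toLowerCase(Locale.ROOT);
            // quoteReplacement stops '$' and '\' in the URL from being
            // interpreted as replacement syntax.
            matcher.appendReplacement(result, Matcher.quoteReplacement(lowered));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerCase("http://Example.COM/Page?ID=1"));
        // prints "http://example.com/page?id=1"
    }
}
```

For the whole-URL case a plain url.toLowerCase(Locale.ROOT) would do the
same job; the loop only matters if part of the URL should be left alone.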
Dean.
On 08/05/2012 13:46, Markus Jelsma wrote:
I'm not sure this is going to work as a lowercase flag is used on the
regular expressions.
On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen wrote:
Hi all,
I'm trying to lower-case all URLs via Nutch's regex-normalize.xml.
The rule looks like:
<regex>
<pattern>(.*)</pattern>
<substitution>\L$1\E</substitution>
</regex>
This appears to be correct, yet we're seeing this when we dump the DB:
"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
Notice that each URL now starts with an L, so it no longer matches
http/https in another config. Is there some problem with the regex above?
Regards,
Dean Pullen