Re: Lower case URLs - correct regex?

Dean Pullen Tue, 08 May 2012 07:33:14 -0700

We've dug a bit deeper...

We're actually upgrading from Nutch 1.0 to 1.4. It seems the regex stuff has moved away from the Perl5Substitution implementation, which supported various methods in the substitution string (such as \L etc.), to a standard Java pattern matcher, which takes the substitution string purely as a string, with no other operations possible.

This can be seen in RegexURLNormalizer (1.4), which uses the replaceAll method of the Java Matcher class:


http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String)
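To illustrate the replaceAll behaviour, here is a minimal sketch using plain java.util.regex rather than Nutch's own RegexURLNormalizer code. The class name ReplaceAllDemo and the sample URL are made up for the example. In a Java replacement string a backslash only escapes the following character, so \L and \E come out as the literal letters L and E. In addition, (.*) matches once more against the empty string at the end of the input, so the substitution is applied twice. That would account for a leading L and a stray ELE like the ones in the dump.

```java
import java.util.regex.Pattern;

public class ReplaceAllDemo {
    // Same pattern and substitution as in regex-normalize.xml from the original post.
    private static final Pattern PATTERN = Pattern.compile("(.*)");
    private static final String SUBSTITUTION = "\\L$1\\E"; // the characters \L$1\E

    // Applies the substitution the way a plain Java Matcher does.
    static String normalize(String url) {
        return PATTERN.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Prints: Lhttp://Example.org/Page?ID=1ELE
        System.out.println(normalize("http://Example.org/Page?ID=1"));
    }
}
```

Nothing is lower-cased: the first match yields L + url + E, and the empty match at the end appends another LE.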

So I suspect we'll need to look at using some other plugin to resolve this problem.
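As a rough idea of what such code would have to do, here is a hedged sketch in plain Java. It is not an existing Nutch plugin and does not use the Nutch plugin API; the class name LowerCaseSketch and the method lowerCase are hypothetical. Since the replacement string cannot change case, the case conversion is done in code on the captured group and the result is written back with appendReplacement.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowerCaseSketch {
    // Anchored so the pattern cannot match a second time at the end of the input.
    private static final Pattern WHOLE_URL = Pattern.compile("^(.*)$");

    // Hypothetical helper: lower-cases whatever group 1 captured.
    static String lowerCase(String url) {
        Matcher m = WHOLE_URL.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String lowered = m.group(1).toLowerCase(Locale.ROOT);
            // quoteReplacement keeps any '$' or '\' in the URL literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(lowered));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: http://some.page.org/?page=2633&pid=1042&site=191
        System.out.println(lowerCase("HTTP://Some.Page.org/?Page=2633&PID=1042&Site=191"));
    }
}
```

Locale.ROOT is used so the result does not depend on the default locale of the machine running the crawl.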



Dean.

On 08/05/2012 13:46, Markus Jelsma wrote:

I'm not sure this is going to work as a lowercase flag is used on the regular expressions.
On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen<[email protected]> wrote:
Hi all,


I'm trying to lower case all URLs via Nutch's regex-normalize.xml

The regex looks like:

<regex>
<pattern>(.*)</pattern>
<substitution>\L$1\E</substitution>
</regex>

This appears to be correct, yet we're seeing this when we dump the DB:
"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;"null"
"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;"null"


Notice the URL starts with an L? (Thus not matching http/https in
another config). Is this some problem with the regex above?

Regards,

Dean Pullen
