Hi all,

I'm trying to lowercase all URLs via Nutch's regex-normalize.xml.

The regex looks like:

<regex>
<pattern>(.*)</pattern>
<substitution>\L$1\E</substitution>
</regex>

This appears to be correct, yet we're seeing this when we dump the DB:

"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"


Notice that each URL now starts with a literal "L" (so it no longer matches the http/https rule in another config) and also contains a stray "ELE". Is this a problem with the regex above?
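
To reproduce this outside Nutch, here is a minimal sketch in plain Java. It assumes the normalizer passes the pattern and substitution to java.util.regex unchanged, which I have not confirmed against the Nutch source. The class name, method name and sample URL are made up for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class LowercaseUrlDemo {

    // Applies one pattern/substitution pair with java.util.regex,
    // the same way String.replaceAll would.
    static String substitute(String url, String pattern, String substitution) {
        return Pattern.compile(pattern).matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        String url = "http://Some.Page.org/?Page=1"; // hypothetical sample URL

        // \L and \E are Perl/sed case operators. In a java.util.regex
        // replacement string a backslash only escapes the next character,
        // so "\L" is a literal "L" and "\E" is a literal "E".
        System.out.println(substitute(url, "(.*)", "\\L$1\\E"));

        // Lowercasing in code rather than in the substitution string.
        System.out.println(url.toLowerCase(Locale.ROOT));
    }
}
```

The first line printed is Lhttp://Some.Page.org/?Page=1ELE: the case is unchanged, "L" and "E" are inserted literally, and the trailing "LE" comes from (.*) matching a second time on the empty string at the end of the input. The second line is http://some.page.org/?page=1. If Nutch behaves the same way, the substitution syntax itself cannot lowercase anything.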

Regards,

Dean Pullen
