Hi all,
I'm trying to lowercase all URLs via Nutch's regex-normalize.xml.
The regex looks like:
<regex>
<pattern>(.*)</pattern>
<substitution>\L$1\E</substitution>
</regex>
This appears to be correct, yet we're seeing this when we dump the DB:
"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
Notice that each URL now starts with an L, so it no longer matches the
http/https rules in another config, and a stray "ELE" also shows up
inside the URL. Is there a problem with the regex above?
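For reference, here is a minimal standalone sketch that seems to reproduce the symptom outside Nutch. It assumes the normalizer applies each rule with java.util.regex's Matcher.replaceAll, which I have not confirmed. The class name RegexRuleDemo, the helper methods and the sample URL are all made up for illustration. In a Java replacement string a backslash just makes the next character literal, so \L and \E would come out as plain "L" and "E" rather than Perl-style case conversion. Also, (.*) matches once more on the empty string at the end of the input, which would explain the doubled suffix.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class RegexRuleDemo {

    // Applies one pattern/substitution pair the way Matcher.replaceAll does.
    // Assumption: this mirrors how the normalizer applies a <regex> rule.
    static String applyRule(String pattern, String substitution, String url) {
        return Pattern.compile(pattern).matcher(url).replaceAll(substitution);
    }

    // Plain-Java lowercasing, shown only for comparison with the rule above.
    static String lowerCase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String url = "http://Some.Page.org/?Page=2633"; // made-up sample URL

        // "\\L$1\\E" is the Java literal for the substitution \L$1\E.
        // Prints: Lhttp://Some.Page.org/?Page=2633ELE
        System.out.println(applyRule("(.*)", "\\L$1\\E", url));

        // Prints: http://some.page.org/?page=2633
        System.out.println(lowerCase(url));
    }
}
```

If that assumption holds, the case is never changed by the substitution at all, and the extra letters are just the escaped "L" and "E" being copied into the output.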
Regards,
Dean Pullen