No idea if it is the "proper" way to do it, but I did this:

<regex>
  <pattern> </pattern>
  <substitution>%20</substitution>
</regex>

And added that to regex-normalize.xml (I modified the template, but you get the idea). That resolved URLs with spaces inside of them for me. There is probably a faster way, but that one worked.

According to the comment I left in the file, I found this here:
https://issues.apache.org/jira/browse/NUTCH-661

Thanks,
Kirby

On Thu, Sep 3, 2009 at 2:57 PM, Mohamed Parvez<[email protected]> wrote:
> Thanks for the suggestion, Fuad.
>
> I used your suggestion, but it does not seem to work; the space does not get
> replaced by %20 or +
>
> Scenario-1
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
>
> I get the following error:
> ---------------------------------
> fetch of
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business
> *Features failed with: Httpcode=406*
>
> But if I start with a URL with %20 instead of space
>
> Scenario-2
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
>
> Everything works as expected.
>
> ----
> Thanks/Regards,
> Parvez
>
> On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <[email protected]> wrote:
>
>> > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and I
>> > put the below rule in the conf/regex-normalize.xml file
>> >
>> > <regex>
>> >   <pattern>\s</pattern>
>> >   <substitution>%20</substitution>
>> > </regex>
>>
>> Should be escaped backslash:
>> <pattern>\\s</pattern>
>>
>> You can also use + (plus) instead of %20.
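
For what it's worth, the effect of a rule like the ones discussed in this thread can be tried outside of Nutch with plain java.util.regex. The sketch below is only an illustration: the class name SpaceNormalizerSketch and the example.com URL are made up, and it assumes the normalizer hands the pattern text from the XML file to java.util.regex unchanged, which I have not verified against the plugin source. Under that assumption, the doubled backslash is only needed in Java source code; inside the XML file a single \s would already mean "any whitespace", while \\s would look for a literal backslash followed by "s".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mimics a single <regex> rule
// from regex-normalize.xml using plain java.util.regex.
final class SpaceNormalizerSketch {

    // In Java source the backslash must be doubled, so the string "\\s"
    // is the regex \s (any single whitespace character).
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Replace every whitespace character in the URL with %20.
    static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Made-up seed URL with spaces in the query string.
        String seed = "http://example.com/smb?categoryname=Small Business"
                + "&portletTitle=Small Business Features";
        System.out.println(normalize(seed));
    }
}
```

Running main should print the seed URL with each space turned into %20, which is the form that fetched fine in Scenario-2 above.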
