Thanks for the suggestion, Kirby. It works for URLs in the seed.txt file, but it won't work for URLs in the parsed content of a page.
I used a URL that has spaces in the seed.txt file; the spaces were replaced with %20 and I was able to crawl the page.

Scenario-1:
urls/seed.txt:
------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features

In this scenario the URL gets translated to:
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features

Scenario-2:
urls/seed.txt:
-------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources

The content of this page has many URLs that contain spaces, and Nutch cannot crawl beyond one level because it gets an error when it encounters a URL with a space in the content of the page.

Part of the content of the crawled page with Error:
-----------------------------------------------------------------------
Small Business Features
ERROR... URL Message http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
Small Business Expert Advice
ERROR... URL Message http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
Wall Street Journal
ERROR... URL Message http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
Retail

----
Thanks/Regards,
Parvez

On Thu, Sep 3, 2009 at 3:39 PM, Fuad Efendi <[email protected]> wrote:

> But 'normalizer' can't be used with 'injector' (seed.txt)... 'normalizer' is
> called after Fetching-Parsing-Outlinks HTML...
>
> > -----Original Message-----
> > From: Mohamed Parvez [mailto:[email protected]]
> > Sent: September-03-09 3:58 PM
> > To: [email protected]
> > Subject: Re: URL with Space
> >
> > Thanks for the suggestion, Fuad.
> >
> > I used your suggestion but it does not seem to work; the space does not
> > get replaced by %20 or +
> >
> > Scenario-1
> > urls/seed.txt:
> > ------------------
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> >
> > I get the following error:
> > ---------------------------------
> > fetch of http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business Features failed with: Httpcode=406
> >
> > But if I start with a URL with %20 instead of spaces
> >
> > Scenario-2
> > urls/seed.txt:
> > ------------------
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> >
> > Everything works as expected.
> >
> > ----
> > Thanks/Regards,
> > Parvez
> >
> > On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <[email protected]> wrote:
> >
> > > > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic))
> > > > and I put the below rule in the conf/regex-normalize.xml file
> > > >
> > > > <regex>
> > > >   <pattern>\s</pattern>
> > > >   <substitution>%20</substitution>
> > > > </regex>
> > >
> > > Should be escaped backslash:
> > > <pattern>\\s</pattern>
> > >
> > > You can also use + (plus) instead of %20.
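
For reference, the substitution the quoted rule is meant to perform on an outlink can be sketched in plain Java. This is only an illustration with java.util.regex, not the urlnormalizer plugin's code: the class name SpaceNormalizerSketch and the normalize helper are made up for this example, and the URL is a shortened form of the one in the thread.

```java
import java.util.regex.Pattern;

public class SpaceNormalizerSketch {
    // Regex \s (any whitespace character). The backslash is doubled here only
    // because this is a Java string literal.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Hypothetical helper (not a Nutch API): replace each whitespace
    // character in the URL with %20.
    public static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Shortened, illustrative version of the URL from the thread.
        String outlink = "http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb"
                + "?categoryname=Small Business&portletTitle=Small Business Features";
        System.out.println(normalize(outlink));
    }
}
```

If the rule were applied to outlinks in this way, the spaces in the query string would come out as %20, which is the form that crawled successfully from seed.txt.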
