Hi Markus,

My apologies, I got caught up in other projects this morning.
My fetch for the url http://es.wikipedia.org/wiki/Espa%C3%B1olas with parse or index checker worked fine.

FYI, the site I am trying to crawl is a sharepoint site. I guess I need to use ‘protocol-httpclient’ as it needs NTLM authentication credentials in httpclient-auth.xml file.

This is my view of the problem area:
- Crawling the seed page, nutch gets all the links to be crawled.
- The REGEX normalizer transforms the special characters, but fails to substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’
- The fetcher is having trouble interpreting the links with special character ‘ñ’.

I tried your suggestion, see details below.

WITH ‘protocol-http’

$ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en Español.aspx"
---------
Url
---------------
http://mydomain.com/en Espa±ol.aspx
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Bad Request
Outlinks: 0
Content Metadata: Date=Tue, 04 Oct 2011 18:30:45 GMT Content-Length=311 Connection=close Content-Type=text/html; charset=us-ascii Server=Microsoft-HTTPAPI/2.0
Parse Metadata: CharEncodingForConversion=us-ascii OriginalCharEncoding=us-ascii

$ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en%20Espa%F1ol.aspx"
---------
Url
---------------
http://mydomain.com/en%20Espa%F1ol.aspx
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: WWW-Authenticate=NTLM X-SharePointHealthScore=3 Date=Tue, 04 Oct 2011 18:30:56 GMT MicrosoftSharePointTeamServices=14.0.0.4762 Content-Length=16 Connection=close Content-Type=text/html; charset=utf-8 X-Powered-By=ASP.NET Server=Microsoft-IIS/7.5 SPRequestGuid=77bfafe3-9783-48eb-8d25-d0f30f54c8f9
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

The results are like above for "http://mydomain.com/en%20Espa%C3%B1ol.aspx"

WITH ‘protocol-httpclient’

$ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en Español.aspx"
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

$ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en%20Espa%F1ol.aspx"
---------
Url
---------------
http://mydomain.com/en%20Espa%F1ol.aspx
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: en Espa±ol
Outlinks: 182
....
  outlink: toUrl: http://mydomain.com/Gilo.aspx anchor: Gilo
  outlink: toUrl: http://mydomain.com/Asian,%20Asian%20names.aspx anchor: Asian, Asian names
  outlink: toUrl: http://mydomain.com/Albania.aspx anchor: Albania
  outlink: toUrl: http://mydomain.com/Nuclear%20meltdown.aspx anchor: Nuclear meltdown
  outlink: toUrl: http://mydomain.com/Arab%20Spring.aspx anchor: Arab Spring
....
Content Metadata: X-SharePointHealthScore=2 Persistent-Auth=true MicrosoftSharePointTeamServices=14.0.0.4762 Content-Length=65194 Exp n, 19 Sep 2011 18:11:35 GMT Last-Modified=Tue, 04 Oct 2011 18:11:35 GMT Set-Cookie=WSS_KeepSessionAuthenticated={027d0360-d7a6-4607-8 be3a76d75}; path=/ Server=Microsoft-IIS/7.5 X-Powered-By=ASP.NET Cache-Control=private, max-age=0 SPRequestGuid=48edc96d-8f0a-424d-a6 b844952c X-AspNet-Version=2.0.50727 Date=Tue, 04 Oct 2011 18:11:35 GMT Content-Type=text/html; charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

The results are like above for "http://mydomain.com/en%20Espa%C3%B1ol.aspx"

Please let me know, I’ll also continue digging more.
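
For reference, here is a minimal Java sketch of how the file name maps to the two escaped forms tested above. It is not Nutch code: the class name EnyeEscapeSketch and the helper escapeSegment are made up for illustration, and it uses only the JDK's URLEncoder. The point is that '%F1' is the ISO-8859-1 escape of 'ñ' and '%C3%B1' is the UTF-8 escape of the same character.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only (not part of Nutch).
class EnyeEscapeSketch {

    // Percent-encodes one path segment using the given charset.
    // URLEncoder targets HTML form encoding and turns a space into '+',
    // so '+' is mapped back to "%20" afterwards. A literal '+' in the
    // input is already emitted as "%2B", so this replacement is safe.
    static String escapeSegment(String segment, String charsetName) {
        try {
            return URLEncoder.encode(segment, charsetName).replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The n-with-tilde is written as a Unicode escape so this source file stays pure ASCII.
        String segment = "en Espa\u00F1ol.aspx";

        // ISO-8859-1 encodes the character as the single byte 0xF1.
        System.out.println(escapeSegment(segment, "ISO-8859-1")); // en%20Espa%F1ol.aspx

        // UTF-8 encodes the same character as the two bytes 0xC3 0xB1.
        System.out.println(escapeSegment(segment, "UTF-8"));      // en%20Espa%C3%B1ol.aspx
    }
}
```

Which of the two forms a server accepts depends on the server; in the runs above the SharePoint site responded to both.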
thanks & regards,
Rajesh Ramana

On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <[email protected]> wrote:

>> Looks like you're using protocol-httpclient, try again with the
>> protocol-http plugin instead. We crawled a large part of wikipedia
>> for test purposes and all global modern character sets worked just fine.
>>
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>>
>> with parse or index checker? It works fine here.
>
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL> with both
> protocol-httpclient and protocol-http.
>
>>> I am trying to crawl a website which has link(s) with spanish/latin
>>> characters in the url filename. I can't get Nutch to crawl the
>>> page(s) with spanish accented chars in URL.
>>>
>>> Link: http://mydomain.com/en Español.aspx or
>>> http://mydomain.com/en%20Español.aspx
>>>
>>> I tried to substitute the URL encode(%F1) for the special character
>>> (ñ), (and %20 is for " "), the whole list here
>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp> .
>>>
>>> The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>> browser
>>>
>>> I tried to use regex URL normalizer to do the substitution in
>>> regex-normalize.xml file as below (%20 is for " ") and (%F1 for the
>>> special character ñ).
>>>
>>> <!-- replaces blank space(" ") in URL with escaped "%20" -->
>>> <regex>
>>>   <pattern> </pattern>
>>>   <substitution>%20</substitution>
>>> </regex>
>>>
>>> <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
>>> <regex>
>>>   <pattern>ñ</pattern>
>>>   <substitution>%F1</substitution>
>>> </regex>
>>>
>>> The former (blank space) substitution works fine, but I am having
>>> trouble with the latter (ñ) substitution. I am getting a FATAL
>>> error (pointing to ñ location in the file) in the command prompt and
>>> the below error in my hadoop log.
>>>
>>> ERROR regex.RegexURLNormalizer - error parsing conf file:
>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid
>>> byte 2 of 4-byte UTF-8 sequence.
>>>
>>> Then I tried changing the character encoding in nutch-site.xml file
>>>
>>> <property>
>>>   <name>parser.character.encoding.default</name>
>>>   <value>ISO-8859-1</value>
>>>   <description>The character encoding to fall back to when no other
>>>   information is available</description>
>>> </property>
>>>
>>> And in the regex-normalize.xml file as below
>>>
>>> <regex>
>>>   <pattern>U+00F1</pattern>
>>>   <substitution>%F1</substitution>
>>> </regex>
>>>
>>> Now, I don't have any error in the command prompt, but I get the
>>> below error in my hadoop log. It looks like the substitution is
>>> happening but instead of the "%F1" it uses "?".
>>>
>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid
>>> uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path
>>> not valid
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>> 2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of
>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>> java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
>>>
>>> Can anyone help me with this issue? Is there any other config
>>> changes I need to do to get this to work?
>>>
>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>>
>>> thanks & regards,
>>> Rajesh Ramana
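
A side note on the two regex-normalize.xml attempts quoted above, offered as a hedged sketch and not a confirmed diagnosis. It assumes the normalizer compiles its patterns with java.util.regex, and it guesses that the config file was saved in a single-byte encoding such as ISO-8859-1; the class NormalizerSketch and its methods are hypothetical. In ISO-8859-1 the character ñ is the lone byte 0xF1, which a UTF-8 parser reads as the lead byte of a 4-byte sequence, and that would be consistent with the Xerces message. Separately, the pattern U+00F1 is read by java.util.regex as "one or more U followed by 00F1", so it never matches ñ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of Nutch).
class NormalizerSketch {

    // Returns true only if the bytes decode as strict UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Applies the two substitutions from the quoted config, with the
    // accented character written as a Unicode escape in the pattern.
    static String normalize(String url) {
        return url.replaceAll(" ", "%20").replaceAll("\u00F1", "%F1");
    }

    public static void main(String[] args) {
        String patternLine = "<pattern>\u00F1</pattern>";

        // Saved as ISO-8859-1, the line contains a lone 0xF1 byte: not valid UTF-8.
        System.out.println(isValidUtf8(patternLine.getBytes(StandardCharsets.ISO_8859_1))); // false

        // Saved as UTF-8, the same line decodes cleanly.
        System.out.println(isValidUtf8(patternLine.getBytes(StandardCharsets.UTF_8)));      // true

        // "U+00F1" is regex syntax for one or more 'U' then "00F1"; the URL is left unchanged.
        System.out.println("Espa\u00F1ol".replaceAll("U+00F1", "%F1").equals("Espa\u00F1ol")); // true

        // With a pattern that really is the character, the substitution applies.
        System.out.println(normalize("http://mydomain.com/en Espa\u00F1ol.aspx"));
        // http://mydomain.com/en%20Espa%F1ol.aspx
    }
}
```

If this reading is right, saving regex-normalize.xml as UTF-8, or writing the pattern as the XML character reference &#xF1; so the file stays ASCII, may be worth trying. The "?" in the failing URL would then come from a character conversion elsewhere, not from the U+00F1 rule.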

