> Looks like you're using protocol-httpclient; try again with the
> protocol-http plugin instead. We crawled a large part of Wikipedia for
> test purposes and all global modern character sets worked just fine.
>
> Can you fetch:
> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>
> with the parse or index checker? It works fine here.
Try bin/nutch org.apache.nutch.parse.ParserChecker <URL> with both
protocol-httpclient and protocol-http.

> > I am trying to crawl a website which has link(s) with Spanish/Latin
> > characters in the URL filename. I can't get Nutch to crawl the page(s)
> > with Spanish accented chars in the URL.
> >
> > Link: http://mydomain.com/en Español.aspx
> > <http://mydomain.com/en%20Español.aspx> or
> > http://mydomain.com/en%20Español.aspx
> >
> > I tried to substitute the URL encoding (%F1) for the special character (ñ)
> > (and %20 for " "); the whole list is here:
> > <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
> >
> > The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
> > browser.
> >
> > I tried to use the regex URL normalizer to do the substitution in the
> > regex-normalize.xml file as below (%20 for " " and %F1 for the
> > special character ñ).
> >
> > <!-- replaces blank space(" ") in URL with escaped "%20" -->
> > <regex>
> >   <pattern> </pattern>
> >   <substitution>%20</substitution>
> > </regex>
> >
> > <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
> > <regex>
> >   <pattern>ñ</pattern>
> >   <substitution>%F1</substitution>
> > </regex>
> >
> > The former (blank space) substitution works fine, but I am having trouble
> > with the latter (ñ) substitution: I get a FATAL error (pointing to the ñ
> > location in the file) in the command prompt and the error below in my
> > hadoop log.
> >
> > ERROR regex.RegexURLNormalizer - error parsing conf file:
> > org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
> > of 4-byte UTF-8 sequence.
> >
> > Then I tried changing the character encoding in the nutch-site.xml file:
> >
> > <property>
> >   <name>parser.character.encoding.default</name>
> >   <value>ISO-8859-1</value>
> >   <description>The character encoding to fall back to when no other
> >   information is available</description>
> > </property>
> >
> > And in the regex-normalize.xml file as below:
> >
> > <regex>
> >   <pattern>U+00F1</pattern>
> >   <substitution>%F1</substitution>
> > </regex>
> >
> > Now I don't have any error in the command prompt, but I get the error
> > below in my hadoop log. It looks like the substitution is happening, but
> > instead of "%F1" it uses "?".
> >
> > ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
> > 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
> > 2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of
> > http://mydomain.com/en%20Espa?ol.aspx failed with:
> > java.lang.IllegalArgumentException: Invalid uri
> > 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
> >
> > Can anyone help me with this issue? Are there any other config changes I
> > need to make to get this to work?
> >
> > Thanks in advance; any help in resolving this issue is much appreciated.
> >
> > Thanks & regards,
> > Rajesh Ramana
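
For what it's worth, the two escapes seen in this thread (%C3%B1 in the Wikipedia URL, %F1 in the quoted one) are the same character, ñ, percent-encoded from two different byte encodings. Below is a minimal Java sketch of that, not Nutch code: the class and method names (PercentEncodeDemo, encodePath, isValidUtf8) are made up for illustration, and which encoding the server at mydomain.com actually expects is an assumption you would need to check. The second helper shows why a lone ISO-8859-1 byte 0xF1 is not valid UTF-8, which is probably what Xerces complained about if regex-normalize.xml was saved as ISO-8859-1 but read as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class PercentEncodeDemo {

    // Percent-encode every byte of the path that is not an RFC 3986
    // unreserved character or '/'. The charset decides which bytes
    // a non-ASCII character such as ñ turns into.
    static String encodePath(String path, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(charset)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~' || c == '/';
            if (safe) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c));
            }
        }
        return out.toString();
    }

    // Strictly decode the bytes as UTF-8 and report whether that succeeds.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \u00F1 is ñ; the escape keeps this source file pure ASCII.
        String path = "/en Espa\u00F1ol.aspx";
        System.out.println(encodePath(path, StandardCharsets.UTF_8));
        System.out.println(encodePath(path, StandardCharsets.ISO_8859_1));

        // "ñ<" as ISO-8859-1 bytes: 0xF1 looks like the start of a 4-byte
        // UTF-8 sequence, but '<' is not a continuation byte.
        byte[] latin1 = "\u00F1<".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(latin1));
    }
}
```

If the pattern in regex-normalize.xml has to contain ñ literally, saving that file as UTF-8 (or declaring its real encoding in the XML prolog) should avoid the parse error; whether the substitution should then be %F1 or %C3%B1 depends on what the target server accepts.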

