> Looks like you're using protocol-httpclient; try again with the
> protocol-http plugin instead. We crawled a large part of Wikipedia for
> test purposes and all global modern character sets worked just fine.
> 
> Can you fetch:
> http://es.wikipedia.org/wiki/Espa%C3%B1olas
> 
> with the parse or index checker? It works fine here.

Try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
with both protocol-httpclient and protocol-http.
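
It may also be worth checking which byte encoding the server expects for ñ. The %F1 form in the quoted message is the ISO-8859-1 encoding, while the Wikipedia URL above uses the UTF-8 form %C3%B1. A minimal Python sketch of the difference (this is not Nutch code; the path is simply the example from this thread):

```python
from urllib.parse import quote

path = "en Español.aspx"  # example path taken from this thread

# Percent-encode using UTF-8 bytes (the form used in the Wikipedia URL above)
utf8_path = quote(path, encoding="utf-8")

# Percent-encode using ISO-8859-1 bytes (the single-byte %F1 form)
latin1_path = quote(path, encoding="latin-1")

print(utf8_path)    # en%20Espa%C3%B1ol.aspx
print(latin1_path)  # en%20Espa%F1ol.aspx
```

Which of the two a given server accepts depends on how that server decodes its paths, so trying both with ParserChecker should show which one resolves.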

> 
> > I am trying to crawl a website which has link(s) with spanish/latin
> > characters in the url filename. I can't get Nutch to crawl the page(s)
> > with spanish accented chars in URL.
> > 
> >   Link: http://mydomain.com/en Español.aspx
> >   or    http://mydomain.com/en%20Español.aspx
> > 
> > 
> > 
> > I tried substituting the URL encoding (%F1) for the special character
> > (ñ), and %20 for the space (" "). The whole list is here:
> > <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
> > 
> >   The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
> > browser.
> > 
> > 
> > 
> > I tried to use the regex URL normalizer to do the substitution in the
> > regex-normalize.xml file as below (%20 for " " and %F1 for the
> > special character ñ).
> > 
> > <!-- replaces blank space (" ") in URL with escaped "%20" -->
> > <regex>
> >   <pattern> </pattern>
> >   <substitution>%20</substitution>
> > </regex>
> > 
> > 
> > 
> > <!-- replaces accented char ("ñ") in URL with escaped "%F1" -->
> > <regex>
> >   <pattern>ñ</pattern>
> >   <substitution>%F1</substitution>
> > </regex>
> > 
> > 
> > 
> > The former (blank space) substitution works fine, but I am having
> > trouble with the latter (ñ) substitution. I get a FATAL error (pointing
> > to the location of ñ in the file) in the command prompt and the error
> > below in my hadoop log.
> > 
> >      ERROR regex.RegexURLNormalizer - error parsing conf file:
> > org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
> > of 4-byte UTF-8 sequence.
> > 
> > 
> > 
> > Then I tried changing the character encoding in the nutch-site.xml file:
> > 
> > <property>
> >   <name>parser.character.encoding.default</name>
> >   <value>ISO-8859-1</value>
> >   <description>The character encoding to fall back to when no other
> >   information is available</description>
> > </property>
> > 
> >   And in the regex-normalize.xml file as below
> > 
> > <regex>
> >   <pattern>U+00F1</pattern>
> >   <substitution>%F1</substitution>
> > </regex>
> > 
> > 
> > 
> > Now I don't get any error in the command prompt, but I do get the error
> > below in my hadoop log. It looks like the substitution is happening, but
> > instead of "%F1" it uses "?".
> > 
> > 
> > 
> > ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
> > 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
> > 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
> > 
> > 2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of
> > http://mydomain.com/en%20Espa?ol.aspx failed with:
> > java.lang.IllegalArgumentException: Invalid uri
> > 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
> > 
> > 
> > 
> > 
> > 
> > Can anyone help me with this issue? Is there any other config changes I
> > need to do to get this to work?
> > 
> > 
> > 
> > Thanks in advance, any help in resolving this issue is much appreciated.
> > 
> > 
> > 
> > thanks & regards,
> > Rajesh Ramana
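
On the MalformedByteSequenceException quoted above: one possible cause (an assumption, since the file itself is not shown) is that regex-normalize.xml was saved as ISO-8859-1 while the XML parser read it as UTF-8. In ISO-8859-1, ñ is the single byte 0xF1, and in UTF-8 that byte announces a 4-byte sequence, so the byte that follows it is rejected. A small Python sketch of the same failure, not Nutch code:

```python
# The pattern line as it would be stored in an ISO-8859-1 encoded file
raw = "<pattern>ñ</pattern>".encode("latin-1")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(raw[9:10])             # the single byte that stands for ñ in Latin-1
print(decodes_as_utf8(raw))  # Latin-1 bytes read as UTF-8
print(decodes_as_utf8("<pattern>ñ</pattern>".encode("utf-8")))
```

If that is the cause, saving the file as UTF-8, or writing the character as the XML character reference &#xF1;, should avoid the parse error. Note also that in a Java regex U+00F1 is read as one or more "U" followed by "00F1"; the escape that matches ñ is \u00F1.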
