RE: Nutch not crawling URLs with spanish accented characters ( ñ)

Ramanathapuram, Rajesh Tue, 04 Oct 2011 11:54:14 -0700

Hi Markus, 

My apologies, I got caught up in other projects this morning.


Fetching the URL http://es.wikipedia.org/wiki/Espa%C3%B1olas with the parse 
checker or index checker worked fine.

FYI, the site I am trying to crawl is a SharePoint site. I guess I need to use 
‘protocol-httpclient’, since the site requires NTLM authentication and the 
credentials go in the httpclient-auth.xml file. 

This is my view of the problem area:
-       While crawling the seed page, Nutch collects all the links to be crawled. 
-       The regex normalizer transforms the special characters, but fails to 
substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’. 
-       The fetcher has trouble interpreting links that contain the special 
character ‘ñ’. 
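
To make the two encodings in the list above concrete, here is a small Python sketch. Python is used only for illustration (it is not part of Nutch), and the helper name percent_encode_path is made up for this example. It shows that ‘%C3%B1’ is the percent-encoding of the UTF-8 bytes of ‘ñ’, while ‘%F1’ is the percent-encoding of its single ISO-8859-1 byte:

```python
from urllib.parse import quote, unquote


def percent_encode_path(path: str, charset: str) -> str:
    """Percent-encode a URL path; non-ASCII characters become the bytes of `charset`."""
    return quote(path, safe="/", encoding=charset)


# Path taken from the example URL in this thread.
path = "/en Español.aspx"

utf8_form = percent_encode_path(path, "utf-8")
latin1_form = percent_encode_path(path, "iso-8859-1")

print(utf8_form)    # /en%20Espa%C3%B1ol.aspx
print(latin1_form)  # /en%20Espa%F1ol.aspx

# Each form decodes back to the original only with the matching charset.
print(unquote(utf8_form, encoding="utf-8") == path)
print(unquote(latin1_form, encoding="iso-8859-1") == path)
```

Which of the two forms a server accepts depends on how it decodes the request path, so this only illustrates where the two spellings come from.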

I tried your suggestion, see details below. 

WITH ‘protocol-http’ 
                $ bin/nutch org.apache.nutch.parse.ParserChecker 
"http://mydomain.com/en Español.aspx"
                                ---------
                                Url
                                ---------------
                                http://mydomain.com/en Espa±ol.aspx---------
                                ParseData
                                ---------
                                Version: 5
                                Status: success(1,0)
                                Title: Bad Request
                                Outlinks: 0
                                Content Metadata: Date=Tue, 04 Oct 2011 
18:30:45 GMT Content-Length=311 Connection=close Content-Type=text/html; 
charset=us-ascii Server=Microsoft-HTTPAPI/2.0
                                Parse Metadata: 
CharEncodingForConversion=us-ascii OriginalCharEncoding=us-ascii
                
                $ bin/nutch org.apache.nutch.parse.ParserChecker 
"http://mydomain.com/en%20Espa%F1ol.aspx"
                                ---------
                                Url
                                ---------------
                                http://mydomain.com/en%20Espa%F1ol.aspx---------
                                ParseData
                                ---------
                                Version: 5
                                Status: success(1,0)
                                Title:
                                Outlinks: 0
                                Content Metadata: WWW-Authenticate=NTLM 
X-SharePointHealthScore=3 Date=Tue, 04 Oct 2011 18:30:56 GMT 
MicrosoftSharePointTeamServices=14.0.0.4762 Content-Length=16 Connection=close 
Content-Type=text/html; charset=utf-8 X-Powered-By=ASP.NET 
Server=Microsoft-IIS/7.5 SPRequestGuid=77bfafe3-9783-48eb-8d25-d0f30f54c8f9
                                Parse Metadata: CharEncodingForConversion=utf-8 
OriginalCharEncoding=utf-8

                The results are the same as above for 
"http://mydomain.com/en%20Espa%C3%B1ol.aspx".


WITH ‘protocol-httpclient’ 

                $ bin/nutch org.apache.nutch.parse.ParserChecker 
"http://mydomain.com/en Español.aspx"
                                Exception in thread "main" java.lang.NullPointerException
                                        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
                
                $ bin/nutch org.apache.nutch.parse.ParserChecker 
"http://mydomain.com/en%20Espa%F1ol.aspx"
                                ---------
                                Url
                                ---------------
                                http://mydomain.com/en%20Espa%F1ol.aspx---------
                                ParseData
                                ---------
                                Version: 5
                                Status: success(1,0)
                                Title: en Espa±ol
                                Outlinks: 182
                                ....
                                 outlink: toUrl: http://mydomain.com/Gilo.aspx 
anchor: Gilo
                                 outlink: toUrl: 
http://mydomain.com/Asian,%20Asian%20names.aspx anchor: Asian, Asian names
                                 outlink: toUrl: 
http://mydomain.com/Albania.aspx anchor: Albania
                                 outlink: toUrl: 
http://mydomain.com/Nuclear%20meltdown.aspx anchor: Nuclear meltdown
                                 outlink: toUrl: 
http://mydomain.com/Arab%20Spring.aspx anchor: Arab Spring
                                ....
                                Content Metadata: X-SharePointHealthScore=2 
Persistent-Auth=true MicrosoftSharePointTeamServices=14.0.0.4762 
Content-Length=65194 Exp
                                n, 19 Sep 2011 18:11:35 GMT Last-Modified=Tue, 
04 Oct 2011 18:11:35 GMT 
Set-Cookie=WSS_KeepSessionAuthenticated={027d0360-d7a6-4607-8
                                be3a76d75}; path=/ Server=Microsoft-IIS/7.5 
X-Powered-By=ASP.NET Cache-Control=private, max-age=0 
SPRequestGuid=48edc96d-8f0a-424d-a6
                                b844952c X-AspNet-Version=2.0.50727 Date=Tue, 
04 Oct 2011 18:11:35 GMT Content-Type=text/html; charset=utf-8
                                Parse Metadata: CharEncodingForConversion=utf-8 
OriginalCharEncoding=utf-8
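
A side note on the ‘Espa±ol’ that appears in the output above: one possible explanation, and it is only a guess, is that the text was written out as Windows-1252 bytes and then displayed by a console using an OEM code page such as CP437. The console code page is an assumption; nothing in the output confirms it. A short Python sketch (for illustration only, not Nutch code) of that byte-level mix-up:

```python
# 'ñ' is the single byte 0xF1 in Windows-1252 (and in ISO-8859-1).
raw = "Español".encode("cp1252")

# The same byte 0xF1 is the plus-minus sign in the CP437 OEM code page.
as_shown_by_console = raw.decode("cp437")

print(raw)                  # b'Espa\xf1ol'
print(as_shown_by_console)  # Espa±ol
```

If that is what happened, the ‘±’ would be a display issue in the terminal rather than a sign that the page title itself was parsed incorrectly.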

                The results are the same as above for 
"http://mydomain.com/en%20Espa%C3%B1ol.aspx".

Please let me know what you think. I'll also keep digging.

thanks & regards,
Rajesh Ramana 

On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <[email protected]> wrote:

> 
>> Looks like you're using protocol-httpclient, try again with the 
>> protocol-http plugin instead. We crawled a large part of Wikipedia 
>> for test purposes and all global modern character sets worked just fine.
>> 
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>> 
>> with parse or index checker? It works fine here.
> 
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL> with both 
> protocol-httpclient and protocol-http.
> 
>> 
>>> I am trying to crawl a website which has link(s) with spanish/latin 
>>> characters in the url filename. I can't get Nutch to crawl the 
>>> page(s) with spanish accented chars in URL.
>>> 
>>>  Link: http://mydomain.com/en Español.aspx   or
>>> http://mydomain.com/en%20Español.aspx
>>> 
>>> 
>>> 
>>> I tried substituting the URL-encoded form (%F1) for the special character 
>>> (ñ), and %20 for " "; the whole list is here 
>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp> .
>>> 
>>>  The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>> browser.
>>> 
>>> 
>>> 
>>> I tried to use regex URL normalizer to do the substitution in 
>>> regex-normalize.xml file as below  (%20 is for " ") and (%F1 for the 
>>> special character ñ).
>>> 
>>> <!-- replaces blank space(" ") in URL with escaped "%20"  -->
>>> 
>>> <regex>
>>> 
>>>  <pattern> </pattern>
>>> 
>>>  <substitution>%20</substitution>
>>> 
>>> </regex>
>>> 
>>> 
>>> 
>>> <!-- replaces accented char("ñ") in URL with escaped "%F1"  -->
>>> 
>>> <regex>
>>> 
>>>  <pattern>ñ</pattern>
>>> 
>>>  <substitution>%F1</substitution>
>>> 
>>> </regex>
>>> 
>>> 
>>> 
>>> The former (blank space) substitution works fine, but I am having trouble 
>>> with the latter (ñ) substitution: I get a FATAL error (pointing to the 
>>> location of ñ in the file) at the command prompt and the error below in 
>>> my hadoop log.
>>> 
>>>     ERROR regex.RegexURLNormalizer - error parsing conf file:
>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid 
>>> byte 2 of 4-byte UTF-8 sequence.
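
One possible reading of this error, sketched in Python purely for illustration (this is not Nutch code): if regex-normalize.xml was saved as ISO-8859-1 or Windows-1252 while the XML parser read it as UTF-8 (the XML default when no encoding is declared), then ‘ñ’ is stored as the single byte 0xF1. In UTF-8, 0xF1 announces a 4-byte sequence, and the ‘<’ that follows is not a valid second byte. The file encoding is an assumption here, and the fragment below is a stand-in for the real file.

```python
# Stand-in for the relevant line of regex-normalize.xml (hypothetical content).
fragment = "<pattern>ñ</pattern>"

latin1_bytes = fragment.encode("iso-8859-1")  # ñ becomes the single byte 0xF1
utf8_bytes = fragment.encode("utf-8")         # ñ becomes the two bytes 0xC3 0xB1


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xF1 is 0b11110001: the lead byte of a 4-byte UTF-8 sequence.
print(bin(latin1_bytes[len("<pattern>")]))

print(decodes_as_utf8(latin1_bytes))  # False: '<' cannot be byte 2 of the sequence
print(decodes_as_utf8(utf8_bytes))    # True
```

If that is what happened, saving the file as UTF-8, or writing the character as the XML character reference &#xF1;, should avoid the parse error. Neither has been verified against this Nutch setup.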
>>> 
>>> 
>>> 
>>> Then I tried changing the character encoding in nutch-site.xml file
>>> 
>>> <property>
>>> 
>>>  <name>parser.character.encoding.default</name>
>>> 
>>>  <value>ISO-8859-1</value>
>>> 
>>>  <description>The character encoding to fall back to when no other
>>>  information is available</description>
>>> 
>>> </property>
>>> 
>>>  And in the regex-normalize.xml file as below
>>> 
>>> <regex>
>>> 
>>>  <pattern>U+00F1</pattern>
>>> 
>>>  <substitution>%F1</substitution>
>>> 
>>> </regex>
>>> 
>>> 
>>> 
>>> Now I don't get any error at the command prompt, but I do get the error 
>>> below in my hadoop log. It looks like the substitution is happening, 
>>> but instead of "%F1" it uses "?".
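
A hedged note on the "U+00F1" pattern, again sketched in Python for illustration (Nutch itself uses Java regular expressions, but ‘+’ has the same meaning in both): read as a regex, "U+00F1" means one or more "U" followed by the literal text "00F1", so it would not match ‘ñ’. The "?" in the log may then come from a later conversion of the unescaped ‘ñ’ to a charset that lacks it. That last part is a guess, not something the log confirms.

```python
import re

url = "http://mydomain.com/en%20Español.aspx"

# As a regex, "U+00F1" is: one or more 'U', then the literal characters "00F1".
notation_pattern = re.compile("U+00F1")

# "\u00f1" is the actual character ñ (code point U+00F1).
char_pattern = re.compile("\u00f1")

print(notation_pattern.search(url))   # None: the rule never fires on this URL
print(char_pattern.sub("%F1", url))   # http://mydomain.com/en%20Espa%F1ol.aspx

# An unmappable character is commonly replaced by '?' when text is encoded
# to a charset that does not contain it (ASCII is used here as an example).
print("Español".encode("ascii", errors="replace"))  # b'Espa?ol'
```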
>>> 
>>> 
>>> 
>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>> 
>>> 2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of 
>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>> java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Can anyone help me with this issue? Are there any other config 
>>> changes I need to make to get this to work?
>>> 
>>> 
>>> 
>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>> 
>>> 
>>> 
>>> thanks & regards,
>>> Rajesh Ramana
