Re: Nutch not crawling URLs with spanish accented characters ( ñ)

Ramanathapuram, Rajesh Thu, 06 Oct 2011 03:28:50 -0700

Just checking back: has anyone else had success working with global/modern 
character sets on NTLM-authenticated sites?



Thanks
Rajesh Ramana




On Oct 4, 2011, at 2:54 PM, "Ramanathapuram, Rajesh" 
<[email protected]> wrote:

> Hi Markus, 
> 
> My apologies, I got caught up in other projects this morning. 
> 
> Fetching the URL http://es.wikipedia.org/wiki/Espa%C3%B1olas with the parse 
> or index checker worked fine.
> 
> FYI, the site I am trying to crawl is a SharePoint site. I believe I need to 
> use ‘protocol-httpclient’ because the site requires NTLM authentication 
> credentials in the httpclient-auth.xml file. 
> 
> This is my view of the problem area:
> -    When crawling the seed page, Nutch collects all the links to be crawled. 
> -    The regex normalizer transforms the special characters, but fails to 
> substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’. 
> -    The fetcher has trouble interpreting links that contain the special 
> character ‘ñ’. 
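
The two escapes mentioned above, ‘%F1’ and ‘%C3%B1’, are the same character percent-encoded from different byte encodings: ISO-8859-1 encodes ñ as the single byte F1, while UTF-8 encodes it as the two bytes C3 B1. A minimal Java sketch of that difference follows. The class and method names are hypothetical and are not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Nutch class: percent-encodes one URL path
// segment using the bytes of the given charset.
class PercentEncodeDemo {
    static String encodePathSegment(String segment, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(charset)) {
            int v = b & 0xFF;
            // RFC 3986 "unreserved" characters are copied through unchanged.
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "en Espa\u00F1ol.aspx"; // \u00F1 is ñ
        // prints en%20Espa%C3%B1ol.aspx
        System.out.println(encodePathSegment(name, StandardCharsets.UTF_8));
        // prints en%20Espa%F1ol.aspx
        System.out.println(encodePathSegment(name, StandardCharsets.ISO_8859_1));
    }
}
```

Which of the two forms a server accepts depends on how it decodes request paths, so both are worth testing.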
> 
> I tried your suggestion, see details below. 
> 
> WITH ‘protocol-http’ 
>        $ bin/nutch org.apache.nutch.parse.ParserChecker 
> "http://mydomain.com/en Español.aspx"
>                ---------
>                Url
>                ---------------
>                http://mydomain.com/en Espa±ol.aspx---------
>                ParseData
>                ---------
>                Version: 5
>                Status: success(1,0)
>                Title: Bad Request
>                Outlinks: 0
>                Content Metadata: Date=Tue, 04 Oct 2011 18:30:45 GMT 
> Content-Length=311 Connection=close Content-Type=text/html; charset=us-ascii 
> Server=Mic
>                rosoft-HTTPAPI/2.0
>                Parse Metadata: CharEncodingForConversion=us-ascii 
> OriginalCharEncoding=us-ascii
>        
>        $ bin/nutch org.apache.nutch.parse.ParserChecker 
> "http://mydomain.com/en%20Espa%F1ol.aspx"
>                ---------
>                Url
>                ---------------
>                http://mydomain.com/en%20Espa%F1ol.aspx---------
>                ParseData
>                ---------
>                Version: 5
>                Status: success(1,0)
>                Title:
>                Outlinks: 0
>                Content Metadata: WWW-Authenticate=NTLM 
> X-SharePointHealthScore=3 Date=Tue, 04 Oct 2011 18:30:56 GMT 
> MicrosoftSharePointTeamServices=14.0.0.
>                4762 Content-Length=16 Connection=close 
> Content-Type=text/html; charset=utf-8 X-Powered-By=ASP.NET 
> Server=Microsoft-IIS/7.5 SPRequestGuid=77
>                bfafe3-9783-48eb-8d25-d0f30f54c8f9
>                Parse Metadata: CharEncodingForConversion=utf-8 
> OriginalCharEncoding=utf-8
> 
>        The results are the same as above for 
> "http://mydomain.com/en%20Espa%C3%B1ol.aspx"
> 
> 
> WITH ‘protocol-httpclient’ 
> 
>        $ bin/nutch org.apache.nutch.parse.ParserChecker 
> "http://mydomain.com/en Español.aspx"
>                Exception in thread "main" java.lang.NullPointerException
>                        at 
> org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>        
>        $ bin/nutch org.apache.nutch.parse.ParserChecker 
> "http://mydomain.com/en%20Espa%F1ol.aspx"
>                ---------
>                Url
>                ---------------
>                http:// mydomain.com/en%20Espa%F1ol.aspx---------
>                ParseData
>                ---------
>                Version: 5
>                Status: success(1,0)
>                Title: en Espa±ol
>                Outlinks: 182
>                ....
>                 outlink: toUrl: http://mydomain.com/Gilo.aspx anchor: Gilo
>                 outlink: toUrl: 
> http://mydomain.com/Asian,%20Asian%20names.aspx anchor: Asian, Asian names
>                 outlink: toUrl: http://mydomain.com/Albania.aspx anchor: 
> Albania
>                 outlink: toUrl: http://mydomain.com/Nuclear%20meltdown.aspx 
> anchor: Nuclear meltdown
>                 outlink: toUrl: http://mydomain.com/Arab%20Spring.aspx 
> anchor: Arab Spring
>                ....
>                Content Metadata: X-SharePointHealthScore=2 
> Persistent-Auth=true MicrosoftSharePointTeamServices=14.0.0.4762 
> Content-Length=65194 Exp
>                n, 19 Sep 2011 18:11:35 GMT Last-Modified=Tue, 04 Oct 2011 
> 18:11:35 GMT Set-Cookie=WSS_KeepSessionAuthenticated={027d0360-d7a6-4607-8
>                be3a76d75}; path=/ Server=Microsoft-IIS/7.5 
> X-Powered-By=ASP.NET Cache-Control=private, max-age=0 
> SPRequestGuid=48edc96d-8f0a-424d-a6
>                b844952c X-AspNet-Version=2.0.50727 Date=Tue, 04 Oct 2011 
> 18:11:35 GMT Content-Type=text/html; charset=utf-8
>                Parse Metadata: CharEncodingForConversion=utf-8 
> OriginalCharEncoding=utf-8
> 
>        The results are the same as above for 
> "http://mydomain.com/en%20Espa%C3%B1ol.aspx"
> 
> Please let me know. I'll also continue digging.
> 
> thanks & regards,
> Rajesh Ramana 
> 
> On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <[email protected]> 
> wrote:
> 
>> 
>>> Looks like you're using protocol-httpclient, try again with the 
>>> protocol-http plugin instead. We crawled a large part of Wikipedia 
>>> for test purposes and all global modern character sets worked just fine.
>>> 
>>> Can you fetch:
>>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>>> 
>>> with parse or index checker? It works fine here.
>> 
>> try bin/nutch org.apache.nutch.parse.ParserChecker <URL> with both 
>> protocol-httpclient and protocol-http.
>> 
>>> 
>>>> I am trying to crawl a website which has link(s) with Spanish/Latin 
>>>> characters in the URL filename. I can't get Nutch to crawl the 
>>>> page(s) with Spanish accented characters in the URL.
>>>> 
>>>> Link: http://mydomain.com/en Español.aspx   or
>>>> http://mydomain.com/en%20Español.aspx
>>>> 
>>>> 
>>>> 
>>>> I tried substituting the URL-encoded form (%F1) for the special character 
>>>> (ñ), with %20 for " "; the whole list is here: 
>>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
>>>> 
>>>> The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>>> browser.
>>>> 
>>>> 
>>>> 
>>>> I tried to use the regex URL normalizer to do the substitution in the 
>>>> regex-normalize.xml file as below (%20 for " " and %F1 for the 
>>>> special character ñ).
>>>> 
>>>> <!-- replaces blank space(" ") in URL with escaped "%20"  -->
>>>> <regex>
>>>>   <pattern> </pattern>
>>>>   <substitution>%20</substitution>
>>>> </regex>
>>>> 
>>>> <!-- replaces accented char("ñ") in URL with escaped "%F1"  -->
>>>> <regex>
>>>>   <pattern>ñ</pattern>
>>>>   <substitution>%F1</substitution>
>>>> </regex>
>>>> 
>>>> 
>>>> 
>>>> The former (blank space) substitution works fine, but I am having 
>>>> trouble with the latter (ñ) substitution: I get a FATAL error 
>>>> (pointing to the ñ location in the file) at the command prompt and 
>>>> the error below in my hadoop log.
>>>> 
>>>>    ERROR regex.RegexURLNormalizer - error parsing conf file:
>>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid 
>>>> byte 2 of 4-byte UTF-8 sequence.
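
One plausible reading of this error, assuming regex-normalize.xml was saved in ISO-8859-1 or Windows-1252 while the XML parser read it as UTF-8: the byte F1 that stands for ñ in those encodings is the lead byte of a 4-byte sequence in UTF-8, and the "<" that follows it is not a valid continuation byte. The sketch below reproduces that check with the JDK's strict decoder. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte sequence is well-formed UTF-8.
class ConfEncodingCheck {
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently replacing bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String element = "<pattern>\u00F1</pattern>"; // \u00F1 is ñ
        // Same text, written to disk in two different encodings.
        byte[] savedAsLatin1 = element.getBytes(StandardCharsets.ISO_8859_1);
        byte[] savedAsUtf8 = element.getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(savedAsLatin1)); // prints false
        System.out.println(isValidUtf8(savedAsUtf8));   // prints true
    }
}
```

If that is the cause, saving the file as UTF-8, or writing the character as the XML numeric reference &#xF1;, would remove the dependency on the editor's encoding.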
>>>> 
>>>> 
>>>> 
>>>> Then I tried changing the character encoding in nutch-site.xml file
>>>> 
>>>> <property>
>>>>   <name>parser.character.encoding.default</name>
>>>>   <value>ISO-8859-1</value>
>>>>   <description>The character encoding to fall back to when no other
>>>>   information is available</description>
>>>> </property>
>>>> 
>>>> And in the regex-normalize.xml file as below
>>>> 
>>>> <regex>
>>>>   <pattern>U+00F1</pattern>
>>>>   <substitution>%F1</substitution>
>>>> </regex>
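
Assuming the regex normalizer compiles each <pattern> with java.util.regex (an assumption about the plugin, not something shown in this thread), the pattern "U+00F1" is read as "one or more U, followed by 00F1" and not as the code point for ñ, so it would never match the accented character. The Java regex escape for that code point is \u00F1. A small sketch with a hypothetical helper:

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a single normalizer rule: apply one regex
// substitution to a URL string.
class PatternCheck {
    static String normalize(String url, String regex, String substitution) {
        return Pattern.compile(regex).matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        String url = "http://mydomain.com/en%20Espa\u00F1ol.aspx"; // contains ñ
        // "U+00F1" does not match ñ, so the URL comes back unchanged.
        System.out.println(normalize(url, "U+00F1", "%F1").equals(url)); // prints true
        // "\\u00F1" is the regex escape for ñ.
        // prints http://mydomain.com/en%20Espa%F1ol.aspx
        System.out.println(normalize(url, "\\u00F1", "%F1"));
    }
}
```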
>>>> 
>>>> 
>>>> 
>>>> Now I don't get any error at the command prompt, but I do get the 
>>>> error below in my hadoop log. It looks like the substitution is 
>>>> happening, but instead of "%F1" it uses "?".
>>>> 
>>>> 
>>>> 
>>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
>>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>>> 2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of 
>>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>>> java.lang.IllegalArgumentException: Invalid uri
>>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
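
The "?" in the failing URL is consistent with ñ having been converted to bytes in a charset that cannot represent it: the JDK substitutes "?" for unmappable characters. Whether this is what happens inside the normalizer or protocol-httpclient is an assumption. The sketch only demonstrates the JDK behaviour, with hypothetical class and method names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: round-trips a string through US-ASCII, which has no ñ.
class AsciiRoundTrip {
    static String throughAscii(String s) {
        // getBytes replaces every unmappable character with the byte for '?'.
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String url = "http://mydomain.com/en%20Espa\u00F1ol.aspx"; // contains ñ
        // prints http://mydomain.com/en%20Espa?ol.aspx
        System.out.println(throughAscii(url));
    }
}
```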
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Can anyone help me with this issue? Are there any other config 
>>>> changes I need to make to get this to work?
>>>> 
>>>> 
>>>> 
>>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>>> 
>>>> 
>>>> 
>>>> thanks & regards,
>>>> Rajesh Ramana
