Just checking back: has anyone else had success working with global/modern character sets on NTLM-authenticated sites?
Thanks
Rajesh Ramana

On Oct 4, 2011, at 2:54 PM, "Ramanathapuram, Rajesh" <[email protected]> wrote:

> Hi Markus,
>
> My apologies, I got caught up in other projects this morning.
>
> My fetch for the url http://es.wikipedia.org/wiki/Espa%C3%B1olas with parse
> or index checker worked fine.
>
> FYI, the site I am trying to crawl is a SharePoint site. I guess I need to
> use ‘protocol-httpclient’ as it needs NTLM authentication credentials in
> the httpclient-auth.xml file.
>
> This is my view of the problem area:
> - Crawling the seed page, Nutch gets all the links to be crawled.
> - The REGEX normalizer transforms the special characters, but fails to
>   substitute ‘%F1’ or ‘%C3%B1’ for ‘ñ’.
> - The fetcher is having trouble interpreting the links with the special
>   character ‘ñ’.
>
> I tried your suggestion, see details below.
>
> WITH ‘protocol-http’
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en Español.aspx"
> ---------
> Url
> ---------------
> http://mydomain.com/en Espa±ol.aspx---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Bad Request
> Outlinks: 0
> Content Metadata: Date=Tue, 04 Oct 2011 18:30:45 GMT
> Content-Length=311 Connection=close Content-Type=text/html; charset=us-ascii
> Server=Microsoft-HTTPAPI/2.0
> Parse Metadata: CharEncodingForConversion=us-ascii
> OriginalCharEncoding=us-ascii
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en%20Espa%F1ol.aspx"
> ---------
> Url
> ---------------
> http://mydomain.com/en%20Espa%F1ol.aspx---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: WWW-Authenticate=NTLM
> X-SharePointHealthScore=3 Date=Tue, 04 Oct 2011 18:30:56 GMT
> MicrosoftSharePointTeamServices=14.0.0.4762 Content-Length=16 Connection=close
> Content-Type=text/html; charset=utf-8 X-Powered-By=ASP.NET
> Server=Microsoft-IIS/7.5 SPRequestGuid=77bfafe3-9783-48eb-8d25-d0f30f54c8f9
> Parse Metadata: CharEncodingForConversion=utf-8
> OriginalCharEncoding=utf-8
>
> The results are like above for
> "http://mydomain.com/en%20Espa%C3%B1ol.aspx"
>
>
> WITH ‘protocol-httpclient’
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en Español.aspx"
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker "http://mydomain.com/en%20Espa%F1ol.aspx"
> ---------
> Url
> ---------------
> http://mydomain.com/en%20Espa%F1ol.aspx---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: en Espa±ol
> Outlinks: 182
> ....
> outlink: toUrl: http://mydomain.com/Gilo.aspx anchor: Gilo
> outlink: toUrl: http://mydomain.com/Asian,%20Asian%20names.aspx anchor: Asian, Asian names
> outlink: toUrl: http://mydomain.com/Albania.aspx anchor: Albania
> outlink: toUrl: http://mydomain.com/Nuclear%20meltdown.aspx anchor: Nuclear meltdown
> outlink: toUrl: http://mydomain.com/Arab%20Spring.aspx anchor: Arab Spring
> ....
> Content Metadata: X-SharePointHealthScore=2
> Persistent-Auth=true MicrosoftSharePointTeamServices=14.0.0.4762
> Content-Length=65194 Exp
> n, 19 Sep 2011 18:11:35 GMT Last-Modified=Tue, 04 Oct 2011
> 18:11:35 GMT Set-Cookie=WSS_KeepSessionAuthenticated={027d0360-d7a6-4607-8
> be3a76d75}; path=/ Server=Microsoft-IIS/7.5
> X-Powered-By=ASP.NET Cache-Control=private, max-age=0
> SPRequestGuid=48edc96d-8f0a-424d-a6
> b844952c X-AspNet-Version=2.0.50727 Date=Tue, 04 Oct 2011
> 18:11:35 GMT Content-Type=text/html; charset=utf-8
> Parse Metadata: CharEncodingForConversion=utf-8
> OriginalCharEncoding=utf-8
>
> The results are like above for
> "http://mydomain.com/en%20Espa%C3%B1ol.aspx"
>
> Please let me know. I'll also continue digging more.
>
> thanks & regards,
> Rajesh Ramana
>
> On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <[email protected]> wrote:
>
>>
>>> Looks like you're using protocol-httpclient, try again with the
>>> protocol-http plugin instead. We crawled a large part of Wikipedia
>>> for test purposes and all global modern character sets worked just fine.
>>>
>>> Can you fetch:
>>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>>>
>>> with parse or index checker? It works fine here.
>>
>> try bin/nutch org.apache.nutch.parse.ParserChecker <URL> with both
>> protocol-httpclient and protocol-http.
>>
>>>
>>>> I am trying to crawl a website which has link(s) with Spanish/Latin
>>>> characters in the URL filename. I can't get Nutch to crawl the
>>>> page(s) with Spanish accented chars in the URL.
>>>>
>>>> Link: http://mydomain.com/en Español.aspx
>>>> or http://mydomain.com/en%20Español.aspx
>>>>
>>>> I tried to substitute the URL encoding (%F1) for the special character
>>>> (ñ) (and %20 for " "); the whole list is here:
>>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
>>>>
>>>> The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>>> browser.
>>>>
>>>> I tried to use the regex URL normalizer to do the substitution in the
>>>> regex-normalize.xml file as below (%20 is for " " and %F1 is for the
>>>> special character ñ).
>>>>
>>>> <!-- replaces blank space(" ") in URL with escaped "%20" -->
>>>> <regex>
>>>>   <pattern> </pattern>
>>>>   <substitution>%20</substitution>
>>>> </regex>
>>>>
>>>> <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
>>>> <regex>
>>>>   <pattern>ñ</pattern>
>>>>   <substitution>%F1</substitution>
>>>> </regex>
>>>>
>>>> The former (blank space) substitution works fine, but I am having
>>>> trouble with the latter (ñ) substitution. I am getting a FATAL
>>>> error (pointing to the ñ location in the file) in the command prompt
>>>> and the below error in my hadoop log.
>>>>
>>>> ERROR regex.RegexURLNormalizer - error parsing conf file:
>>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid
>>>> byte 2 of 4-byte UTF-8 sequence.
>>>>
>>>> Then I tried changing the character encoding in the nutch-site.xml file:
>>>>
>>>> <property>
>>>>   <name>parser.character.encoding.default</name>
>>>>   <value>ISO-8859-1</value>
>>>>   <description>The character encoding to fall back to when no other
>>>>   information is available</description>
>>>> </property>
>>>>
>>>> And in the regex-normalize.xml file as below:
>>>>
>>>> <regex>
>>>>   <pattern>U+00F1</pattern>
>>>>   <substitution>%F1</substitution>
>>>> </regex>
>>>>
>>>> Now I don't get any error in the command prompt, but I get the below
>>>> error in my hadoop log. It looks like the substitution is happening,
>>>> but instead of "%F1" it uses "?".
>>>>
>>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid
>>>> uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not
>>>> valid
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>>> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>>> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>>> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>>> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>>> 2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of
>>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>>> java.lang.IllegalArgumentException: Invalid uri
>>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
>>>>
>>>> Can anyone help me with this issue? Are there any other config
>>>> changes I need to make to get this to work?
>>>>
>>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>>>
>>>> thanks & regards,
>>>> Rajesh Ramana
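
A note on the regex-normalize.xml parse error quoted above. The MalformedByteSequenceException is consistent with the file having been saved in a single-byte encoding such as ISO-8859-1 or Windows-1252 while the XML parser read it as UTF-8: in those encodings ñ is the single byte 0xF1, which UTF-8 treats as the first byte of a 4-byte sequence. That is an assumption about how the file was saved, not something confirmed in this thread. Separately, a pattern of U+00F1 is an ordinary regular expression in which + is a quantifier, so it matches text such as "U00F1" rather than the character ñ. A minimal sketch of one possible workaround, assuming the rest of the file follows the stock regex-normalize.xml layout, is to keep the file pure ASCII and write ñ as an XML character reference:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the stock regex-normalize.xml layout with a
     regex-normalize root element. Merge the rules into your existing file. -->
<regex-normalize>

  <!-- replaces blank space (" ") in URL with escaped "%20" -->
  <regex>
    <pattern> </pattern>
    <substitution>%20</substitution>
  </regex>

  <!-- replaces "ñ" (U+00F1) in URL with escaped "%F1".
       The character reference keeps this file ASCII-only, so the
       encoding the editor saves it in no longer matters.
       Use %C3%B1 instead if the server expects UTF-8 escapes. -->
  <regex>
    <pattern>&#xF1;</pattern>
    <substitution>%F1</substitution>
  </regex>

</regex-normalize>
```

Whether %F1 or %C3%B1 is the better substitution depends on the server. In the ParserChecker runs quoted above, both forms fetched successfully with protocol-httpclient once the URL was already escaped.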

