Hi Biswajit,

I do not find a single error caused by an authentication problem in the 'new.txt' file you attached to an earlier mail. Most of the entries are HTTP 404 or HTTP 302 responses. A 404 means the page is not available; a 302 means the page has been moved to another location, which the crawler would then try to fetch. There is nothing more I can do to help you in this matter. You have access to the network, so you are in a better position to analyze why this is happening.

Please do not send the same mail multiple times. As I have told you before, it takes time for members to respond, as they do so only in their free time.
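If it helps, here is a rough way to check the status codes in a log yourself. It is only a sketch in Python (not part of Nutch), and it assumes each DEBUG entry from httpclient.Http sits on a single line of the form "url: <URL>; status code: <N>; ...", as in the log excerpts quoted in this thread. The names tally_status_codes and SAMPLE are made up for illustration. A failed authentication would normally show up as HTTP 401, so a log with no 401 entries points away from an authentication problem.

```python
import re
from collections import Counter, defaultdict

# Assumed line format, based on the DEBUG lines quoted in this thread:
#   ... DEBUG httpclient.Http - url: <URL>; status code: <N>; bytes received: ...
STATUS_RE = re.compile(r"url:\s*(\S+);\s*status code:\s*(\d{3})")


def tally_status_codes(lines):
    """Return (Counter of status codes, dict mapping status code -> list of URLs)."""
    counts = Counter()
    urls = defaultdict(list)
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not a "url: ...; status code: ..." line
        url, code = match.group(1), int(match.group(2))
        counts[code] += 1
        urls[code].append(url)
    return counts, dict(urls)


# Two entries taken from the thread, joined onto one line each (an assumption
# about how they appear in the actual log file).
SAMPLE = [
    "2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: "
    "http://10.222.18.113:8080/robots.txt; status code: 404; "
    "bytes received: 985; Content-Length: 985",
    "2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: "
    "http://10.222.18.113:8080/dao/; status code: 200; "
    "bytes received: 1941; Content-Length: 1941",
]

if __name__ == "__main__":
    counts, urls = tally_status_codes(SAMPLE)
    for code in sorted(counts):
        print(code, counts[code], urls[code])
    if counts[401] == 0:
        print("No 401 responses found, so nothing here suggests failed authentication.")
```

For a real log, an open file can be passed in place of SAMPLE, since the function only iterates over lines.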
Regards,
Susam Pal

On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout <[EMAIL PROTECTED]> wrote:
>
> Hi Susam,
>
> Please give a look into the attached file (new.txt) and suggest a solution
> for this. This time i have crawled another site. I am able to crawl all the
> public pages but password protected pages crawling is not happening...
>
> Best regards,
> Biswajit.
>
>
> biswajit_rout wrote:
>>
>> Hi,
>>
>> There is nothing to crawl in the home page of
>> http://10.222.18.113:8080/dao/.
>>
>> So this time i have crawled another site. I have successfully crawled all
>> the public pages but not able to crawl private pages.
>> I have attached a log file(new.log). Can you please check and let me know
>> what needs to be done from my end???
>>
>> Best regards,
>> Biswajit.
>>
>>
>> Susam Pal wrote:
>>>
>>> The log file shows only one fetching line:
>>>
>>> 2008-09-16 20:46:15,321 INFO fetcher.Fetcher - fetching
>>> http://10.222.18.113:8080/dao/
>>>
>>> This has been fetched successfully. There is no other page being
>>> fetched. Have you set up Nutch properly so that it can fetch all the
>>> pages you need? If it tries to fetch a page but fails due to
>>> authentication, then it is a problem with authentication.
>>>
>>> In this case, it is not even attempting to fetch those pages. So, the
>>> problem lies elsewhere. You need to first find out why it is fetching
>>> only one page and not others.
>>>
>>> Regards,
>>> Susam Pal
>>>
>>> On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>> But still it is not crawling the password protected pages...
>>>>
>>>> Regards,
>>>> Biswajit.
>>>>
>>>>
>>>> Susam Pal wrote:
>>>>>
>>>>> The latest log shows that the page from the URL:
>>>>> http://10.222.18.113:8080/dao/ has been fetched successfully.
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>>
>>>>> On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hi Susam,
>>>>>>
>>>>>> Please find the latest log file(latest.log), which shows different
>>>>>> error.
>>>>>>
>>>>>> 2008-09-16 20:46:16,102 DEBUG httpclient.Http - url:
>>>>>> http://10.222.18.113:8080/robots.txt; status code: 404; bytes
>>>>>> received:
>>>>>> 985;
>>>>>> Content-Length: 985
>>>>>> 2008-09-16 20:46:16,384 DEBUG httpclient.Http - url:
>>>>>> http://10.222.18.113:8080/dao/; status code: 200; bytes received:
>>>>>> 1941;
>>>>>> Content-Length: 1941
>>>>>>
>>>>>> Thanks in advance...
>>>>>>
>>>>>> Best regards,
>>>>>> Biswajit.
>>>>>>
>>>>>>
>>>>>> biswajit_rout wrote:
>>>>>>>
>>>>>>> Hi Susam,
>>>>>>>
>>>>>>> Thanks for your immediate response...
>>>>>>> Herewith i am attaching the debug enabled log
>>>>>>> file(debugenabled_hadoop.log). Kindly go through the file and let me
>>>>>>> know
>>>>>>> what is missing from my end...
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Biswajit.
>>>>>>>
>>>>>>>
>>>>>>> Susam Pal wrote:
>>>>>>>>
>>>>>>>> Hi Biswajit,
>>>>>>>>
>>>>>>>> The authscope specifies which IP address or domain-name would the
>>>>>>>> credentials be used for. If you provide 10.222.18.113 in the
>>>>>>>> authscope, the credentials would not be used for localhost even
>>>>>>>> though
>>>>>>>> both represent the same machine.
>>>>>>>>
>>>>>>>> Please provide logs with DEBUG enabled.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Susam Pal
>>>>>>>>
>>>>>>>> On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Susam,
>>>>>>>>>
>>>>>>>>> The ip 10.222.18.113 is nothing but the ip address of my
>>>>>>>>> machine(localhost).
>>>>>>>>> Now also i changed http://localhost:8080/ to
>>>>>>>>> http://10.222.18.113:8080.
>>>>>>>>> However no result, i mean to say still not able to crawl password
>>>>>>>>> protected
>>>>>>>>> pages.
>>>>>>>>>
>>>>>>>>> Kindly assist me to resolve this issue.
>>>>>>>>>
>>>>>>>>> Thanks in advance...
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Biswajit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>
>>>>>>>>>> The logs show that it is fetching http://localhost:8080/ but you
>>>>>>>>>> have
>>>>>>>>>> set credentials for 10.222.18.113:8080 which is never being
>>>>>>>>>> fetched.
>>>>>>>>>> So, no authentication takes place.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Susam Pal
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
>>>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Susam,
>>>>>>>>>>>
>>>>>>>>>>> In order to crawl password protected pages, I am using
>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (I have download from
>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ which
>>>>>>>>>>> contains
>>>>>>>>>>> your
>>>>>>>>>>> patch for HttpAuthentication)
>>>>>>>>>>>
>>>>>>>>>>> I have modified nutch-site.xml, httpclient-auth.xml.
>>>>>>>>>>>
>>>>>>>>>>> Please find the attached zip file which contains
>>>>>>>>>>> nutch-site.xml,httpclient-auth.xml.
>>>>>>>>>>>
>>>>>>>>>>> Kindly provide me a solution for this.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Biswajit
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Susam Pal wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Biswajit,
>>>>>>>>>>>>
>>>>>>>>>>>> Could you please tell us how you have added the support for
>>>>>>>>>>>> authentication in Nutch 0.9? Nutch 0.9 can not do authentication
>>>>>>>>>>>> properly by default.
>>>>>>>>>>>> The authentication feature is buggy in
>>>>>>>>>>>> Nutch
>>>>>>>>>>>> 0.9
>>>>>>>>>>>> which was fixed with this ticket:
>>>>>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-559
>>>>>>>>>>>>
>>>>>>>>>>>> The feature is documented here:
>>>>>>>>>>>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>>>>>>>>>>>
>>>>>>>>>>>> The easiest way to use it is to check out the latest version of
>>>>>>>>>>>> Nutch
>>>>>>>>>>>> and build it as it contains the authentication feature. If you
>>>>>>>>>>>> want
>>>>>>>>>>>> to
>>>>>>>>>>>> use it with Nutch 0.9, you have to download the latest patch
>>>>>>>>>>>> present
>>>>>>>>>>>> in the ticket page and apply it to the source code and build it.
>>>>>>>>>>>> You
>>>>>>>>>>>> might have to resolve some conflicts manually.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest that you do not send the mail same mail multiple
>>>>>>>>>>>> times. We have received the same mail from you 4 times. It takes
>>>>>>>>>>>> sometime for members to reply to a mail. :-)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Susam Pal
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
>>>>>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have successfully configured NUTCH 0.9, which is crawling
>>>>>>>>>>>>> number
>>>>>>>>>>>>> of
>>>>>>>>>>>>> sites
>>>>>>>>>>>>> and after that searching is also happening properly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, now I want to crawl password protected pages using
>>>>>>>>>>>>> NUTCH.
>>>>>>>>>>>>> In
>>>>>>>>>>>>> order
>>>>>>>>>>>>> to access those pages I should have a valid user name and
>>>>>>>>>>>>> password.
>>>>>>>>>>>>> I
>>>>>>>>>>>>> have
>>>>>>>>>>>>> configured the user name and password in my nutch-site.xml and
>>>>>>>>>>>>> httpclient-auth.xml
>>>>>>>>>>>>>
>>>>>>>>>>>>> However it is not crawling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have attached nutch-site.xml, httpclient-auth.xml and
>>>>>>>>>>>>> hadoop.log
>>>>>>>>>>>>> in
>>>>>>>>>>>>> the
>>>>>>>>>>>>> Zip file for your reference. Kindly check and let me know what
>>>>>>>>>>>>> is
>>>>>>>>>>>>> missing
>>>>>>>>>>>>> from my end.
>>>>>>>>>>>>>
>>>>>>>>>>>>> CONFIGURATION:
>>>>>>>>>>>>>
>>>>>>>>>>>>> nutch-2008-07-10_04-01-48.tar (I have download from
>>>>>>>>>>>>> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ which
>>>>>>>>>>>>> contains
>>>>>>>>>>>>> your
>>>>>>>>>>>>> patch for HttpAuthentication)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Windows XP
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cygwin
>>>>>>>>>>>>>
>>>>>>>>>>>>> jdk1.6.0
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please help....
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Biswajit
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context:
>>>>>>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
>>>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19507146.html
>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
>>>>>>> debugenabled_hadoop.log
>>>>>>>
>>>>>> http://www.nabble.com/file/p19514374/latest.log latest.log
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19514374.html
>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19516409.html
>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>>
>> http://www.nabble.com/file/p19552519/new.txt new.txt
>>
>
> --
> View this message in context:
> http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19566502.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
