Hi,

I had already tried removing the authscope tag as well, but that turned out not to be the problem. The actual cause is that the pages I am trying to crawl use POST-based authentication.

Any suggestions on how to crawl pages that use POST-based authentication?
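
To make the distinction concrete, here is a rough sketch (plain Python, not Nutch code) of the two kinds of request. As far as I understand it, httpclient-auth.xml only supplies credentials for HTTP-level schemes, where they travel in an Authorization header, whereas a login form expects them as fields in a POST body. The form field names and the login URL below are assumptions for illustration only; the real ones would have to be read from the login page's HTML.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def basic_auth_request(url, username, password):
    """Build a GET request that carries HTTP Basic credentials in a header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})


def form_login_request(login_url, fields):
    """Build a POST request that carries credentials as form fields in the body."""
    body = urlencode(fields).encode("ascii")
    return Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # HTTP-level authentication: credentials are sent in a request header.
    get_req = basic_auth_request("http://10.2.44.34:8088/xwiki/", "xyz", "xyz")
    print(get_req.get_method(), get_req.get_header("Authorization"))

    # Form-based authentication: credentials are sent in the POST body.
    # Field names and form URL are hypothetical placeholders.
    post_req = form_login_request(
        "http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin",
        {"j_username": "xyz", "j_password": "xyz"},
    )
    print(post_req.get_method(), post_req.data)
```

Neither request is actually sent here. With a form login the server would normally answer the POST with a session cookie, and every later fetch would have to send that cookie back, which is the part the crawler would need to handle.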

bye,
kranthi

On Wed, Sep 9, 2009 at 8:55 PM, David M. Cole <[email protected]> wrote:

> kranthi:
>
> i would try removing the authscope tag from the httpclient-auth.xml. though
> in my case i'm not going to an alternate port and you are, my working file
> does not have an authscope tag.
>
> if that doesn't help, since you are crawling an intranet, do you have
> access to the http server's log? seeing that might help.
>
> \dmc
>
>
>
> At 4:04 PM +0530 9/9/09, kranthi reddy wrote:
>
>> Hi all,
>>
>> I am trying to crawl password protected web pages present in our intranet.
>> I don't know the reason why "*401 Authentication Required*" error creeps
>> up.
>> I have gone through the previous mails sent by others, but it is not
>> getting
>> resolved.
>>
>> Below are the configuration files i have modified as told in "
>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes"
>>
>> My Url file contains single url *"http://10.2.44.34:8088/xwiki/" *(This
>> url is actually being redirect to "*
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=CDsTIqqN*")
>>
>> *"httpclient-auth.xml* "
>>
>> <credentials username="xyz" password="xyz">
>>   <default/>
>>   <authscope host="10.2.44.34" port="8088"/>
>> </credentials>
>>
>> *"nutch-default.xml"*
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>*protocol-httpclient|*
>>
>> urlfilter-regex|parse-(text|html|js|zip)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
>>
>> summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> *OutPut Printed to Terminal*
>>
>> Fetcher: Your 'http.agent.name' value should be listed first in
>> 'http.robots.agents' property.
>> Fetcher: starting
>> Fetcher: segment: crawl/segments/20090909151219
>> Fetcher: threads: 10
>> QueueFeeder finished: total 1 records.
>> fetching http://10.2.44.34:8088/xwiki/
>> http.proxy.host = null
>> http.proxy.port = 8080
>> http.timeout = 10000
>> http.content.limit = -1
>> http.agent = iiith/Nutch-1.0 ([email protected])
>> protocol.plugin.check.blocking = false
>> protocol.plugin.check.robots = false
>> *Credentials - username: superadmin; set as default for realm: ; scheme:*
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> *Credentials - username: superadmin; set for AuthScope - host: 10.2.44.34;
>> port: 8088; realm: ; scheme:
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url: http://10.2.44.34:8088/robots.txt
>> url: http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
>> 6739; Content-Length: 6739
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url: http://10.2.44.34:8088/xwiki/
>> url: http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>> Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>> -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>> * queue: http://10.2.44.34
>> maxThreads = 1
>> inProgress = 0
>> crawlDelay = 1000
>> minCrawlDelay = 0
>> nextFetchTime = 1252489344874
>> now = 1252489344577
>> 0. http://10.2.44.34:8088/xwiki/bin/view/Main/
>> *fetching http://10.2.44.34:8088/xwiki/bin/view/Main/
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url: http://10.2.44.34:8088/xwiki/bin/view/Main/
>> url: http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>> received: 0; Content-Length: 0; Location:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX*
>> -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=1
>> * queue: http://10.2.44.34
>> maxThreads = 1
>> inProgress = 0
>> crawlDelay = 1000
>> minCrawlDelay = 0
>> nextFetchTime = 1252489345884
>> now = 1252489345578
>> 0. http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>> *fetching
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>> Pre-configured credentials with scope - host: 10.2.44.34; port: 8088;
>> found
>> for url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX
>> url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=yjACAWWX;
>> status code: 401; bytes received: 6739; Content-Length: 6739
>> 401 Authentication Required*
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: done
>>
>>
>>
>> *LOG FILE IS*
>>
>>
>> 2009-09-09 15:46:55,602 INFO fetcher.Fetcher - fetching
>> http://10.2.44.34:8088/xwiki/
>> 2009-09-09 15:46:55,657 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2009-09-09 15:46:55,657 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2009-09-09 15:46:55,691 INFO httpclient.Http - http.proxy.host = null
>> 2009-09-09 15:46:55,691 INFO httpclient.Http - http.proxy.port = 8080
>> 2009-09-09 15:46:55,691 INFO httpclient.Http - http.timeout = 10000
>> 2009-09-09 15:46:55,691 INFO httpclient.Http - http.content.limit = -1
>> 2009-09-09 15:46:55,691 INFO httpclient.Http - http.agent =
>> iiith/Nutch-1.0
>> ([email protected])
>> 2009-09-09 15:46:55,691 INFO httpclient.Http -
>> protocol.plugin.check.blocking = false
>> 2009-09-09 15:46:55,691 INFO httpclient.Http -
>> protocol.plugin.check.robots
>> = false
>> 2009-09-09 15:46:55,695 DEBUG httpclient.Http - Credentials - username:
>> superadmin; set as default for realm: ; scheme:
>> 2009-09-09 15:46:55,697 DEBUG httpclient.Http - Credentials - username:
>> superadmin; set for AuthScope - host: 10.2.44.34; port: 8088; realm: ;
>> scheme:
>> *2009-09-09 15:46:55,697 DEBUG httpclient.Http - Pre-configured
>> credentials
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/robots.txt
>> 2009-09-09 15:46:55,942 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/robots.txt; status code: 401; bytes received:
>> 6739;
>> Content-Length: 6739
>> 2009-09-09 15:46:55,943 DEBUG httpclient.Http - Pre-configured credentials
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/xwiki/
>> 2009-09-09 15:46:55,946 INFO httpclient.HttpMethodDirector - Redirect
>> requested but followRedirects is disabled
>> 2009-09-09 15:46:55,946 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/xwiki/; status code: 302; bytes received: 0;
>> Content-Length: 0; Location: http://10.2.44.34:8088/xwiki/bin/view/Main/*
>> 2009-09-09 15:46:56,657 INFO fetcher.Fetcher - -activeThreads=1,
>> spinWaiting=1, fetchQueues.totalSize=1
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - * queue:
>> http://10.2.44.34
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - maxThreads = 1
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - inProgress = 0
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - crawlDelay = 1000
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - minCrawlDelay = 0
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - nextFetchTime =
>> 1252491417050
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - now =
>> 1252491416658
>> 2009-09-09 15:46:56,658 INFO fetcher.Fetcher - 0.
>> http://10.2.44.34:8088/xwiki/bin/view/Main/
>> 2009-09-09 15:46:57,051 INFO fetcher.Fetcher - fetching
>> http://10.2.44.34:8088/xwiki/bin/view/Main/
>> *2009-09-09 15:46:57,051 DEBUG httpclient.Http - Pre-configured
>> credentials
>>
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/xwiki/bin/view/Main/
>> 2009-09-09 15:46:57,056 INFO httpclient.HttpMethodDirector - Redirect
>> requested but followRedirects is disabled
>> 2009-09-09 15:46:57,057 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/xwiki/bin/view/Main/; status code: 302; bytes
>> received: 0; Content-Length: 0; Location:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1*
>> 2009-09-09 15:46:57,658 INFO fetcher.Fetcher - -activeThreads=1,
>> spinWaiting=1, fetchQueues.totalSize=1
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - * queue:
>> http://10.2.44.34
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - maxThreads = 1
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - inProgress = 0
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - crawlDelay = 1000
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - minCrawlDelay = 0
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - nextFetchTime =
>> 1252491418057
>> 2009-09-09 15:46:57,659 INFO fetcher.Fetcher - now =
>> 1252491417659
>> *2009-09-09 15:46:57,659 INFO fetcher.Fetcher - 0.
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>> 2009-09-09 15:46:58,058 INFO fetcher.Fetcher - fetching
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>> 2009-09-09 15:46:58,058 DEBUG httpclient.Http - Pre-configured credentials
>> with scope - host: 10.2.44.34; port: 8088; found for url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1
>> 2009-09-09 15:46:58,170 DEBUG httpclient.Http - url:
>> http://10.2.44.34:8088/xwiki/bin/login/XWiki/XWikiLogin?srid=2h453tM1;
>> status code: 401; bytes received: 6739; Content-Length: 6739
>> 2009-09-09 15:46:58,180 DEBUG httpclient.Http - 401 Authentication
>> Required*
>> 2009-09-09 15:46:58,180 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=0
>> 2009-09-09 15:46:58,659 INFO fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2009-09-09 15:46:58,659 INFO fetcher.Fetcher - -activeThreads=0
>>
>>
>> Thank you in advance,
>>
>> bye,
>> Kranthi Reddy. B
>>
>
>
> --
>
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
> David M. Cole
> [email protected]
> Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
> Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
>
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>
