You cannot do so by configuration. The site owner hasn't configured a robots.txt for nothing, and politeness is something we must adhere to.
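If you want to see for yourself why the fetcher skips that URL: as far as I remember, Nutch 1.7 delegates robots.txt parsing to the crawler-commons library, so you can reproduce the decision with a few lines of standalone Java. This is only a sketch, not Nutch code: the class name RobotsCheck is a placeholder, the agent string "Blah" is just the value from your config below, and it assumes the crawler-commons jar (shipped in Nutch's lib/ directory) is on the classpath.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Sketch: fetch the site's robots.txt and test whether a given agent may
// fetch the page -- the same decision the Nutch fetcher makes before fetching.
public class RobotsCheck {
  public static void main(String[] args) throws Exception {
    String robotsUrl = "https://intuitmarket.intuit.com/robots.txt";
    String pageUrl =
        "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3";

    // Download robots.txt into a byte array
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (InputStream in = new URL(robotsUrl).openStream()) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
    }

    // Parse the rules for your agent name (the value of http.agent.name)
    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    BaseRobotRules rules =
        parser.parseContent(robotsUrl, buf.toByteArray(), "text/plain", "Blah");

    // Prints "allowed: false" if a Disallow rule matches the page
    System.out.println("allowed: " + rules.isAllowed(pageUrl));
  }
}

If that prints allowed: false for your agent, the fetcher will skip the URL as denied by robots.txt, and the polite fix is to ask the site owner to allow your agent, not to disable the check.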
Markus

-----Original message-----
> From: Anup Kuri, Vincent <vincent_anupk...@intuit.com>
> Sent: Tuesday 9th July 2013 12:24
> To: user@nutch.apache.org
> Subject: RE: Regarding crawling https links
>
> How can I make Nutch ignore the robots.txt file?
>
> Regards,
> Vincent Anup Kuri
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, July 09, 2013 3:46 PM
> To: user@nutch.apache.org
> Subject: RE: Regarding crawling https links
>
> That's because the checker tools do not use robots.txt.
>
> -----Original message-----
> > From: Anup Kuri, Vincent <vincent_anupk...@intuit.com>
> > Sent: Tuesday 9th July 2013 12:14
> > To: user@nutch.apache.org
> > Subject: RE: Regarding crawling https links
> >
> > That's for the .asp file. When I used ParserChecker, it works perfectly:
> >
> > bin/nutch org.apache.nutch.parse.ParserChecker "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3"
> >
> > Regards,
> > Vincent Anup Kuri
> >
> > -----Original Message-----
> > From: Canan GİRGİN [mailto:canankara...@gmail.com]
> > Sent: Tuesday, July 09, 2013 2:19 PM
> > To: user@nutch.apache.org
> > Subject: Re: Regarding crawling https links
> >
> > I think the problem is the robots.txt: the robots.txt file [1] for this website denies
> > https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3
> >
> > Disallow: /fsg/Home.asp
> >
> > [1]: https://intuitmarket.intuit.com/robots.txt
> >
> > On Tue, Jul 9, 2013 at 6:50 AM, Anup Kuri, Vincent <vincent_anupk...@intuit.com> wrote:
> >
> > > Hi all,
> > >
> > > I have been trying to crawl the following link, "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3", using Nutch 1.7.
> > > I somehow got it to work after switching to Unix, and it crawls http links perfectly. After reading around, I found that, in order to crawl https links, we need to add the following to nutch-site.xml:
> > >
> > > "<property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >   <description>Regular expression naming plugin directory names to
> > >   include. Any plugin not matching this expression is excluded.
> > >   In any case you need at least include the nutch-extensionpoints plugin. By
> > >   default Nutch includes crawling just HTML and plain text via HTTP,
> > >   and basic indexing and search plugins. In order to use HTTPS please enable
> > >   protocol-httpclient, but be aware of possible intermittent problems with the
> > >   underlying commons-httpclient library.
> > >   </description>
> > > </property>"
> > >
> > > I also changed the following in nutch-default.xml, giving some arbitrary value to each property:
> > >
> > > "<property>
> > >   <name>http.agent.name</name>
> > >   <value>Blah</value>
> > >   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> > >   please set this to a single word uniquely related to your organization.
> > >
> > >   NOTE: You should also check other related properties:
> > >
> > >     http.robots.agents
> > >     http.agent.description
> > >     http.agent.url
> > >     http.agent.email
> > >     http.agent.version
> > >
> > >   and set their values appropriately.
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>http.robots.agents</name>
> > >   <value>Blah</value>
> > >   <description>The agent strings we'll look for in robots.txt files,
> > >   comma-separated, in decreasing order of precedence. You should
> > >   put the value of http.agent.name as the first agent name, and keep the
> > >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> > >   </description>
> > > </property>"
> > >
> > > After that I proceeded to crawl with the following command:
> > >
> > > bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
> > >
> > > The logs are available at the following link:
> > >
> > > http://pastebin.com/e7JEcEjV
> > >
> > > My stats show that only one link was crawled, whose min and max scores are both 1. When I read the segment that was crawled, I got the following:
> > >
> > > http://pastebin.com/D83D5BeX
> > >
> > > I have also checked the website's robots.txt file. My friend is doing the same thing, but using Nutch 1.2 on Windows, with the exact same changes as mine, and it's working.
> > >
> > > Hoping for a really quick reply as this is urgent.
> > >
> > > Regards,
> > > Vincent Anup Kuri