You cannot do so by configuration. The site owner hasn't configured a 
robots.txt for nothing, and politeness is something we must adhere to. 
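
If you just want to confirm which rule is blocking you, a quick check outside 
Nutch will do. The following is only a minimal sketch using Python's standard 
urllib.robotparser; "Blah" is simply the placeholder agent name from the 
configuration quoted further down in this thread, so substitute whatever you 
actually set in http.agent.name / http.robots.agents.

    from urllib import robotparser  # Python 3 standard library

    # Load the site's robots.txt (the file referenced later in this thread).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://intuitmarket.intuit.com/robots.txt")
    rp.read()

    url = "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3"

    # "Blah" is the placeholder agent from the quoted configuration below;
    # replace it with your real http.agent.name value.
    print(rp.can_fetch("Blah", url))  # what your configured agent may fetch
    print(rp.can_fetch("*", url))     # what anonymous crawlers may fetch

If this prints False for your agent, the fetcher is simply honouring the 
site's rules, which is exactly the behaviour discussed below.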

Markus
 
-----Original message-----
> From:Anup Kuri, Vincent <vincent_anupk...@intuit.com>
> Sent: Tuesday 9th July 2013 12:24
> To: user@nutch.apache.org
> Subject: RE: Regarding crawling https links
> 
> How can I make nutch ignore robots.txt file?
> 
> Regards,
> Vincent Anup Kuri
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Tuesday, July 09, 2013 3:46 PM
> To: user@nutch.apache.org
> Subject: RE: Regarding crawling https links
> 
> That's because the checker tools do not use robots.txt.
>  
> -----Original message-----
> > From:Anup Kuri, Vincent <vincent_anupk...@intuit.com>
> > Sent: Tuesday 9th July 2013 12:14
> > To: user@nutch.apache.org
> > Subject: RE: Regarding crawling https links
> > 
> > That's for the .asp file. When I use ParserChecker, it works 
> > perfectly,
> > 
> > bin/nutch org.apache.nutch.parse.ParserChecker 
> > "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3";
> > 
> > 
> > Regards,
> > Vincent Anup Kuri
> > 
> > -----Original Message-----
> > From: Canan GİRGİN [mailto:canankara...@gmail.com]
> > Sent: Tuesday, July 09, 2013 2:19 PM
> > To: user@nutch.apache.org
> > Subject: Re: Regarding crawling https links
> > 
> > I think the problem is about robots.txt:
> > the robots.txt file [1] for this website denies
> > https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3
> > 
> > Disallow: /fsg/Home.asp
> > 
> > 
> > [1]:https://intuitmarket.intuit.com/robots.txt
> > 
> > 
> > On Tue, Jul 9, 2013 at 6:50 AM, Anup Kuri, Vincent < 
> > vincent_anupk...@intuit.com> wrote:
> > 
> > > Hi all,
> > >
> > > So I have been trying to crawl the following link,
> > > "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3",
> > > using Nutch 1.7.
> > > I somehow got it to work after switching to Unix. It crawls http links 
> > > perfectly. So, after reading around, I found that, in order to crawl 
> > > https links, we need to add the following to nutch-site.xml.
> > >
> > > "<property>
> > >   <name>plugin.includes</name>
> > >
> > > <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >   <description>Regular expression naming plugin directory names to
> > >   include.  Any plugin not matching this expression is excluded.
> > >   In any case you need at least include the nutch-extensionpoints plugin.
> > > By
> > >   default Nutch includes crawling just HTML and plain text via HTTP,
> > >   and basic indexing and search plugins. In order to use HTTPS 
> > > please enable
> > >   protocol-httpclient, but be aware of possible intermittent 
> > > problems with the
> > >   underlying commons-httpclient library.
> > >   </description>
> > > </property>"
> > >
> > > I also changed the following in the nutch-default.xml, giving some 
> > > arbitrary value to each property,
> > >
> > > "<property>
> > >   <name>http.agent.name</name>
> > >   <value>Blah</value>
> > >   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> > >   please set this to a single word uniquely related to your organization.
> > >
> > >   NOTE: You should also check other related properties:
> > >
> > >         http.robots.agents
> > >         http.agent.description
> > >         http.agent.url
> > >         http.agent.email
> > >         http.agent.version
> > >
> > >   and set their values appropriately.
> > >
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>http.robots.agents</name>
> > >   <value>Blah</value>
> > >   <description>The agent strings we'll look for in robots.txt files,
> > >   comma-separated, in decreasing order of precedence. You should
> > >   put the value of http.agent.name as the first agent name, and keep the
> > >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> > >   </description>
> > > </property>"
> > >
> > > After that I proceeded to crawl with the following command, 
> > > bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
> > >
> > > The logs are present at the following link,
> > >
> > > http://pastebin.com/e7JEcEjV
> > >
> > > My stats show that only one link was crawled, whose min and max scores 
> > > are all 1. When I read the segment that was crawled, I got the 
> > > following,
> > >
> > > http://pastebin.com/D83D5BeX
> > >
> > > I have checked the website's robots.txt file as well. My friend 
> > > is doing the same thing, but using Nutch 1.2 on Windows, with the 
> > > exact same changes as mine, and it's working.
> > >
> > > Hoping for a really quick reply, as this is urgent.
> > >
> > > Regards,
> > > Vincent Anup Kuri
> > >
> > >
> > 
> 
