I have been setting it to 30. The other option is to set it to -1, is that
what you are suggesting?

Thanks.

Sent from my HTC

----- Reply message -----
From: "Julien Nioche" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Problem with crawling macys robots.txt
Date: Wed, Jun 4, 2014 5:48 AM

That's why we have fetcher.max.crawl.delay: if a ridiculously large value
is set, at least you won't be slowed down too much. See
https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L693
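
For illustration only, a minimal sketch of the decision this property
controls (not the actual Nutch fetcher code; the numbers are the ones
discussed in this thread):

  // Illustrative sketch, not the actual org.apache.nutch.fetcher code.
  public class MaxCrawlDelaySketch {
      public static void main(String[] args) {
          long robotsDelayMs = 120 * 1000L;  // Crawl-delay: 120 from macys robots.txt
          long maxCrawlDelayMs = 30 * 1000L; // fetcher.max.crawl.delay default (30 s)

          if (maxCrawlDelayMs >= 0 && robotsDelayMs > maxCrawlDelayMs) {
              // The URL is skipped and recorded as an error -- this is what
              // shows up further down the thread as robots_denied / db_gone.
              System.out.println("skip: Crawl-Delay exceeds fetcher.max.crawl.delay");
          } else {
              // With -1 (or a cap of 120 s or more) the fetcher honours the
              // full robots.txt delay, however long that is.
              System.out.println("wait " + robotsDelayMs + " ms between fetches");
          }
      }
  }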



On 4 June 2014 05:10, S.L <[email protected]> wrote:

> Out of curiosity, what if one needs to set rules of politeness that are
> more realistic, i.e. if I want to cap the crawl-delay at a certain maximum
> value regardless of what a particular site specifies, which Java class
> should I be looking to change, assuming this cannot be achieved using the
> config parameters? Thanks.
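
A sketch of the clamping idea only, not pointing at a specific Nutch class:
with crawler-commons you can read the parsed delay and cap it yourself
(MAX_DELAY_MS and the agent name below are made up for the example):

  import crawlercommons.robots.BaseRobotRules;
  import crawlercommons.robots.SimpleRobotRulesParser;
  import java.nio.charset.StandardCharsets;

  // Sketch: clamp the robots.txt Crawl-delay to a hard maximum instead of
  // skipping the page. MAX_DELAY_MS is a made-up constant, not a Nutch setting.
  public class ClampCrawlDelaySketch {
      static final long MAX_DELAY_MS = 30_000L;

      public static void main(String[] args) {
          String robotsTxt = "User-agent: *\nCrawl-delay: 120\n";
          BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                  "http://www1.macys.com/robots.txt",
                  robotsTxt.getBytes(StandardCharsets.UTF_8),
                  "text/plain", "my-crawler");

          long robotsDelayMs = rules.getCrawlDelay(); // crawler-commons reports milliseconds
          long effectiveMs = Math.min(robotsDelayMs, MAX_DELAY_MS);
          System.out.println("robots.txt asks for " + robotsDelayMs
                  + " ms, clamped to " + effectiveMs + " ms");
      }
  }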
>
>
> On Tue, Jun 3, 2014 at 5:52 PM, Sebastian Nagel <
> [email protected]>
> wrote:
>
> > > though, I wonder if anyone uses Nutch in production and how they
> > > overcome this limitation imposed by sites like macys.com that specify
> > > a Crawl-Delay?
> >
> > If you follow the rules of politeness, there is no way to overcome the
> > crawl-delay from robots.txt: crawling will be horribly slow. So slow
> > that completeness and freshness seem unreachable targets. But maybe
> > that's exactly the intention of the site owner.
> >
> > On 06/03/2014 04:29 PM, S.L wrote:
> > > That's a good piece of info, Nima. It means you won't be able to crawl
> > > more than 720 pages in 24 hours (86,400 s / 120 s per fetch). This
> > > sounds like a pretty serious limitation, though. I wonder if anyone
> > > uses Nutch in production and how they overcome this limitation imposed
> > > by sites like macys.com that specify a Crawl-Delay?
> > >
> > >
> > >
> > >
> > > On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <[email protected]>
> > wrote:
> > >
> > >> Never mind, I figured it out. I adjusted my fetcher.max.crawl.delay
> > >> accordingly and it solved the issue. Macys.com has a Crawl-delay of
> > >> 120, while Nutch's default fetcher.max.crawl.delay is 30, so I had to
> > >> change that and it worked. You must set fetcher.max.crawl.delay either
> > >> to -1 (something I don't recommend, but I did for example purposes) or
> > >> to over 120 in order to crawl macys.com:
> > >>
> > >> <property>
> > >>   <name>fetcher.max.crawl.delay</name>
> > >>   <value>-1</value>
> > >>   <description>
> > >>   If the Crawl-Delay in robots.txt is set to greater than this value
> > >>   (in seconds) then the fetcher will skip this page, generating an
> > >>   error report. If set to -1 the fetcher will never skip such pages
> > >>   and will wait the amount of time retrieved from robots.txt
> > >>   Crawl-Delay, however long that might be.
> > >>   </description>
> > >> </property>
> > >>
> > >>
> > >> On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]>
> > wrote:
> > >>
> > >>> Hi Sebastian:
> > >>>
> > >>> One thing I noticed is that when I tested the robots.txt with
> > >>> RobotRulesParser, which is in org.apache.nutch.protocol, against the
> > >>> following URL:
> > >>>
> > >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> > >>>
> > >>> It gave me this message:
> > >>>
> > >>> 2014-06-02 18:27:16,949 WARN  robots.SimpleRobotRulesParser (
> > >>> SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing
> > >>> robots.txt for
> > >>> /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
> > >>>
> > >>> 2014-06-02 18:27:16,952 WARN  robots.SimpleRobotRulesParser (
> > >>> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>> robots.txt file (size 672): noindex: *natuzzi*
> > >>>
> > >>> 2014-06-02 18:27:16,952 WARN  robots.SimpleRobotRulesParser (
> > >>> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>> robots.txt file (size 672): noindex: *Natuzzi*
> > >>>
> > >>> 2014-06-02 18:27:16,954 WARN  robots.SimpleRobotRulesParser (
> > >>> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>> robots.txt file (size 672): noindex: *natuzzi*
> > >>>
> > >>> 2014-06-02 18:27:16,955 WARN  robots.SimpleRobotRulesParser (
> > >>> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>> robots.txt file (size 672): noindex: *Natuzzi*
> > >>>
> > >>> allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> > >>>
> > >>>
> > >>> This is in direct contradiction to what happened when I ran the
> > >>> crawl script with
> > >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> > >>> as my seed URL.
> > >>>
> > >>> I got this in my crawlDB
> > >>>
> > >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> > >>> Version: 7
> > >>> Status: 3 (db_gone)
> > >>> Fetch time: Thu Jul 17 18:05:47 PDT 2014
> > >>> Modified time: Wed Dec 31 16:00:00 PST 1969
> > >>> Retries since fetch: 0
> > >>> Retry interval: 3888000 seconds (45 days)
> > >>> Score: 1.0
> > >>> Signature: null
> > >>> Metadata:
> > >>>   _pst_=robots_denied(18), lastModified=0
> > >>>
> > >>>
> > >>> Is this a bug in crawler-commons 0.3? When you test the macys
> > >>> robots.txt file with RobotRulesParser it allows the URL, but when you
> > >>> run the same macys URL as a seed in the crawl script it is denied.
> > >>>
> > >>> On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <
> > >>> [email protected]> wrote:
> > >>>
> > >>>> Hi Luke, hi Nima,
> > >>>>
> > >>>>>     The Robot Exclusion Standard does not mention anything about
> > >>>>>     the "*" character in the Disallow: statement.
> > >>>> Indeed, the RFC draft [1] does not. However, since Google [2] does,
> > >>>> wildcard patterns are frequently used in robots.txt. With
> > >>>> crawler-commons 0.4 [3] these rules are also followed by Nutch
> > >>>> (to be in versions 1.9 and 2.3, respectively).
> > >>>>
> > >>>> But the error message is about the noindex lines:
> > >>>>  noindex: *natuzzi*
> > >>>> These lines are redundant (and also invalid, I suppose):
> > >>>> if a page/URL is disallowed, it's not fetched at all,
> > >>>> and will hardly slip into the index.
> > >>>> I think you can ignore the warning.
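
A quick way to check how crawler-commons treats the wildcard Disallow (and
to reproduce the noindex warning in the parser log) is to feed a cut-down
copy of the macys rules straight to the parser; a sketch assuming
crawler-commons 0.4 on the classpath and a made-up agent name:

  import crawlercommons.robots.BaseRobotRules;
  import crawlercommons.robots.SimpleRobotRulesParser;
  import java.nio.charset.StandardCharsets;

  public class WildcardRuleSketch {
      public static void main(String[] args) {
          // Cut-down copy of the macys rules quoted further down the thread.
          String robotsTxt = "User-agent: *\n"
                  + "Crawl-delay: 120\n"
                  + "Disallow: *natuzzi*\n"
                  + "noindex: *natuzzi*\n"; // unknown directive -> warning only

          BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                  "http://www1.macys.com/robots.txt",
                  robotsTxt.getBytes(StandardCharsets.UTF_8),
                  "text/plain", "test-crawler");

          // With 0.4 the wildcard Disallow should block the first URL,
          // while the product URL should still be allowed.
          System.out.println(rules.isAllowed("http://www1.macys.com/shop/natuzzi-sofa"));
          System.out.println(rules.isAllowed(
                  "http://www1.macys.com/shop/product/some-shirt"));
      }
  }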
> > >>>>
> > >>>>> One might also question the crawl-delay setting of 120 seconds,
> > >>>>> but that's another issue...
> > >>>> Yeah, it will take very long to crawl the site. With Nutch the
> > >>>> property "fetcher.max.crawl.delay" needs to be adjusted:
> > >>>>
> > >>>> <property>
> > >>>>   <name>fetcher.max.crawl.delay</name>
> > >>>>   <value>30</value>
> > >>>>   <description>
> > >>>>   If the Crawl-Delay in robots.txt is set to greater than this value
> > >>>>   (in seconds) then the fetcher will skip this page, generating an
> > >>>>   error report. If set to -1 the fetcher will never skip such pages
> > >>>>   and will wait the amount of time retrieved from robots.txt
> > >>>>   Crawl-Delay, however long that might be.
> > >>>>   </description>
> > >>>> </property>
> > >>>>
> > >>>> Cheers,
> > >>>> Sebastian
> > >>>>
> > >>>> [1] http://www.robotstxt.org/norobots-rfc.txt
> > >>>> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> > >>>> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
> > >>>>
> > >>>> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> > >>>>> From Wikipedia:
> > >>>>>     The Robot Exclusion Standard does not mention anything about
> > >>>>>     the "*" character in the Disallow: statement. Some crawlers
> > >>>>>     like Googlebot recognize strings containing "*", while MSNbot
> > >>>>>     and Teoma interpret it in different ways.
> > >>>>>
> > >>>>> So the 'problem' is with Macy's. Really, there is no problem for
> > >>>>> you: presumably that line is just ignored from robots.txt.
> > >>>>>
> > >>>>> One might also question the crawl-delay setting of 120 seconds,
> > >>>>> but that's another issue...
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On 31/05/2014 12:16 AM, Nima Falaki wrote:
> > >>>>>> Hello Everyone:
> > >>>>>>
> > >>>>>> I just have a question about an issue I discovered while trying
> > >>>>>> to crawl the macys robots.txt. I am using Nutch 1.8 and tried both
> > >>>>>> crawler-commons 0.3 and crawler-commons 0.4. This is the
> > >>>>>> robots.txt file from macys:
> > >>>>>>
> > >>>>>> User-agent: *
> > >>>>>> Crawl-delay: 120
> > >>>>>> Disallow: /compare
> > >>>>>> Disallow: /registry/wedding/compare
> > >>>>>> Disallow: /catalog/product/zoom.jsp
> > >>>>>> Disallow: /search
> > >>>>>> Disallow: /shop/search
> > >>>>>> Disallow: /shop/registry/wedding/search
> > >>>>>> Disallow: *natuzzi*
> > >>>>>> noindex: *natuzzi*
> > >>>>>> Disallow: *Natuzzi*
> > >>>>>> noindex: *Natuzzi*
> > >>>>>> Disallow:  /bag/add*
> > >>>>>>
> > >>>>>>
> > >>>>>> When I run this robots.txt through the RobotRulesParser with this
> > >>>>>> URL
> > >>>>>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> > >>>>>>
> > >>>>>> I get the following exceptions
> > >>>>>>
> > >>>>>> 2014-05-30 17:02:20,570 WARN  robots.SimpleRobotRulesParser
> > >>>>>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>>>>> robots.txt file (size 672): noindex: *natuzzi*
> > >>>>>>
> > >>>>>> 2014-05-30 17:02:20,571 WARN  robots.SimpleRobotRulesParser
> > >>>>>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>>>>> robots.txt file (size 672): noindex: *Natuzzi*
> > >>>>>>
> > >>>>>> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
> > >>>>>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>>>>> robots.txt file (size 672): noindex: *natuzzi*
> > >>>>>>
> > >>>>>> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
> > >>>>>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> > >>>>>> robots.txt file (size 672): noindex: *Natuzzi*
> > >>>>>>
> > >>>>>> Is there anything I can do to solve this problem? Is this a
> > >>>>>> problem with Nutch, or does macys.com have a really bad robots.txt
> > >>>>>> file?
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Nima Falaki
> > >>>>>> Software Engineer
> > >>>>>> [email protected]
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>>
> > >>>
> > >>> Nima Falaki
> > >>> Software Engineer
> > >>> [email protected]
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >>
> > >>
> > >>
> > >> Nima Falaki
> > >> Software Engineer
> > >> [email protected]
> > >>
> > >
> >
> >
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
