Re: Unable to crawl google search results

Yves S. Garret Tue, 04 Jun 2013 15:28:28 -0700

Noted.


On Tue, Jun 4, 2013 at 6:13 PM, Tejas Patil <[email protected]> wrote:

> Do you mean to turn off robots processing?
> See the comment by Andrzej at [0]:
>
> "The goal of Nutch is to implement a well-behaved crawler that obeys robot
> rules and netiquette. Your patch simply disables these control mechanisms.
> If it works for you and you can risk the wrath of webmasters, that's fine,
> you are free to use this patch - but Nutch as a project cannot encourage
> such practice."
>
> [0] : https://issues.apache.org/jira/browse/NUTCH-938
>
>
> On Tue, Jun 4, 2013 at 2:58 PM, Yves S. Garret
> <[email protected]> wrote:
>
> > One more question: is it ever a good idea to set the property
> > "protocol.plugin.check.robots" in nutch-site.xml to false?
> >
> >
> > On Tue, Jun 4, 2013 at 5:30 PM, Yves S. Garret
> > <[email protected]> wrote:
> >
> > > > Got another issue. When I run my crawler over Google search results, I see
> > > > _nothing_ in my HBase table... why?
> > >
> > > This is what I'm trying to crawl:
> > >
> > >
> >
> https://www.google.com/#output=search&sclient=psy-ab&q=xbox&oq=xbox&gs_l=hp.3..0l4.648.1180.0.1354.4.4.0.0.0.0.213.547.0j2j1.3.0...0.0...1c.1.15.psy-ab.jd107GllWZw&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47380653,d.eWU&fp=13d973d49a29d61d&biw=1280&bih=635
> > >
> > > Here are my logs:
> > > http://bin.cakephp.org/view/1619245280
> > >
> > > Here is my $NUTCH_HOME/conf/nutch-site.xml:
> > > http://bin.cakephp.org/view/1304119856
> > >
> > > And the output that I see when I run the crawler:
> > > http://bin.cakephp.org/view/260103467
> > >
> > > > In nutch-site.xml, I have all of the needed plugin.includes, I believe...
> > >
> >
>
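
As a rough illustration of the robots processing discussed in this thread,
here is a minimal Python sketch (not Nutch code) of the check a well-behaved
crawler makes before fetching a URL. The robots.txt content, the user agent
name "mybot", and the example.com URLs are all hypothetical and chosen only
for illustration; the sketch relies solely on the standard library's
urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "mybot" and the URLs below are placeholders, not real crawl targets.
    for url in (
        "https://example.com/search?q=xbox",
        "https://example.com/about",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "mybot", url))
```

Judging from the discussion above, setting protocol.plugin.check.robots to
false would presumably amount to skipping a check of this kind, which is the
practice the quoted comment warns against.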
