Re: Solr Web Crawler - Robots.txt
In any case, after digging further I have found where it checks for robots.txt. Thanks!

On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wun...@wunderwood.org> wrote:
> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
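For reference, the kind of robots.txt check that a polite crawler performs before each fetch can be sketched with Python's standard-library `urllib.robotparser`. The rules and the bot name `mybot` below are made up for illustration; a real crawler would download `<site>/robots.txt` instead of parsing an inline string:

```python
# Sketch of a per-URL robots.txt check using only the standard library.
# The robots rules here are an inline example, not any real site's file.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler asks before every fetch:
print(rp.can_fetch("mybot", "https://example.com/index.html"))  # True
print(rp.can_fetch("mybot", "https://example.com/private/x"))   # False
```

This is the convention the thread is about: the check keys off the crawler's declared User-Agent, which is why knowing what name a crawler sends matters.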
Re: Solr Web Crawler - Robots.txt
In the meantime, I have found that a better solution for now is to test on a site that allows users to crawl it.

On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com> wrote:
> I think you misunderstand; the argument was about stealing content.
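The advice in this thread about keeping crawl state in a dedicated database rather than in Solr can be sketched with a minimal SQLite-backed frontier. The schema and function names here are illustrative, not from any real crawler:

```python
# Minimal sketch of a crawl frontier backed by SQLite: the crawler
# records every URL it has seen, so it never fetches the same page
# twice, and it keeps fetch errors for later inspection.
import sqlite3

def open_frontier(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      url    TEXT PRIMARY KEY,
                      status TEXT NOT NULL DEFAULT 'pending',
                      error  TEXT)""")
    return db

def enqueue(db, url):
    # INSERT OR IGNORE deduplicates: a URL already recorded is skipped.
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

def next_pending(db):
    row = db.execute(
        "SELECT url FROM urls WHERE status = 'pending' LIMIT 1").fetchone()
    return row[0] if row else None

def mark(db, url, status, error=None):
    db.execute("UPDATE urls SET status = ?, error = ? WHERE url = ?",
               (status, error, url))
```

In this design only the fetched documents go to Solr (e.g. via an update request after each successful fetch); the frontier stays outside it, which matches how crawlers like Ultraseek were described as working.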
Re: Solr Web Crawler - Robots.txt
I think you misunderstand; the argument was about stealing content. Sorry, but I think you need to read what people write before making bold statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org> wrote:
> Let’s not get snarky right away, especially when you are wrong.
>
> Corporations do not generally ignore robots.txt. I worked on a commercial web spider for ten years. Occasionally, our customers did need to bypass portions of robots.txt. That was usually because of a poorly-maintained web server, or because our spider could safely crawl some content that would cause problems for other crawlers.
>
> If you want to learn crawling, don’t start by breaking the conventions of good web citizenship. Instead, start with sitemap.xml and crawl the preferred portions of a site.
>
> https://www.sitemaps.org/index.html
>
> If the site blocks you, find a different site to learn on.
>
> I like the looks of “Scrapy”, written in Python. I haven’t used it for anything big, but I’d start with that for learning.
>
> https://scrapy.org/
>
> If you want to learn on a site with a lot of content, try ours, chegg.com. But if your crawler gets out of hand, crawling too fast, we’ll block it. Any other site will do the same.
>
> I would not base the crawler directly on Solr. A crawler needs a dedicated database to record the URLs visited, errors, duplicates, etc. The output of the crawl goes to Solr. That is how we did it with Ultraseek (before Solr existed).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
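The suggestion upthread to start from sitemap.xml can be sketched in a few lines of standard-library Python. The XML here is an inline sample; a real crawler would first fetch `https://<site>/sitemap.xml` (and the namespace below is the one defined at sitemaps.org):

```python
# Sketch of extracting seed URLs from a sitemap.xml document.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/</loc></url>
</urlset>"""

def sitemap_urls(xml_text):
    """Return every <loc> URL listed in the sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]

print(sitemap_urls(sitemap_xml))
# ['https://example.com/', 'https://example.com/docs/']
```

Crawling only what the sitemap lists keeps the crawler on the portions of the site the operator wants indexed, which is the point of the advice.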
Re: Solr Web Crawler - Robots.txt
Oh well, I guess it's OK if a corporation does it but not someone wanting to learn more about the field. I have actually written a crawler before, as well as, you know, the inverted index of how Solr works; I just thought its architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:
> And I mean that in the context of stealing content from sites that explicitly declare they don't want to be crawled. Robots.txt is to be followed.
Solr Web Crawler - Robots.txt
Hello,

I was wondering if anyone could guide me on how to crawl the web and ignore robots.txt, since I can not index some big sites, or if someone could point out how to get around it. I read somewhere about a protocol.plugin.check.robots setting, but that was for Nutch.

The way I index is

bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index the site, I'm guessing because of its robots.txt. I can index with

bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of the crawler bin/post uses.
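One empirical way to answer the last question (what User-Agent any crawler, including bin/post, sends) without digging through its source: run a tiny local HTTP server that records the User-Agent header of each request, then point the crawler at it. This is a standard-library sketch; the port is arbitrary and nothing here assumes anything about bin/post itself:

```python
# Tiny HTTP server that records the User-Agent of every request.
# Start it, point a crawler at http://127.0.0.1:8000/, then inspect
# the seen_agents list to learn what name the crawler sends.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_agents = []

class UAEchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        seen_agents.append(self.headers.get("User-Agent", ""))
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request console logging

def run_once(port=8000):
    """Start the server on a background thread and return it."""
    server = HTTPServer(("127.0.0.1", port), UAEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

That recorded name is also what you would reference in a robots.txt `User-agent:` line when deciding what a given crawler is allowed to fetch.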