Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In any case, after digging further, I have found where it checks for
robots.txt. Thanks!

On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In the meantime, I have found that a better solution for now is to test on
a site that allows users to crawl it.



Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
I think you misunderstand; the argument was about stealing content. Sorry,
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> Let’s not get snarky right away, especially when you are wrong.
>
> Corporations do not generally ignore robots.txt. I worked on a commercial
> web spider for ten years. Occasionally, our customers did need to bypass
> portions of robots.txt. That was usually because of a poorly-maintained web
> server, or because our spider could safely crawl some content that would
> cause problems for other crawlers.
>
> If you want to learn crawling, don’t start by breaking the conventions of
> good web citizenship. Instead, start with sitemap.xml and crawl the
> preferred portions of a site.
>
> https://www.sitemaps.org/index.html
>
> If the site blocks you, find a different site to learn on.
>
> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> anything big, but I’d start with that for learning.
>
> https://scrapy.org/
>
> If you want to learn on a site with a lot of content, try ours, chegg.com.
> But if your crawler gets out of hand, crawling too fast, we’ll block it.
> Any other site will do the same.
>
> I would not base the crawler directly on Solr. A crawler needs a dedicated
> database to record the URLs visited, errors, duplicates, etc. The output of
> the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
> existed).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
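
For anyone following along, a rough sketch of the sitemap-first approach
suggested above: a Scrapy SitemapSpider that obeys robots.txt, crawls
slowly, and pushes what it extracts into Solr. The sitemap URL and user
agent are placeholders, and the Solr URL/collection simply mirror the
bin/post examples elsewhere in this thread; treat it as a starting point,
not a finished crawler.

    import requests
    from scrapy.spiders import SitemapSpider


    class SitemapToSolr(SitemapSpider):
        name = "sitemap_to_solr"
        # Placeholder sitemap; point this at a site that allows crawling.
        sitemap_urls = ["https://example.com/sitemap.xml"]

        custom_settings = {
            "ROBOTSTXT_OBEY": True,   # stay a good web citizen
            "DOWNLOAD_DELAY": 1.0,    # crawl slowly so the site doesn't block you
            "USER_AGENT": "learning-crawler (you@example.com)",  # placeholder
        }

        # Assumes a stock local Solr with the "gettingstarted" collection,
        # as in the bin/post commands elsewhere in this thread.
        solr_update = "http://localhost:8983/solr/gettingstarted/update"

        def parse(self, response):
            # A real crawler would record visited URLs, errors, and duplicates
            # in its own database, as described above; only the extracted
            # documents are sent to Solr.
            doc = {
                "id": response.url,
                "title": response.css("title::text").get(),
                "content": " ".join(response.css("body ::text").getall()),
            }
            requests.post(self.solr_update, json=[doc], params={"commit": "true"})
            yield doc

Run it with "scrapy runspider" against a site you are allowed to crawl;
yielding the doc also lets Scrapy's feed export write the same items to a
file if you would rather batch-load them into Solr later.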


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, as
well as, you know, the kind of inverted index Solr is built on; I just
thought Solr's architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:

> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
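
In code, "following robots.txt" mostly means asking the parsed file whether
your user agent may fetch a URL before requesting it. A small sketch with
Python's standard-library parser; the agent string below is a placeholder,
not whatever bin/post actually identifies itself as.

    from urllib import robotparser

    # Fetch and parse the site's robots.txt once, then consult it per URL.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    url = "https://en.wikipedia.org/wiki/Apache_Solr"
    if rp.can_fetch("learning-crawler", url):   # placeholder user agent
        print("robots.txt allows fetching", url)
    else:
        print("robots.txt disallows", url, "for this agent; skip it")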
>
> > On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
> >
> > Hello,
> >
> >   I was wondering if anyone could guide me on how to crawl the web and
> > ignore the robots.txt since I can not index some big sites. Or if someone
> > could point how to get around it. I read somewhere about a
> > protocol.plugin.check.robots
> > but that was for nutch.
> >
> > The way I index is
> > bin/post -c gettingstarted https://en.wikipedia.org/
> >
> > but I can't index the site I'm guessing because of the robots.txt.
> > I can index with
> > bin/post -c gettingstarted http://lucene.apache.org/solr
> >
> > which I am guessing allows it. I was also wondering how to find the name
> of
> > the crawler bin/post uses.
>


Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore the robots.txt, since I cannot index some big sites. Or if someone
could point out how to get around it. I read somewhere about a
protocol.plugin.check.robots setting, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index the site, I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.
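
On the last question (what user agent bin/post sends): one way to find out,
without digging through the source, is to point it at a throwaway local
server and print the headers it sends. This only reveals the agent string;
it says nothing about how the tool handles robots.txt. A minimal sketch:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HeaderEcho(BaseHTTPRequestHandler):
        # Log the crawler's User-Agent and return a trivial HTML page.
        def do_GET(self):
            print("User-Agent:", self.headers.get("User-Agent"))
            body = b"<html><head><title>hello</title></head><body>hi</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), HeaderEcho).serve_forever()

Then, in another terminal, run the same kind of command as above against it,
e.g.
bin/post -c gettingstarted http://localhost:8000/
and the agent name shows up in the server's output.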