Re: Solr Web Crawler - Robots.txt

2020-03-01 Thread Jan Høydahl
bin/post is not a crawler, just a small Java class that collects links from 
HTML pages using SolrCell. It respects very basic robots.txt rules but is far from 
the full spec. It is just a local prototyping tool, not meant for production use.
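
For reference, the class behind bin/post is SimplePostTool
(org.apache.solr.util.SimplePostTool), and its web mode is driven by a couple
of flags. A typical prototyping invocation looks something like this (the
collection name, depth and delay are only illustrative):

bin/post -c gettingstarted -recursive 1 -delay 2 http://lucene.apache.org/solr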

Jan Høydahl

> 1. mar. 2020 kl. 09:27 skrev Mutuhprasannth :
> 
> Have you found out the name of the crawler which is used by Solr bin/post or
> how to ignore robots.txt in Solr post tool
> 


Re: Solr Web Crawler - Robots.txt

2020-03-01 Thread Mutuhprasannth
Have you found out the name of the crawler used by Solr bin/post, or how to
ignore robots.txt in the Solr post tool?






Re: Solr Web Crawler - Robots.txt

2020-03-01 Thread Mutuhprasannth
Hi David Choi,

Have you found out the name of the crawler used by Solr bin/post?





Re: Solr Web Crawler - Robots.txt

2017-06-02 Thread Charlie Hull

On 02/06/2017 00:56, Doug Turnbull wrote:

> Scrapy is fantastic and I use it to scrape search results pages for clients
> to take quality snapshots for relevance work.

+1 for Scrapy; it was built by a team at Mydeco.com while we were
building their search backend and has gone from strength to strength since.

Cheers

Charlie

> Ignoring robots.txt sometimes legitimately comes up because a staging site
> might be telling Google not to crawl, but nobody cares about a developer
> crawling for internal purposes.
>
> Doug




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Nutch was built for that, but it is a pain to use. I’m still sad that I 
couldn’t get Mike Lynch to open source Ultraseek; it was so easy to use and much 
more powerful than Nutch.

Ignoring robots.txt is often a bad idea. You may get into a REST API or into a 
calendar that generates an unending number of valid, different pages. Or the 
combinatorial explosion of diffs between revisions of a wiki page. Those are 
really fun.

There are some web servers that put a session ID in the path, so you get an 
endless set of URLs for the exact same page. We called those “black holes” 
because they sucked spiders in and never let them out.

The comments in the Wikipedia robots.txt are instructive. For example, they 
allow access to the documentation for the REST API (Allow: /api/rest_v1/?doc), 
then disallow the other paths in the API (Disallow: /api).

https://en.wikipedia.org/robots.txt 
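
Boiled down, the relevant pattern looks something like this (simplified; the
real file has many more rules and comments):

User-agent: *
Allow: /api/rest_v1/?doc
Disallow: /api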

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 4:58 PM, Mike Drob  wrote:
> 
> Isn't this exactly what Apache Nutch was built for?
> 
> On Thu, Jun 1, 2017 at 6:56 PM, David Choi  wrote:
> 
>> In any case after digging further I have found where it checks for
>> robots.txt. Thanks!
>> 
>> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
>> wrote:
>> 
>>> Which was exactly what I suggested.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
 
 In the mean time I have found a better solution at the moment is to
>> test
>>> on
 a site that allows users to crawl their site.
 
 On Thu, Jun 1, 2017 at 5:26 PM David Choi 
>>> wrote:
 
> I think you misunderstand the argument was about stealing content.
>> Sorry
> but I think you need to read what people write before making bold
> statements.
> 
> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
>> wun...@wunderwood.org>
> wrote:
> 
>> Let’s not get snarky right away, especially when you are wrong.
>> 
>> Corporations do not generally ignore robots.txt. I worked on a
>>> commercial
>> web spider for ten years. Occasionally, our customers did need to
>>> bypass
>> portions of robots.txt. That was usually because of a
>>> poorly-maintained web
>> server, or because our spider could safely crawl some content that
>>> would
>> cause problems for other crawlers.
>> 
>> If you want to learn crawling, don’t start by breaking the
>> conventions
>>> of
>> good web citizenship. Instead, start with sitemap.xml and crawl the
>> preferred portions of a site.
>> 
>> https://www.sitemaps.org/index.html <
>>> https://www.sitemaps.org/index.html>
>> 
>> If the site blocks you, find a different site to learn on.
>> 
>> I like the looks of “Scrapy”, written in Python. I haven’t used it
>> for
>> anything big, but I’d start with that for learning.
>> 
>> https://scrapy.org/ 
>> 
>> If you want to learn on a site with a lot of content, try ours,
>>> chegg.com
>> But if your crawler gets out of hand, crawling too fast, we’ll block
>>> it.
>> Any other site will do the same.
>> 
>> I would not base the crawler directly on Solr. A crawler needs a
>> dedicated database to record the URLs visited, errors, duplicates,
>>> etc. The
>> output of the crawl goes to Solr. That is how we did it with
>> Ultraseek
>> (before Solr existed).
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jun 1, 2017, at 3:01 PM, David Choi 
>>> wrote:
>>> 
>>> Oh well I guess its ok if a corporation does it but not someone
>>> wanting
>> to
>>> learn more about the field. I actually have written a crawler before
>>> as
>>> well as the you know Inverted Index of how solr works but I just
>>> thought
>>> its architecture was better suited for scaling.
>>> 
>>> On Thu, Jun 1, 2017 at 4:47 PM Dave 
>> wrote:
>>> 
 And I mean that in the context of stealing content from sites that
 explicitly declare they don't want to be crawled. Robots.txt is to
>> be
 followed.
 
> On Jun 1, 2017, at 5:31 PM, David Choi 
>> wrote:
> 
> Hello,
> 
> I was wondering if anyone could guide me on how to crawl the web
>> and
> ignore the robots.txt since I can not index some big sites. Or if
>> someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Mike Drob
Isn't this exactly what Apache Nutch was built for?

On Thu, Jun 1, 2017 at 6:56 PM, David Choi  wrote:

> In any case after digging further I have found where it checks for
> robots.txt. Thanks!
>
> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
> wrote:
>
> > Which was exactly what I suggested.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> > >
> > > In the mean time I have found a better solution at the moment is to
> test
> > on
> > > a site that allows users to crawl their site.
> > >
> > > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> > wrote:
> > >
> > >> I think you misunderstand the argument was about stealing content.
> Sorry
> > >> but I think you need to read what people write before making bold
> > >> statements.
> > >>
> > >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
> wun...@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Let’s not get snarky right away, especially when you are wrong.
> > >>>
> > >>> Corporations do not generally ignore robots.txt. I worked on a
> > commercial
> > >>> web spider for ten years. Occasionally, our customers did need to
> > bypass
> > >>> portions of robots.txt. That was usually because of a
> > poorly-maintained web
> > >>> server, or because our spider could safely crawl some content that
> > would
> > >>> cause problems for other crawlers.
> > >>>
> > >>> If you want to learn crawling, don’t start by breaking the
> conventions
> > of
> > >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> > >>> preferred portions of a site.
> > >>>
> > >>> https://www.sitemaps.org/index.html <
> > https://www.sitemaps.org/index.html>
> > >>>
> > >>> If the site blocks you, find a different site to learn on.
> > >>>
> > >>> I like the looks of “Scrapy”, written in Python. I haven’t used it
> for
> > >>> anything big, but I’d start with that for learning.
> > >>>
> > >>> https://scrapy.org/ 
> > >>>
> > >>> If you want to learn on a site with a lot of content, try ours,
> > chegg.com
> > >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> > it.
> > >>> Any other site will do the same.
> > >>>
> > >>> I would not base the crawler directly on Solr. A crawler needs a
> > >>> dedicated database to record the URLs visited, errors, duplicates,
> > etc. The
> > >>> output of the crawl goes to Solr. That is how we did it with
> Ultraseek
> > >>> (before Solr existed).
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>
> >  On Jun 1, 2017, at 3:01 PM, David Choi 
> > wrote:
> > 
> >  Oh well I guess its ok if a corporation does it but not someone
> > wanting
> > >>> to
> >  learn more about the field. I actually have written a crawler before
> > as
> >  well as the you know Inverted Index of how solr works but I just
> > thought
> >  its architecture was better suited for scaling.
> > 
> >  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> > >>> wrote:
> > 
> > > And I mean that in the context of stealing content from sites that
> > > explicitly declare they don't want to be crawled. Robots.txt is to
> be
> > > followed.
> > >
> > >> On Jun 1, 2017, at 5:31 PM, David Choi 
> > >>> wrote:
> > >>
> > >> Hello,
> > >>
> > >> I was wondering if anyone could guide me on how to crawl the web
> and
> > >> ignore the robots.txt since I can not index some big sites. Or if
> > >>> someone
> > >> could point how to get around it. I read somewhere about a
> > >> protocol.plugin.check.robots
> > >> but that was for nutch.
> > >>
> > >> The way I index is
> > >> bin/post -c gettingstarted https://en.wikipedia.org/
> > >>
> > >> but I can't index the site I'm guessing because of the robots.txt.
> > >> I can index with
> > >> bin/post -c gettingstarted http://lucene.apache.org/solr
> > >>
> > >> which I am guessing allows it. I was also wondering how to find
> the
> > >>> name
> > > of
> > >> the crawler bin/post uses.
> > >
> > >>>
> > >>>
> >
> >
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In any case, after digging further I have found where it checks for
robots.txt. Thanks!

On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> >
> > In the mean time I have found a better solution at the moment is to test
> on
> > a site that allows users to crawl their site.
> >
> > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> wrote:
> >
> >> I think you misunderstand the argument was about stealing content. Sorry
> >> but I think you need to read what people write before making bold
> >> statements.
> >>
> >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Let’s not get snarky right away, especially when you are wrong.
> >>>
> >>> Corporations do not generally ignore robots.txt. I worked on a
> commercial
> >>> web spider for ten years. Occasionally, our customers did need to
> bypass
> >>> portions of robots.txt. That was usually because of a
> poorly-maintained web
> >>> server, or because our spider could safely crawl some content that
> would
> >>> cause problems for other crawlers.
> >>>
> >>> If you want to learn crawling, don’t start by breaking the conventions
> of
> >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> >>> preferred portions of a site.
> >>>
> >>> https://www.sitemaps.org/index.html <
> https://www.sitemaps.org/index.html>
> >>>
> >>> If the site blocks you, find a different site to learn on.
> >>>
> >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> >>> anything big, but I’d start with that for learning.
> >>>
> >>> https://scrapy.org/ 
> >>>
> >>> If you want to learn on a site with a lot of content, try ours,
> chegg.com
> >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> it.
> >>> Any other site will do the same.
> >>>
> >>> I would not base the crawler directly on Solr. A crawler needs a
> >>> dedicated database to record the URLs visited, errors, duplicates,
> etc. The
> >>> output of the crawl goes to Solr. That is how we did it with Ultraseek
> >>> (before Solr existed).
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
>  On Jun 1, 2017, at 3:01 PM, David Choi 
> wrote:
> 
>  Oh well I guess its ok if a corporation does it but not someone
> wanting
> >>> to
>  learn more about the field. I actually have written a crawler before
> as
>  well as the you know Inverted Index of how solr works but I just
> thought
>  its architecture was better suited for scaling.
> 
>  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> >>> wrote:
> 
> > And I mean that in the context of stealing content from sites that
> > explicitly declare they don't want to be crawled. Robots.txt is to be
> > followed.
> >
> >> On Jun 1, 2017, at 5:31 PM, David Choi 
> >>> wrote:
> >>
> >> Hello,
> >>
> >> I was wondering if anyone could guide me on how to crawl the web and
> >> ignore the robots.txt since I can not index some big sites. Or if
> >>> someone
> >> could point how to get around it. I read somewhere about a
> >> protocol.plugin.check.robots
> >> but that was for nutch.
> >>
> >> The way I index is
> >> bin/post -c gettingstarted https://en.wikipedia.org/
> >>
> >> but I can't index the site I'm guessing because of the robots.txt.
> >> I can index with
> >> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>
> >> which I am guessing allows it. I was also wondering how to find the
> >>> name
> > of
> >> the crawler bin/post uses.
> >
> >>>
> >>>
>
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Doug Turnbull
Scrapy is fantastic and I use it to scrape search results pages for clients to
take quality snapshots for relevance work.
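
A minimal spider along those lines might look like this (the URL and CSS
selectors are made up for illustration; ROBOTSTXT_OBEY is on by default in a
Scrapy project and stays on here):

import scrapy

class SearchSnapshotSpider(scrapy.Spider):
    name = "search_snapshot"
    custom_settings = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 1.0}
    start_urls = ["https://www.example.com/search?q=solr"]

    def parse(self, response):
        # Capture rank, title and URL of each result for later relevance review.
        for rank, result in enumerate(response.css("div.result"), start=1):
            yield {
                "rank": rank,
                "title": result.css("a::text").get(),
                "url": result.css("a::attr(href)").get(),
            }

Running it with "scrapy runspider snapshot_spider.py -o snapshot.json" dumps
the captured results to JSON.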

Ignoring robots.txt sometimes legitimately comes up because a staging site
might be telling Google not to crawl, but nobody cares about a developer
crawling for internal purposes.

Doug
On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood 
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> >
> > In the mean time I have found a better solution at the moment is to test
> on
> > a site that allows users to crawl their site.
> >
> > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> wrote:
> >
> >> I think you misunderstand the argument was about stealing content. Sorry
> >> but I think you need to read what people write before making bold
> >> statements.
> >>
> >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Let’s not get snarky right away, especially when you are wrong.
> >>>
> >>> Corporations do not generally ignore robots.txt. I worked on a
> commercial
> >>> web spider for ten years. Occasionally, our customers did need to
> bypass
> >>> portions of robots.txt. That was usually because of a
> poorly-maintained web
> >>> server, or because our spider could safely crawl some content that
> would
> >>> cause problems for other crawlers.
> >>>
> >>> If you want to learn crawling, don’t start by breaking the conventions
> of
> >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> >>> preferred portions of a site.
> >>>
> >>> https://www.sitemaps.org/index.html <
> https://www.sitemaps.org/index.html>
> >>>
> >>> If the site blocks you, find a different site to learn on.
> >>>
> >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> >>> anything big, but I’d start with that for learning.
> >>>
> >>> https://scrapy.org/ 
> >>>
> >>> If you want to learn on a site with a lot of content, try ours,
> chegg.com
> >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> it.
> >>> Any other site will do the same.
> >>>
> >>> I would not base the crawler directly on Solr. A crawler needs a
> >>> dedicated database to record the URLs visited, errors, duplicates,
> etc. The
> >>> output of the crawl goes to Solr. That is how we did it with Ultraseek
> >>> (before Solr existed).
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
>  On Jun 1, 2017, at 3:01 PM, David Choi 
> wrote:
> 
>  Oh well I guess its ok if a corporation does it but not someone
> wanting
> >>> to
>  learn more about the field. I actually have written a crawler before
> as
>  well as the you know Inverted Index of how solr works but I just
> thought
>  its architecture was better suited for scaling.
> 
>  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> >>> wrote:
> 
> > And I mean that in the context of stealing content from sites that
> > explicitly declare they don't want to be crawled. Robots.txt is to be
> > followed.
> >
> >> On Jun 1, 2017, at 5:31 PM, David Choi 
> >>> wrote:
> >>
> >> Hello,
> >>
> >> I was wondering if anyone could guide me on how to crawl the web and
> >> ignore the robots.txt since I can not index some big sites. Or if
> >>> someone
> >> could point how to get around it. I read somewhere about a
> >> protocol.plugin.check.robots
> >> but that was for nutch.
> >>
> >> The way I index is
> >> bin/post -c gettingstarted https://en.wikipedia.org/
> >>
> >> but I can't index the site I'm guessing because of the robots.txt.
> >> I can index with
> >> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>
> >> which I am guessing allows it. I was also wondering how to find the
> >>> name
> > of
> >> the crawler bin/post uses.
> >
> >>>
> >>>
>
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> 
> In the mean time I have found a better solution at the moment is to test on
> a site that allows users to crawl their site.
> 
> On Thu, Jun 1, 2017 at 5:26 PM David Choi  wrote:
> 
>> I think you misunderstand the argument was about stealing content. Sorry
>> but I think you need to read what people write before making bold
>> statements.
>> 
>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
>> wrote:
>> 
>>> Let’s not get snarky right away, especially when you are wrong.
>>> 
>>> Corporations do not generally ignore robots.txt. I worked on a commercial
>>> web spider for ten years. Occasionally, our customers did need to bypass
>>> portions of robots.txt. That was usually because of a poorly-maintained web
>>> server, or because our spider could safely crawl some content that would
>>> cause problems for other crawlers.
>>> 
>>> If you want to learn crawling, don’t start by breaking the conventions of
>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>> preferred portions of a site.
>>> 
>>> https://www.sitemaps.org/index.html 
>>> 
>>> If the site blocks you, find a different site to learn on.
>>> 
>>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>>> anything big, but I’d start with that for learning.
>>> 
>>> https://scrapy.org/ 
>>> 
>>> If you want to learn on a site with a lot of content, try ours, chegg.com
>>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
>>> Any other site will do the same.
>>> 
>>> I would not base the crawler directly on Solr. A crawler needs a
>>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>>> (before Solr existed).
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
 
 Oh well I guess its ok if a corporation does it but not someone wanting
>>> to
 learn more about the field. I actually have written a crawler before as
 well as the you know Inverted Index of how solr works but I just thought
 its architecture was better suited for scaling.
 
 On Thu, Jun 1, 2017 at 4:47 PM Dave 
>>> wrote:
 
> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
> 
>> On Jun 1, 2017, at 5:31 PM, David Choi 
>>> wrote:
>> 
>> Hello,
>> 
>> I was wondering if anyone could guide me on how to crawl the web and
>> ignore the robots.txt since I can not index some big sites. Or if
>>> someone
>> could point how to get around it. I read somewhere about a
>> protocol.plugin.check.robots
>> but that was for nutch.
>> 
>> The way I index is
>> bin/post -c gettingstarted https://en.wikipedia.org/
>> 
>> but I can't index the site I'm guessing because of the robots.txt.
>> I can index with
>> bin/post -c gettingstarted http://lucene.apache.org/solr
>> 
>> which I am guessing allows it. I was also wondering how to find the
>>> name
> of
>> the crawler bin/post uses.
> 
>>> 
>>> 



Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In the meantime I have found that a better solution for the moment is to test
on a site that allows users to crawl it.

On Thu, Jun 1, 2017 at 5:26 PM David Choi  wrote:

> I think you misunderstand the argument was about stealing content. Sorry
> but I think you need to read what people write before making bold
> statements.
>
> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
> wrote:
>
>> Let’s not get snarky right away, especially when you are wrong.
>>
>> Corporations do not generally ignore robots.txt. I worked on a commercial
>> web spider for ten years. Occasionally, our customers did need to bypass
>> portions of robots.txt. That was usually because of a poorly-maintained web
>> server, or because our spider could safely crawl some content that would
>> cause problems for other crawlers.
>>
>> If you want to learn crawling, don’t start by breaking the conventions of
>> good web citizenship. Instead, start with sitemap.xml and crawl the
>> preferred portions of a site.
>>
>> https://www.sitemaps.org/index.html 
>>
>> If the site blocks you, find a different site to learn on.
>>
>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>> anything big, but I’d start with that for learning.
>>
>> https://scrapy.org/ 
>>
>> If you want to learn on a site with a lot of content, try ours, chegg.com
>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
>> Any other site will do the same.
>>
>> I would not base the crawler directly on Solr. A crawler needs a
>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>> (before Solr existed).
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
>> >
>> > Oh well I guess its ok if a corporation does it but not someone wanting
>> to
>> > learn more about the field. I actually have written a crawler before as
>> > well as the you know Inverted Index of how solr works but I just thought
>> > its architecture was better suited for scaling.
>> >
>> > On Thu, Jun 1, 2017 at 4:47 PM Dave 
>> wrote:
>> >
>> >> And I mean that in the context of stealing content from sites that
>> >> explicitly declare they don't want to be crawled. Robots.txt is to be
>> >> followed.
>> >>
>> >>> On Jun 1, 2017, at 5:31 PM, David Choi 
>> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>>  I was wondering if anyone could guide me on how to crawl the web and
>> >>> ignore the robots.txt since I can not index some big sites. Or if
>> someone
>> >>> could point how to get around it. I read somewhere about a
>> >>> protocol.plugin.check.robots
>> >>> but that was for nutch.
>> >>>
>> >>> The way I index is
>> >>> bin/post -c gettingstarted https://en.wikipedia.org/
>> >>>
>> >>> but I can't index the site I'm guessing because of the robots.txt.
>> >>> I can index with
>> >>> bin/post -c gettingstarted http://lucene.apache.org/solr
>> >>>
>> >>> which I am guessing allows it. I was also wondering how to find the
>> name
>> >> of
>> >>> the crawler bin/post uses.
>> >>
>>
>>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
I think you misunderstand; the argument was about stealing content. Sorry,
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
wrote:

> Let’s not get snarky right away, especially when you are wrong.
>
> Corporations do not generally ignore robots.txt. I worked on a commercial
> web spider for ten years. Occasionally, our customers did need to bypass
> portions of robots.txt. That was usually because of a poorly-maintained web
> server, or because our spider could safely crawl some content that would
> cause problems for other crawlers.
>
> If you want to learn crawling, don’t start by breaking the conventions of
> good web citizenship. Instead, start with sitemap.xml and crawl the
> preferred portions of a site.
>
> https://www.sitemaps.org/index.html 
>
> If the site blocks you, find a different site to learn on.
>
> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> anything big, but I’d start with that for learning.
>
> https://scrapy.org/ 
>
> If you want to learn on a site with a lot of content, try ours, chegg.com
> But if your crawler gets out of hand, crawling too fast, we’ll block it.
> Any other site will do the same.
>
> I would not base the crawler directly on Solr. A crawler needs a dedicated
> database to record the URLs visited, errors, duplicates, etc. The output of
> the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
> existed).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
> >
> > Oh well I guess its ok if a corporation does it but not someone wanting
> to
> > learn more about the field. I actually have written a crawler before as
> > well as the you know Inverted Index of how solr works but I just thought
> > its architecture was better suited for scaling.
> >
> > On Thu, Jun 1, 2017 at 4:47 PM Dave 
> wrote:
> >
> >> And I mean that in the context of stealing content from sites that
> >> explicitly declare they don't want to be crawled. Robots.txt is to be
> >> followed.
> >>
> >>> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> >>>
> >>> Hello,
> >>>
> >>>  I was wondering if anyone could guide me on how to crawl the web and
> >>> ignore the robots.txt since I can not index some big sites. Or if
> someone
> >>> could point how to get around it. I read somewhere about a
> >>> protocol.plugin.check.robots
> >>> but that was for nutch.
> >>>
> >>> The way I index is
> >>> bin/post -c gettingstarted https://en.wikipedia.org/
> >>>
> >>> but I can't index the site I'm guessing because of the robots.txt.
> >>> I can index with
> >>> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>>
> >>> which I am guessing allows it. I was also wondering how to find the
> name
> >> of
> >>> the crawler bin/post uses.
> >>
>
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a commercial web 
spider for ten years. Occasionally, our customers did need to bypass portions 
of robots.txt. That was usually because of a poorly-maintained web server, or 
because our spider could safely crawl some content that would cause problems 
for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions of good 
web citizenship. Instead, start with sitemap.xml and crawl the preferred 
portions of a site.

https://www.sitemaps.org/index.html 
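
A sitemap-first fetch is only a few lines of Python; a rough sketch (the
sitemap URL is just a placeholder, and real sites may also use sitemap index
files):

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    # Fetch the sitemap and return the page URLs listed in its <loc> entries.
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]

if __name__ == "__main__":
    for url in sitemap_urls("https://www.example.com/sitemap.xml"):
        print(url)

Each URL it returns can then be fetched and handed to whatever does the
indexing.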

If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for anything 
big, but I’d start with that for learning.

https://scrapy.org/ 

If you want to learn on a site with a lot of content, try ours, chegg.com. But 
if your crawler gets out of hand, crawling too fast, we’ll block it. Any other 
site will do the same.

I would not base the crawler directly on Solr. A crawler needs a dedicated 
database to record the URLs visited, errors, duplicates, etc. The output of the 
crawl goes to Solr. That is how we did it with Ultraseek (before Solr existed).
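
A rough sketch of that split, with SQLite standing in for the crawl database
and the stock JSON update endpoint of the "gettingstarted" collection used
earlier in this thread (both are only placeholders):

import json
import sqlite3
import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/gettingstarted/update?commit=true"

# The crawler's own state lives here, not in Solr: what was fetched,
# what failed, what turned out to be a duplicate.
db = sqlite3.connect("crawl_state.db")
db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, status TEXT)")

def record(url, status):
    db.execute("INSERT OR REPLACE INTO urls (url, status) VALUES (?, ?)", (url, status))
    db.commit()

def index_into_solr(doc):
    # Only the finished document is sent to Solr.
    req = urllib.request.Request(
        SOLR_UPDATE,
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

record("https://www.example.com/", "fetched")
index_into_solr({"id": "https://www.example.com/", "title": "Example", "text": "..."})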

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
> 
> Oh well I guess its ok if a corporation does it but not someone wanting to
> learn more about the field. I actually have written a crawler before as
> well as the you know Inverted Index of how solr works but I just thought
> its architecture was better suited for scaling.
> 
> On Thu, Jun 1, 2017 at 4:47 PM Dave  wrote:
> 
>> And I mean that in the context of stealing content from sites that
>> explicitly declare they don't want to be crawled. Robots.txt is to be
>> followed.
>> 
>>> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
>>> 
>>> Hello,
>>> 
>>>  I was wondering if anyone could guide me on how to crawl the web and
>>> ignore the robots.txt since I can not index some big sites. Or if someone
>>> could point how to get around it. I read somewhere about a
>>> protocol.plugin.check.robots
>>> but that was for nutch.
>>> 
>>> The way I index is
>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>> 
>>> but I can't index the site I'm guessing because of the robots.txt.
>>> I can index with
>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>> 
>>> which I am guessing allows it. I was also wondering how to find the name
>> of
>>> the crawler bin/post uses.
>> 



Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, as well
as, you know, an inverted index like the one Solr uses, but I just thought
Solr's architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave  wrote:

> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
>
> > On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> >
> > Hello,
> >
> >   I was wondering if anyone could guide me on how to crawl the web and
> > ignore the robots.txt since I can not index some big sites. Or if someone
> > could point how to get around it. I read somewhere about a
> > protocol.plugin.check.robots
> > but that was for nutch.
> >
> > The way I index is
> > bin/post -c gettingstarted https://en.wikipedia.org/
> >
> > but I can't index the site I'm guessing because of the robots.txt.
> > I can index with
> > bin/post -c gettingstarted http://lucene.apache.org/solr
> >
> > which I am guessing allows it. I was also wondering how to find the name
> of
> > the crawler bin/post uses.
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
And I mean that in the context of stealing content from sites that explicitly 
declare they don't want to be crawled. Robots.txt is to be followed. 

> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Vivek Pathak
I can help.  We can chat in some freenode chatroom in an hour or so.  
Let me know where you hang out.


Thanks

Vivek


On 6/1/17 5:45 PM, Dave wrote:
> If you are not capable of even writing your own indexing code, let alone
> crawler, I would prefer that you just stop now.  No one is going to help you
> with this request, at least I'd hope not.
>
>> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
>>
>> Hello,
>>
>> I was wondering if anyone could guide me on how to crawl the web and
>> ignore the robots.txt since I can not index some big sites. Or if someone
>> could point how to get around it. I read somewhere about a
>> protocol.plugin.check.robots
>> but that was for nutch.
>>
>> The way I index is
>> bin/post -c gettingstarted https://en.wikipedia.org/
>>
>> but I can't index the site I'm guessing because of the robots.txt.
>> I can index with
>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>
>> which I am guessing allows it. I was also wondering how to find the name of
>> the crawler bin/post uses.




Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
If you are not capable of even writing your own indexing code, let alone 
crawler, I would prefer that you just stop now.  No one is going to help you 
with this request, at least I'd hope not. 

> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.


Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore robots.txt, since I cannot index some big sites. Or if someone could
point out how to get around it. I read somewhere about
protocol.plugin.check.robots, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index that site, I'm guessing because of its robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.