subject:"Nutch running time"

RE: Nutch running time

2015-01-03 Thread Chaushu, Shani

Hi,
My Nutch version is 1.9.
Hadoop is on CDH 5.2; I think it's Hadoop 2.3.
What changes did you make?

Thanks,
Shani


Re: Nutch running time

2015-01-03 Thread Meraj A. Khan

Shani,

What is your Nutch version, and which Hadoop version are you using? I
was able to get this running using Nutch 1.7 on Hadoop YARN, for which
I needed to make minor tweaks in the code.


RE: Nutch running time

2015-01-02 Thread Chaushu, Shani

I'm running Nutch distributed, on 3 nodes.
I thought there was more configuration that I had missed.


Re: Nutch running time

2015-01-01 Thread S.L

You need to run Nutch as a MapReduce job/application on Hadoop. There is
a lot of info on the wiki about running it in distributed mode, but if you
can live with pseudo-distributed/local mode for the 20K pages that you
need to fetch, it would save you a lot of work.


RE: Nutch running time

2015-01-01 Thread Chaushu, Shani

How can I configure the number of map and reduce tasks? Which parameter is it?
Will more map and reduce tasks make it slower or faster?

Thanks


Re: Nutch running time

2015-01-01 Thread Meraj A. Khan

That seems kind of slow for 20k links. How many map and reduce tasks have
you configured for each of the phases in a Nutch crawl?

Nutch running time

2015-01-01 Thread Chaushu, Shani

Hi all,
I wanted to know how long Nutch should run.
I changed the configuration and ran it distributed, with one master node and 3
slaves, and it ran for about a day on 20k links (depth 15).
Is that normal, or should it take less?
This is my configuration:


<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>

<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>150</value>
  <description>(EXPERT) The fetcher buffers the incoming URLs into queues based
  on the [host|domain|IP] (see param fetcher.queue.mode). The depth of the queue
  is the number of threads times the value of this parameter.
  A large value requires more memory but can improve the performance of the
  fetch when the order of the URLs in the fetch list is not optimal.
  </description>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a queue at one time. Setting it to
  a value > 1 will cause the Crawl-Delay value from robots.txt to
  be ignored and the value of fetcher.server.min.delay to be used
  as a delay between successive requests to the same server instead
  of fetcher.server.delay.
  </description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).
  </description>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>5</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>
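
As a rough reading of the property descriptions above, the posted values imply the figures sketched below. This is illustrative Python arithmetic, not Nutch code: the helper names are hypothetical, the node count of 3 is taken from the message, and the host-queue count in the last example is an assumption, since the thread does not say how many hosts were injected.

```python
# Illustrative arithmetic only; these helpers are hypothetical, not part of Nutch.


def total_fetcher_threads(threads_per_task: int, nodes: int) -> int:
    """Threads across the cluster, assuming one fetcher map task per node
    (as the fetcher.threads.fetch description states)."""
    return threads_per_task * nodes


def queue_depth(threads_per_task: int, depth_multiplier: int) -> int:
    """Queue depth per the fetcher.queue.depth.multiplier description:
    number of threads times the multiplier."""
    return threads_per_task * depth_multiplier


def skips_page(robots_crawl_delay: float, max_crawl_delay: float) -> bool:
    """Models only the fetcher.max.crawl.delay description: a page is skipped
    when the robots.txt Crawl-Delay exceeds the limit, unless the limit is -1."""
    if max_crawl_delay == -1:
        return False
    return robots_crawl_delay > max_crawl_delay


def max_concurrent_requests(threads: int, host_queues: int,
                            threads_per_queue: int) -> int:
    """Upper bound on simultaneous requests in one fetch task: each queue
    admits at most threads_per_queue threads, and there are only `threads`
    threads in total. host_queues is an assumed input."""
    return min(threads, host_queues * threads_per_queue)


if __name__ == "__main__":
    print(total_fetcher_threads(100, 3))        # 300
    print(queue_depth(100, 150))                # 15000
    print(skips_page(10, 5))                    # True
    print(skips_page(10, -1))                   # False
    print(max_concurrent_requests(100, 2, 10))  # 20
```

If the crawl is limited to a handful of injected hosts (db.ignore.external.links is true), the last bound suggests that the per-queue limit, rather than the 100 threads per task, may be what caps throughput. That is an inference from the descriptions, not a measured result.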







-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

