Hi Prateek,

Alternatively, you could modify the URLPartitioner [1] so that during the
"generate" step the URLs of a specific host or domain are distributed over
more partitions. One partition is the fetch list of one fetcher map task.
At Common Crawl we partition by domain and made the number of partitions
configurable, so that more fetcher tasks can be assigned to certain
super-domains, e.g. wordpress.com or blogspot.com; see [2].
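To illustrate the idea, here is a minimal standalone sketch, not the actual
Common Crawl code in [2]: ordinary domains hash to a single partition, while
domains configured with extra partitions (the map below and its values are
hypothetical) spread their URLs over several consecutive partitions, chosen
by URL hash:

```java
import java.util.HashMap;
import java.util.Map;

public class DomainPartitionerSketch {

    // Hypothetical config: number of partitions per "super-domain".
    // Domains not listed here get the default of 1 partition.
    private final Map<String, Integer> partitionsPerDomain = new HashMap<>();

    public DomainPartitionerSketch() {
        partitionsPerDomain.put("blogspot.com", 4); // assumed value
    }

    public int getPartition(String domain, String url, int numReduceTasks) {
        // Base partition: all URLs of a domain hash to the same slot.
        int base = (domain.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        int n = partitionsPerDomain.getOrDefault(domain, 1);
        if (n <= 1) {
            return base;
        }
        // Super-domain: spread its URLs over n partitions, salted by URL hash.
        int offset = (url.hashCode() & Integer.MAX_VALUE) % n;
        return (base + offset) % numReduceTasks;
    }

    public static void main(String[] args) {
        DomainPartitionerSketch p = new DomainPartitionerSketch();
        // URLs of an ordinary domain all land in the same partition.
        int a = p.getPartition("example.com", "http://example.com/a", 10);
        int b = p.getPartition("example.com", "http://example.com/b", 10);
        System.out.println(a == b);
        // URLs of a configured super-domain may land in up to 4 partitions.
        System.out.println(p.getPartition("blogspot.com", "http://x.blogspot.com/", 10));
    }
}
```

Note that this trades away per-host politeness: once a host's URLs sit in
several fetch lists, the crawl delay is no longer enforced globally for
that host.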

Best,
Sebastian

[1] https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/src/java/org/apache/nutch/crawl/URLPartitioner.java#L75
[2] https://github.com/commoncrawl/nutch/blob/98a137910aa30dcb4fa1acd720fb4a4b7d9c520f/src/java/org/apache/nutch/crawl/URLPartitioner.java#L131 (used by Generator2)


On 5/11/21 3:07 PM, Markus Jelsma wrote:
Hello Prateek,

You are right, it is limited by the number of CPU cores and how many
threads it can handle, but you can still process a million records per day
if you have a few cores. If you parse as a separate step, it can run even
faster.

Indeed, it won't work if you need to process 10 million records of the same
host every day. If you want to use Hadoop for this, you can opt for a
custom YARN application [1]. We have done that too for some of our
distributed tools, and it works very nicely.

Regards,
Markus

[1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

On Tue, May 11, 2021 at 2:54 PM prateek <[email protected]> wrote:

Hi Markus,

Depending on the number of cores on the machine, I can only increase the
number of threads up to a limit; after that, performance degradation comes
into the picture. So running a single mapper will still be a bottleneck in
this case. I am looking for options to distribute the same-domain URLs
across various mappers. Not sure if that's even possible with Nutch or not.

Regards
Prateek

On Tue, May 11, 2021 at 11:58 AM Markus Jelsma <[email protected]> wrote:

Hello Prateek,

If you want to fetch stuff from the same host/domain as fast as possible,
increase the number of threads and the number of threads per queue, then
decrease all the fetch delays.
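For reference, these are regular Nutch properties that can be set in
nutch-site.xml; the values below are only illustrative, not recommendations,
and aggressive settings should only be used against hosts you control:

```xml
<!-- Illustrative values only: tune for your own crawl and be polite
     to hosts you do not control. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
</property>
```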

Regards,
Markus

On Tue, May 11, 2021 at 12:48 PM prateek <[email protected]> wrote:

Hi Lewis,

As mentioned earlier, it does not matter how many mappers I assign to the
fetch task. Since all the URLs are of the same domain, everything will be
assigned to the same mapper and all the other mappers will have no task to
execute. So I am looking for ways I can crawl the same-domain URLs quickly.

Regards
Prateek

On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <[email protected]> wrote:

Hi Prateek,
mapred.map.tasks     -->    mapreduce.job.maps
mapred.reduce.tasks  -->    mapreduce.job.reduces
You should be able to override these in nutch-site.xml, then publish to
your Hadoop cluster.
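For example, in nutch-site.xml (the values below are illustrative):

```xml
<!-- Example override of the Hadoop task counts; values are illustrative. -->
<property>
  <name>mapreduce.job.maps</name>
  <value>20</value>
</property>
<property>
  <name>mapreduce.job.reduces</name>
  <value>20</value>
</property>
```

Note that with the default partitioning this only adds more fetch lists; it
does not by itself spread one domain's URLs across them.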
lewismc

On 2021/05/07 15:18:38, prateek <[email protected]> wrote:
Hi,

I am trying to crawl URLs belonging to the same domain (around 140k), and
because all the same-domain URLs go to the same mapper, only one mapper is
used for fetching. All the others are just a waste of resources. These are
the configurations I have tried till now, but it's still very slow.

Attempt 1 -
         fetcher.threads.fetch : 10
         fetcher.server.delay : 1
         fetcher.threads.per.queue : 1,
         fetcher.server.min.delay : 0.0

Attempt 2 -
         fetcher.threads.fetch : 10
         fetcher.server.delay : 1
         fetcher.threads.per.queue : 3,
         fetcher.server.min.delay : 0.5

Is there a way to distribute the same-domain URLs across all the
fetcher.threads.fetch? I understand that in this case the crawl delay
cannot be enforced across different mappers, but for my use case it's OK
to crawl aggressively. So any suggestions?

Regards
Prateek
