Thanks Sebastian. This makes sense. I will override URLPartitioner accordingly.
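[Editor's note: the partitioning idea Sebastian describes below could be sketched roughly as follows. This is a self-contained, hypothetical illustration, not Nutch's actual `URLPartitioner` API (see links [1] and [2] in the thread for the real code); the class name, the `superDomains` map, and the spread values are all invented for the example.]

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Sketch: URLs are normally partitioned by host/domain, so one fetcher
// map task gets the whole fetch list of a host. Configured "super-domains"
// are instead spread over several partitions, as described in the thread.
public class SuperDomainPartitioner {

    // Hypothetical config: domain -> number of partitions to spread it over.
    private final Map<String, Integer> superDomains = new HashMap<>();

    public SuperDomainPartitioner() {
        superDomains.put("blogspot.com", 4); // example value, not from Nutch
    }

    /** Returns the partition [0, numPartitions) for a URL. */
    public int getPartition(String url, int numPartitions) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (Exception e) {
            return 0; // malformed URL: fall back to partition 0
        }
        // Default behavior: all URLs of a host hash to the same partition.
        int base = (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        Integer spread = superDomains.get(domainOf(host));
        if (spread == null) {
            return base;
        }
        // Super-domain: add a per-URL offset so its URLs land in
        // `spread` distinct partitions instead of one.
        int offset = (url.hashCode() & Integer.MAX_VALUE) % spread;
        return (base + offset) % numPartitions;
    }

    // Naive registered-domain extraction (last two host labels); the real
    // implementation would use a proper public-suffix-aware helper.
    private static String domainOf(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return n >= 2 ? parts[n - 2] + "." + parts[n - 1] : host;
    }
}
```

The key design point is that ordinary hosts keep the one-host-one-partition invariant (so politeness delays still work), while only the configured super-domains trade that invariant for parallelism.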
Regards
Prateek

On Tue, May 11, 2021 at 4:16 PM Sebastian Nagel <[email protected]> wrote:

> Hi Prateek,
>
> Alternatively, you could modify the URLPartitioner [1] so that during the
> "generate" step the URLs of a specific host or domain are distributed over
> more partitions. One partition is the fetch list of one fetcher map task.
> At Common Crawl we partition by domain and made the number of partitions
> configurable to assign more fetcher tasks to certain super-domains, e.g.
> wordpress.com or blogspot.com, see [2].
>
> Best,
> Sebastian
>
> [1] https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/src/java/org/apache/nutch/crawl/URLPartitioner.java#L75
> [2] https://github.com/commoncrawl/nutch/blob/98a137910aa30dcb4fa1acd720fb4a4b7d9c520f/src/java/org/apache/nutch/crawl/URLPartitioner.java#L131
>     (used by Generator2)
>
> On 5/11/21 3:07 PM, Markus Jelsma wrote:
> > Hello Prateek,
> >
> > You are right, it is limited by the number of CPU cores and how many
> > threads they can handle, but you can still process a million records per
> > day if you have a few cores. If you run parsing as a separate step, it
> > can run even faster.
> >
> > Indeed, it won't work if you need to process 10 million records of the
> > same host every day. If you want to use Hadoop for this, you can opt for
> > a custom YARN application [1]. We have done that too for some of our
> > distributed tools, and it works very well.
> >
> > Regards,
> > Markus
> >
> > [1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
> >
> > On Tue, May 11, 2021 at 2:54 PM prateek <[email protected]> wrote:
> >
> >> Hi Markus,
> >>
> >> Depending on the cores of the machine, I can only increase the number
> >> of threads up to a limit; after that, performance degradation comes
> >> into the picture. So running a single mapper will still be a
> >> bottleneck in this case.
> >> I am looking for options to distribute the same-domain URLs across
> >> multiple mappers. I am not sure whether that is even possible with
> >> Nutch.
> >>
> >> Regards
> >> Prateek
> >>
> >> On Tue, May 11, 2021 at 11:58 AM Markus Jelsma <[email protected]> wrote:
> >>
> >>> Hello Prateek,
> >>>
> >>> If you want to fetch from the same host/domain as fast as you want,
> >>> increase the number of threads and the number of threads per queue,
> >>> then decrease all the fetch delays.
> >>>
> >>> Regards,
> >>> Markus
> >>>
> >>> On Tue, May 11, 2021 at 12:48 PM prateek <[email protected]> wrote:
> >>>
> >>>> Hi Lewis,
> >>>>
> >>>> As mentioned earlier, it does not matter how many mappers I assign
> >>>> to the fetch task. Since all the URLs belong to the same domain,
> >>>> they will all go to the same mapper and the other mappers will have
> >>>> no task to execute. So I am looking for ways to crawl same-domain
> >>>> URLs quickly.
> >>>>
> >>>> Regards
> >>>> Prateek
> >>>>
> >>>> On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <[email protected]> wrote:
> >>>>
> >>>>> Hi Prateek,
> >>>>> mapred.map.tasks --> mapreduce.job.maps
> >>>>> mapred.reduce.tasks --> mapreduce.job.reduces
> >>>>> You should be able to override these in nutch-site.xml and then
> >>>>> publish it to your Hadoop cluster.
> >>>>> lewismc
> >>>>>
> >>>>> On 2021/05/07 15:18:38, prateek <[email protected]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I am trying to crawl URLs belonging to the same domain (around
> >>>>>> 140k), and because all URLs of the same domain go to the same
> >>>>>> mapper, only one mapper is used for fetching; all the others are
> >>>>>> just a waste of resources. These are the configurations I have
> >>>>>> tried so far, but it's still very slow.
> >>>>>>
> >>>>>> Attempt 1:
> >>>>>> fetcher.threads.fetch : 10
> >>>>>> fetcher.server.delay : 1
> >>>>>> fetcher.threads.per.queue : 1
> >>>>>> fetcher.server.min.delay : 0.0
> >>>>>>
> >>>>>> Attempt 2:
> >>>>>> fetcher.threads.fetch : 10
> >>>>>> fetcher.server.delay : 1
> >>>>>> fetcher.threads.per.queue : 3
> >>>>>> fetcher.server.min.delay : 0.5
> >>>>>>
> >>>>>> Is there a way to distribute the same-domain URLs across all the
> >>>>>> fetcher threads? I understand that in this case the crawl delay
> >>>>>> cannot be enforced across different mappers, but for my use case
> >>>>>> it is OK to crawl aggressively. Any suggestions?
> >>>>>>
> >>>>>> Regards
> >>>>>> Prateek
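[Editor's note: combining Lewis's nutch-site.xml suggestion with the fetcher properties tried above, the overrides would look roughly like this. The property names are the ones quoted in the thread; the values are Attempt 2's, shown only as an example.]

```xml
<!-- nutch-site.xml: sketch of the overrides discussed in this thread.
     Values are those of "Attempt 2" above; tune for your own crawl. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>3</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value>
  </property>
</configuration>
```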

