Thanks Sebastian. This makes sense. I will override URLPartitioner accordingly.
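[Editor's note: the partitioning idea Sebastian describes below could be sketched roughly as follows. This is a self-contained, hypothetical illustration, not Nutch's actual `URLPartitioner` API (see links [1] and [2] in the thread for the real code); the class name, the `superDomains` map, and the spread values are all invented for the example.]

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Sketch: URLs are normally partitioned by host/domain, so one fetcher
// map task gets the whole fetch list of a host. Configured "super-domains"
// are instead spread over several partitions, as described in the thread.
public class SuperDomainPartitioner {

    // Hypothetical config: domain -> number of partitions to spread it over.
    private final Map<String, Integer> superDomains = new HashMap<>();

    public SuperDomainPartitioner() {
        superDomains.put("blogspot.com", 4); // example value, not from Nutch
    }

    /** Returns the partition [0, numPartitions) for a URL. */
    public int getPartition(String url, int numPartitions) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (Exception e) {
            return 0; // malformed URL: fall back to partition 0
        }
        // Default behavior: all URLs of a host hash to the same partition.
        int base = (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        Integer spread = superDomains.get(domainOf(host));
        if (spread == null) {
            return base;
        }
        // Super-domain: add a per-URL offset so its URLs land in
        // `spread` distinct partitions instead of one.
        int offset = (url.hashCode() & Integer.MAX_VALUE) % spread;
        return (base + offset) % numPartitions;
    }

    // Naive registered-domain extraction (last two host labels); the real
    // implementation would use a proper public-suffix-aware helper.
    private static String domainOf(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return n >= 2 ? parts[n - 2] + "." + parts[n - 1] : host;
    }
}
```

The key design point is that ordinary hosts keep the one-host-one-partition invariant (so politeness delays still work), while only the configured super-domains trade that invariant for parallelism.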
Regards
Prateek

On Tue, May 11, 2021 at 4:16 PM Sebastian Nagel <[email protected]> wrote:

> Hi Prateek,
>
> Alternatively, you could modify the URLPartitioner [1] so that during the
> "generate" step the URLs of a specific host or domain are distributed over
> more partitions. One partition is the fetch list of one fetcher map task.
> At Common Crawl we partition by domain and made the number of partitions
> configurable to assign more fetcher tasks to certain super-domains, e.g.
> wordpress.com or blogspot.com, see [2].
>
> Best,
> Sebastian
>
> [1] https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/src/java/org/apache/nutch/crawl/URLPartitioner.java#L75
> [2] https://github.com/commoncrawl/nutch/blob/98a137910aa30dcb4fa1acd720fb4a4b7d9c520f/src/java/org/apache/nutch/crawl/URLPartitioner.java#L131
>     (used by Generator2)
>
> On 5/11/21 3:07 PM, Markus Jelsma wrote:
> > Hello Prateek,
> >
> > You are right, it is limited by the number of CPU cores and how many
> > threads they can handle, but you can still process a million records per
> > day if you have a few cores. If you run parsing as a separate step, it
> > can run even faster.
> >
> > Indeed, it won't work if you need to process 10 million records of the
> > same host every day. If you want to use Hadoop for this, you can opt for
> > a custom YARN application [1]. We have done that too for some of our
> > distributed tools, and it works very well.
> >
> > Regards,
> > Markus
> >
> > [1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
> >
> > On Tue, May 11, 2021 at 2:54 PM prateek <[email protected]> wrote:
> >
> >> Hi Markus,
> >>
> >> Depending on the cores of the machine, I can only increase the number
> >> of threads up to a limit; after that, performance degradation comes
> >> into the picture. So running a single mapper will still be a
> >> bottleneck in this case.
> >> I am looking for options to distribute the same-domain URLs across
> >> multiple mappers. I am not sure whether that is even possible with
> >> Nutch.
> >>
> >> Regards
> >> Prateek
> >>
> >> On Tue, May 11, 2021 at 11:58 AM Markus Jelsma <[email protected]> wrote:
> >>
> >>> Hello Prateek,
> >>>
> >>> If you want to fetch from the same host/domain as fast as you want,
> >>> increase the number of threads and the number of threads per queue,
> >>> then decrease all the fetch delays.
> >>>
> >>> Regards,
> >>> Markus
> >>>
> >>> On Tue, May 11, 2021 at 12:48 PM prateek <[email protected]> wrote:
> >>>
> >>>> Hi Lewis,
> >>>>
> >>>> As mentioned earlier, it does not matter how many mappers I assign
> >>>> to the fetch task. Since all the URLs belong to the same domain,
> >>>> they will all go to the same mapper and the other mappers will have
> >>>> no task to execute. So I am looking for ways to crawl same-domain
> >>>> URLs quickly.
> >>>>
> >>>> Regards
> >>>> Prateek
> >>>>
> >>>> On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <[email protected]> wrote:
> >>>>
> >>>>> Hi Prateek,
> >>>>> mapred.map.tasks --> mapreduce.job.maps
> >>>>> mapred.reduce.tasks --> mapreduce.job.reduces
> >>>>> You should be able to override these in nutch-site.xml and then
> >>>>> publish it to your Hadoop cluster.
> >>>>> lewismc
> >>>>>
> >>>>> On 2021/05/07 15:18:38, prateek <[email protected]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I am trying to crawl URLs belonging to the same domain (around
> >>>>>> 140k), and because all URLs of the same domain go to the same
> >>>>>> mapper, only one mapper is used for fetching; all the others are
> >>>>>> just a waste of resources. These are the configurations I have
> >>>>>> tried so far, but it's still very slow.
> >>>>>>
> >>>>>> Attempt 1:
> >>>>>> fetcher.threads.fetch : 10
> >>>>>> fetcher.server.delay : 1
> >>>>>> fetcher.threads.per.queue : 1
> >>>>>> fetcher.server.min.delay : 0.0
> >>>>>>
> >>>>>> Attempt 2:
> >>>>>> fetcher.threads.fetch : 10
> >>>>>> fetcher.server.delay : 1
> >>>>>> fetcher.threads.per.queue : 3
> >>>>>> fetcher.server.min.delay : 0.5
> >>>>>>
> >>>>>> Is there a way to distribute the same-domain URLs across all the
> >>>>>> fetcher threads? I understand that in this case the crawl delay
> >>>>>> cannot be enforced across different mappers, but for my use case
> >>>>>> it is OK to crawl aggressively. Any suggestions?
> >>>>>>
> >>>>>> Regards
> >>>>>> Prateek
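[Editor's note: combining Lewis's nutch-site.xml suggestion with the fetcher properties tried above, the overrides would look roughly like this. The property names are the ones quoted in the thread; the values are Attempt 2's, shown only as an example.]

```xml
<!-- nutch-site.xml: sketch of the overrides discussed in this thread.
     Values are those of "Attempt 2" above; tune for your own crawl. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>3</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value>
  </property>
</configuration>
```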

