Hi Sebastian,

Yes, right now I only care about the statistics (basically using HostDB as an improved CrawlCompletionStats). For this reason, and since the number of problematic domains I have right now is small, urlnormalizer-host is good enough for me.

Aggregating over HostDB per domain as a parameter to ReadHostDb would also solve my problem, as you suggest. There is even a comment in the code there that suggests someone already had a similar idea.

To be honest, I don't know which solution is best, and I have a usable workaround, so I don't feel the need to implement a solution right now, unless someone pushes me to 😊
Yossi.

> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 05 March 2018 16:07
> To: user@nutch.apache.org
> Subject: Re: Why doesn't hostdb support byDomain mode?
>
> Hi Yossi,
>
> please don't take it as a vote against your proposal.
> It could be also solved by documenting what's not working with the HostDb
> containing domains.
>
> Are you only about the statistics or also about using the HostDb for
> Generator?
> For the former use case, a solution could be also to aggregate the counts by
> domain. Usually, the HostDb is orders of magnitude smaller than the CrawlDb,
> so this should be considerably fast.
>
> Best,
> Sebastian
>
> On 03/05/2018 02:03 PM, Yossi Tamari wrote:
> > Thanks, I will submit a patch for this. Since this allows me to solve my
> > specific issue, and since Sebastian raised some questions regarding
> > byDomain, I will not proceed with that currently.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: 05 March 2018 14:41
> >> To: user@nutch.apache.org
> >> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>
> >> Ah, well, that is a good one! I took me a while to figure it out, but
> >> having the check there is an error. We had added the same check in an
> >> earlier different Nutch job where the database itself could remove
> >> itself just by the rules it emitted and host normalized enabled.
> >>
> >> I simply reused the job setup code and forgot to remove that check.
> >> You can safely remove that check in HostDB.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >> -----Original message-----
> >>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>> Sent: Monday 5th March 2018 11:30
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>>
> >>> Thanks Markus, I will open a ticket and submit a patch.
> >>> One follow up question: UpdateHostDb checks and throws an exception
> >>> if urlnormalizer-host (which can be used to mitigate the problem I
> >>> mentioned) is enabled. Is that also an internal decision of
> >>> OpenIndex, and perhaps should be removed now that the code is part of
> >>> Nutch, or is there a reason this normalizer must not be used with
> >>> UpdateHostDb?
> >>>
> >>> Yossi.
> >>>
> >>>> -----Original Message-----
> >>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>> Sent: 05 March 2018 12:22
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>>>
> >>>> Hi,
> >>>>
> >>>> The reason is simple, we (company) needed this information based on
> >>>> hostname, so we made a hostdb. I don't see any downside for
> >>>> supporting a domain mode. Adding support for it through
> >>>> hostdb.url.mode seems like a good idea.
> >>>>
> >>>> Regards,
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>>>> Sent: Sunday 4th March 2018 12:01
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: Why doesn't hostdb support byDomain mode?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Is there a reason that hostdb provides per-host data even when the
> >>>>> generate/fetch are working by domain? This generates misleading
> >>>>> statistics for servers that load-balance by redirecting to nodes (e.g.
> >>>>> photobucket).
> >>>>>
> >>>>> If this is just an oversight, I can contribute a patch, but I'm
> >>>>> not sure if I should use partition.url.mode, generate.count.mode,
> >>>>> one of the other similar properties, or create one more such
> >>>>> property hostdb.url.mode.
> >>>>>
> >>>>> Yossi.
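
For reference, the "aggregate the counts by domain" idea discussed in this thread could look roughly like the sketch below. It is not Nutch code and does not read a real HostDb. It assumes the per-host counts have already been exported into a plain mapping, and the field names (`fetched`, `unfetched`), the host names, and the helper functions are all hypothetical. The domain is derived naively from the last two labels of the host name, which is wrong for multi-label suffixes such as `co.uk`; a real implementation would need a public-suffix-aware lookup.

```python
from collections import Counter, defaultdict


def naive_domain(host: str) -> str:
    """Return the last two labels of a host name.

    Simplification (assumption): multi-label public suffixes such as
    "co.uk" are not handled.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def aggregate_by_domain(host_stats):
    """Sum per-host counters into per-domain counters."""
    totals = defaultdict(Counter)
    for host, counts in host_stats.items():
        # Counter.update adds the values key by key.
        totals[naive_domain(host)].update(counts)
    return {domain: dict(counter) for domain, counter in totals.items()}


# Hypothetical per-host counts, standing in for a HostDb dump.
host_stats = {
    "s1.example.com": {"fetched": 10, "unfetched": 2},
    "s2.example.com": {"fetched": 5, "unfetched": 1},
    "www.example.org": {"fetched": 7, "unfetched": 0},
}

by_domain = aggregate_by_domain(host_stats)
```

With this input, the two load-balanced `example.com` hosts collapse into a single `example.com` entry, which is the effect wanted for sites that redirect to per-node host names.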