Hi Sebastian,

Yes, right now I only care about the statistics (basically using HostDB as an 
improved CrawlCompletionStats). For this reason, and since the number of 
problematic domains I have right now is small, urlnormalizer-host is good 
enough for me.
Aggregating over HostDB per domain as a parameter to ReadHostDb would also 
solve my problem, as you suggest. There is even a comment in the code there 
that suggests someone already had a similar idea.
To be honest, I don't know which solution is best, and I have a usable 
work-around, so I don't feel the need to implement a solution right now, unless 
someone pushes me to 😊
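
For illustration only, the per-domain aggregation discussed above could look roughly like the sketch below. It is not Nutch code: the input dictionary, the counter names ("fetched", "unfetched") and both function names are hypothetical, and the domain is approximated by the last two labels of the hostname, which is wrong for suffixes such as co.uk. A real implementation would need a public-suffix-aware lookup.

```python
from collections import Counter, defaultdict


def naive_domain(host: str) -> str:
    """Approximate the domain as the last two labels of a hostname.

    Assumption: no multi-label public suffixes (e.g. co.uk) are involved.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def aggregate_by_domain(host_stats):
    """Sum per-host counters into per-domain counters."""
    totals = defaultdict(Counter)
    for host, stats in host_stats.items():
        # Counter.update adds the counts to any existing totals.
        totals[naive_domain(host)].update(stats)
    return {domain: dict(counts) for domain, counts in totals.items()}


# Hypothetical per-host counts for a site that redirects to numbered nodes.
host_stats = {
    "s1.photobucket.com": {"fetched": 10, "unfetched": 2},
    "s2.photobucket.com": {"fetched": 5, "unfetched": 1},
    "www.example.org": {"fetched": 3, "unfetched": 0},
}
by_domain = aggregate_by_domain(host_stats)
```

Because the host-level data is small compared to the CrawlDb, a single pass like this should be cheap; the hard part is choosing the domain rule, not the summation.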

        Yossi.

> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 05 March 2018 16:07
> To: user@nutch.apache.org
> Subject: Re: Why doesn't hostdb support byDomain mode?
> 
> Hi Yossi,
> 
> Please don't take it as a vote against your proposal.
> It could also be solved by documenting what does not work when the HostDb
> contains domains.
> 
> Are you only concerned with the statistics, or also with using the HostDb
> for the Generator?
> For the former use case, a solution could also be to aggregate the counts by
> domain. Usually, the HostDb is orders of magnitude smaller than the CrawlDb,
> so this should be reasonably fast.
> 
> Best,
> Sebastian
> 
> On 03/05/2018 02:03 PM, Yossi Tamari wrote:
> > Thanks, I will submit a patch for this. Since this allows me to solve my
> > specific issue, and since Sebastian raised some questions regarding
> > byDomain, I will not proceed with that currently.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: 05 March 2018 14:41
> >> To: user@nutch.apache.org
> >> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>
> >> Ah, well, that is a good one! It took me a while to figure it out, but
> >> having the check there is an error. We had added the same check in an
> >> earlier, different Nutch job where the database could remove itself
> >> just by the rules it emitted with the host normalizer enabled.
> >>
> >> I simply reused the job setup code and forgot to remove that check.
> >> You can safely remove that check in HostDB.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >> -----Original message-----
> >>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>> Sent: Monday 5th March 2018 11:30
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>>
> >>> Thanks Markus, I will open a ticket and submit a patch.
> >>> One follow-up question: UpdateHostDb checks and throws an exception
> >>> if urlnormalizer-host (which can be used to mitigate the problem I
> >>> mentioned) is enabled. Is that also an internal decision of
> >>> OpenIndex, and perhaps should be removed now that the code is part of
> >>> Nutch, or is there a reason this normalizer must not be used with
> >>> UpdateHostDb?
> >>>
> >>>   Yossi.
> >>>
> >>>> -----Original Message-----
> >>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>> Sent: 05 March 2018 12:22
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>>>
> >>>> Hi,
> >>>>
> >>>> The reason is simple, we (company) needed this information based on
> >>>> hostname, so we made a hostdb. I don't see any downside for
> >>>> supporting a domain mode. Adding support for it through
> >>>> hostdb.url.mode seems like a good idea.
> >>>>
> >>>> Regards,
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>>>> Sent: Sunday 4th March 2018 12:01
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: Why doesn't hostdb support byDomain mode?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>>
> >>>>>
> >>>>> Is there a reason that hostdb provides per-host data even when the
> >>>>> generate/fetch are working by domain? This generates misleading
> >>>>> statistics for servers that load-balance by redirecting to nodes (e.g.
> >>>>> photobucket).
> >>>>>
> >>>>> If this is just an oversight, I can contribute a patch, but I'm
> >>>>> not sure if I should use partition.url.mode, generate.count.mode,
> >>>>> one of the other similar properties, or create one more such
> >>>>> property, hostdb.url.mode.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Yossi.
> >>>>>
> >>>>>
> >>>
> >>>
> >

