Hi Mike, hi Markus,
there's also
https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up-to-date with the public suffix list.
Or rather, because crawler-commons loads the public suffix list
(for historic reasons named "effective_tld_names.dat") from the class
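
For readers unfamiliar with the public suffix list format: it is a plain-text file with one rule per line, `//` comment lines, blank lines, `*.` wildcard rules and `!` exception rules. The following is a minimal Python sketch of reading such a file into a set of rules. It is not how crawler-commons or Nutch load the list; `parse_suffix_rules` and the sample text are made up for illustration.

```python
def parse_suffix_rules(text):
    """Collect the rules from public-suffix-list style text.

    Hypothetical helper for illustration only. Wildcard ("*.") and
    exception ("!") rules are kept verbatim, not interpreted.
    """
    rules = set()
    for line in text.splitlines():
        # A rule ends at the first whitespace; the rest of the line is ignored.
        fields = line.split()
        # Skip blank lines and "//" comment lines.
        if not fields or fields[0].startswith("//"):
            continue
        rules.add(fields[0].lower())
    return rules


# Sample text in the list's format (shortened, for illustration).
SAMPLE = """\
// sample rules
com
co.uk

google
*.ck
!www.ck
"""

rules = parse_suffix_rules(SAMPLE)
print(sorted(rules))
```

Refreshing the file and re-running a loader like this picks up newly delegated TLDs without code changes.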
Hi Mike,
Yes, it is possible to extend the TLD list. In fact, when the TLD list was
compiled the author left a note explicitly stating that it may not be
complete.
https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template
Please submit a PR if you wish to make any changes or
Hello Mike,
You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
Regards,
Markus
On Tue, 8 Nov 2022 at 11:16, Mike wrote:
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend
> the TLD list?
>
>
Hi!
Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend
the TLD list?
"url":"https://about.google/intl/en_FR/how-our-business-works/",
"tstamp":"2022-11-06T17:22:14.808Z",
"domain":"google",
"digest":"3b9a23d42f200392d12a697bbb8d4d87",
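
The "domain":"google" value above is what a suffix-list lookup would produce if "google" is missing from the list of known suffixes: no label of about.google matches a known suffix, so only the last label is left. Below is a minimal Python sketch of that kind of lookup. It is an illustration under that assumption, not Nutch's actual code; `domain_name` and the suffix sets are hypothetical.

```python
def domain_name(host, known_suffixes):
    """Return the part of host that sits directly above a known suffix.

    Hypothetical helper: strips leading labels until the remainder
    after the first dot is a known suffix. If no suffix ever matches,
    only the last label is returned.
    """
    candidate = host.rstrip(".").lower()
    while "." in candidate:
        rest = candidate.split(".", 1)[1]
        if rest in known_suffixes:
            return candidate
        candidate = rest
    return candidate


# Assumed suffix list that predates the newer TLDs.
old_suffixes = {"com", "org", "co.uk"}

print(domain_name("www.example.co.uk", old_suffixes))            # example.co.uk
print(domain_name("about.google", old_suffixes))                 # google
print(domain_name("about.google", old_suffixes | {"google"}))    # about.google
```

With "google" added to the known suffixes, the same lookup returns about.google, which is the effect extending the TLD list is meant to have.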