Re: Incomplete TLD List

2022-11-08 Thread Sebastian Nagel
Hi Mike, hi Markus, there's also https://issues.apache.org/jira/browse/NUTCH-1806 which would make it much easier to keep up-to-date with the public suffix list. Resp., because crawler-commons loads the public suffix list (for historic reasons named "effective_tld_names.dat") from the class

Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169

2022-11-08 Thread lewis john mcgibbney
Hi Mike, Yes, it is possible to extend the TLD list. In fact, when the TLD list was compiled, the author left a note explicitly stating that it may not be complete. https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template Please submit a PR if you wish to make any changes or

Re: Incomplete TLD List

2022-11-08 Thread Markus Jelsma
Hello Mike, You can try adding the TLD to conf/domain-suffixes.xml and see if it works. Regards, Markus On Tue, 8 Nov 2022 at 11:16, Mike wrote: > Hi! > Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend > the TLD list?

Incomplete TLD List

2022-11-08 Thread Mike
Hi! Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend the TLD list? "url":"https://about.google/intl/en_FR/how-our-business-works/", "tstamp":"2022-11-06T17:22:14.808Z", "domain":"google", "digest":"3b9a23d42f200392d12a697bbb8d4d87",
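
To illustrate why a missing suffix can produce "domain":"google" for the host about.google, here is a minimal Python sketch of suffix-based domain extraction. It is a simplified model written for this thread, not Nutch's actual code: the function name domain_name and the tiny suffix sets are hypothetical, and the real list lives in conf/domain-suffixes.xml.

```python
def domain_name(host, suffixes):
    """Return the shortest trailing part of `host` that sits directly
    in front of a known suffix; fall back to the last label if no
    suffix matches. (Hypothetical helper, simplified for illustration.)"""
    candidate = host.rstrip(".").lower()
    while True:
        index = candidate.find(".")
        if index < 0:
            # No dot left and no suffix matched: only the last label remains.
            return candidate
        rest = candidate[index + 1:]
        if rest in suffixes:
            # `rest` is a known suffix, so `candidate` is the domain.
            return candidate
        candidate = rest


# Assumed, deliberately incomplete suffix list (no "google" entry).
incomplete = {"com", "org"}
# The same list after adding the missing TLD.
extended = incomplete | {"google"}

print(domain_name("about.google", incomplete))
print(domain_name("about.google", extended))
print(domain_name("www.example.com", incomplete))
```

With the incomplete set the lookup falls through to the bare label "google"; once "google" is listed as a suffix, the same host resolves to "about.google".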