Re: Incomplete TLD List
Hi Mike, hi Markus,

there's also
  https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up-to-date with the public suffix list.

More precisely, because crawler-commons loads the public suffix list (for historic reasons named "effective_tld_names.dat") from the class path, it would be quite easy to update the list by simply placing it in the Nutch conf folder.

@Mike: please let us know whether this is an option (for the long term). You may also upvote the Jira issue.

Thanks!

Best,
Sebastian

On 11/8/22 11:45, Markus Jelsma wrote:
> Hello Mike,
>
> You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
>
> Regards,
> Markus
>
> On Tue, 8 Nov 2022 at 11:16, Mike wrote:
>
>> Hi!
>> Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend
>> the TLD list?
>>
>> "url":"https://about.google/intl/en_FR/how-our-business-works/",
>> "tstamp":"2022-11-06T17:22:14.808Z",
>> "domain":"google",
>> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>>
>> Thanks
>>
>> Mike
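To illustrate why keeping the suffix list current matters, here is a minimal sketch in Python. It assumes a list in the public suffix list text format (one rule per line, `//` comments, `*.` wildcard rules). It is not the crawler-commons implementation: the function names and the sample rules are hypothetical, and exception rules (`!`) are not handled.

```python
def parse_suffix_rules(text):
    """Parse public-suffix-list style text into a set of rules."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (comments start with "//").
        if not line or line.startswith("//"):
            continue
        # A rule is the text up to the first whitespace.
        rules.add(line.split()[0].lower())
    return rules


def registered_domain(host, rules):
    """Return the matched suffix plus one more label to its left.

    Simplification (assumption of this sketch): if the host itself is a
    suffix, the host is returned unchanged.
    """
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate first so that the longest rule wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if candidate in rules or (i + 1 < len(labels) and wildcard in rules):
            return ".".join(labels[max(i - 1, 0):])
    # No rule matched: treat the last label as the suffix.
    return ".".join(labels[-2:])


# Hypothetical sample lists: the newer one adds a single rule.
OLD_LIST = "// sample rules\ncom\nuk\n"
NEW_LIST = OLD_LIST + "co.uk\n"

print(registered_domain("www.example.co.uk", parse_suffix_rules(OLD_LIST)))  # co.uk
print(registered_domain("www.example.co.uk", parse_suffix_rules(NEW_LIST)))  # example.co.uk
```

With the outdated sample list the suffix itself is reported as the domain; replacing the list file is enough to correct the result, with no code change.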
Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169
Hi Mike,

Yes, it is possible to extend the TLD list. In fact, when the TLD list was compiled, the author left a note explicitly stating that it may not be complete.

https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template

Please submit a PR if you wish to make any changes or additions. You can use the parser checker tool to validate your change before creating the PR.

Thanks
lewismc

On Tue, Nov 8, 2022 at 02:16 wrote:
>
> -- Forwarded message --
> From: Mike
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 8 Nov 2022 11:15:51 +0100
> Subject: Incomplete TLD List
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/",
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
Re: Incomplete TLD List
Hello Mike,

You can try adding the TLD to conf/domain-suffixes.xml and see if it works.

Regards,
Markus

On Tue, 8 Nov 2022 at 11:16, Mike wrote:

> Hi!
> Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/",
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>
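For context on why the quoted record ends up with "domain":"google", here is a minimal sketch in Python. It assumes the domain is derived by dropping leading host labels until the remainder is a known suffix. This is an illustrative model only, not Nutch's code: `domain_name` and the suffix sets are hypothetical stand-ins for whatever conf/domain-suffixes.xml provides.

```python
def domain_name(host, suffixes):
    """Strip leading labels until what follows the first label is a known suffix."""
    candidate = host.rstrip(".").lower()
    while "." in candidate:
        rest = candidate.split(".", 1)[1]
        if rest in suffixes:
            # Everything after the first label is a known suffix,
            # so the current candidate is the domain.
            return candidate
        candidate = rest
    # No known suffix was found: only the last label is left.
    return candidate


# Hypothetical suffix sets, before and after adding the new TLD.
known_before = {"com", "org", "net"}
known_after = known_before | {"google"}

print(domain_name("about.google", known_before))  # google
print(domain_name("about.google", known_after))   # about.google
```

Under this assumption, a TLD missing from the list makes the lookup fall through to the bare last label, which matches the symptom Mike reported, and adding the TLD restores the expected domain.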
Incomplete TLD List
Hi!

Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend the TLD list?

"url":"https://about.google/intl/en_FR/how-our-business-works/",
"tstamp":"2022-11-06T17:22:14.808Z",
"domain":"google",
"digest":"3b9a23d42f200392d12a697bbb8d4d87",

Thanks

Mike