Re: Incomplete TLD List

2022-11-08 Thread Sebastian Nagel

Hi Mike, hi Markus,

there's also
  https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up-to-date with the public suffix list.

Alternatively, because crawler-commons loads the public suffix list
(for historic reasons named "effective_tld_names.dat") from the class path,
it would be quite easy to update the list by simply placing it in the
Nutch conf folder.
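
For anyone wanting to check whether a given copy of the list already covers
a TLD, here is a minimal Python sketch. It assumes the file follows the usual
public suffix list layout (one rule per line, comments starting with "//").
The function name and the sample excerpt are made up for illustration and are
not part of Nutch or crawler-commons.

```python
def parse_suffix_rules(text):
    """Collect the rules from public-suffix-list style text.

    Assumption: one rule per line, blank lines and lines starting
    with "//" are ignored, and a rule ends at the first whitespace.
    """
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        rules.add(line.split()[0].lower())
    return rules


# Hypothetical excerpt, only for demonstration.
sample = """\
// example comment line
com
co.uk

google
"""

rules = parse_suffix_rules(sample)
print("google" in rules)  # True
print("dev" in rules)     # False
```

Against a real file, the same function could be fed the contents of
conf/effective_tld_names.dat (location assumed from the description above).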

@Mike: please, let us know whether this is an option (for the long term). You 
may also upvote the Jira issue. Thanks!


Best,
Sebastian

On 11/8/22 11:45, Markus Jelsma wrote:

Hello Mike,

You can try adding the TLD to conf/domain-suffixes.xml and see if it works.

Regards,
Markus

On Tue, 8 Nov 2022 at 11:16, Mike wrote:


Hi!
Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
the TLD list?

 "url":"https://about.google/intl/en_FR/how-our-business-works/",
 "tstamp":"2022-11-06T17:22:14.808Z",
 "domain":"google",
 "digest":"3b9a23d42f200392d12a697bbb8d4d87",


Thanks

Mike





Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169

2022-11-08 Thread lewis john mcgibbney
Hi Mike,

Yes, it is possible to extend the TLD list. In fact, when the TLD list was
compiled, the author left a note explicitly stating that it may not be
complete.
https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template
Please submit a PR if you wish to make any changes or additions. You can
use the parser checker tool to validate your change before creating the PR.
Thanks
lewismc

On Tue, Nov 8, 2022 at 02:16  wrote:

>
> -- Forwarded message --
> From: Mike 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 8 Nov 2022 11:15:51 +0100
> Subject: Incomplete TLD List
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/",
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Incomplete TLD List

2022-11-08 Thread Markus Jelsma
Hello Mike,

You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
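
To illustrate why a missing entry gives "domain":"google" for about.google,
here is a simplified Python sketch of suffix-based domain extraction. It
assumes the domain is found by stripping leading host labels until the
remainder is a known suffix. The function name and the tiny suffix sets are
hypothetical, and this is not Nutch's actual code.

```python
def domain_name(host, suffixes):
    """Return the shortest trailing part of host that sits directly
    above a known suffix; fall back to the last label if none matches."""
    candidate = host.rstrip(".").lower()
    while True:
        index = candidate.find(".")
        if index < 0:
            # No dot left: nothing matched, return what remains.
            return candidate
        rest = candidate[index + 1:]
        if rest in suffixes:
            return candidate
        candidate = rest


# "google" missing from the suffix set: only the last label survives.
print(domain_name("about.google", {"com"}))            # google
# "google" present: the registered domain is kept whole.
print(domain_name("about.google", {"com", "google"}))  # about.google
print(domain_name("www.example.com", {"com"}))         # example.com
```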

Regards,
Markus

On Tue, 8 Nov 2022 at 11:16, Mike wrote:

> Hi!
> Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/",
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>


Incomplete TLD List

2022-11-08 Thread Mike
Hi!
Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
the TLD list?

"url":"https://about.google/intl/en_FR/how-our-business-works/",
"tstamp":"2022-11-06T17:22:14.808Z",
"domain":"google",
"digest":"3b9a23d42f200392d12a697bbb8d4d87",


Thanks

Mike