Script 'mail_helper' called by obssrc

Hello community,

here is the log from the commit of package python-tldextract for openSUSE:Factory checked in at 2023-05-21 19:09:08
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-tldextract (Old)
 and      /work/SRC/openSUSE:Factory/.python-tldextract.new.1533 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-tldextract" Sun May 21 19:09:08 2023 rev:18 rq:1088132 version:3.4.4 Changes: -------- --- /work/SRC/openSUSE:Factory/python-tldextract/python-tldextract.changes 2023-05-12 20:40:02.534775795 +0200 +++ /work/SRC/openSUSE:Factory/.python-tldextract.new.1533/python-tldextract.changes 2023-05-21 19:09:52.174832852 +0200 @@ -1,0 +2,24 @@ +Sun May 21 13:02:41 UTC 2023 - Mia Herkt <m...@0x0.st> + +- Update to 3.4.4: +Bugfixes + * Honor private domains flag on self, not only when passed to + __call__ + #gh/john-kurkowski/tldextract#289 +- Changes in 3.4.3: +Bugfixes + * Speed up 10-15% over all inputs + * Refactor suffix_index() to use a trie + #gh/john-kurkowski/tldextract#285 +Docs + * Adopt PEP257 doc style +- Changes in 3.4.2: +Bugfixes + * Speed up 10-40% on "average" inputs, and even more on + pathological inputs, like long subdomains + * Optimize suffix_index(): search from right to left + #gh/john-kurkowski/tldextract#283 + * Optimize netloc extraction: switch from regex to if/else + #gh/john-kurkowski/tldextract#284 + +------------------------------------------------------------------- Old: ---- tldextract-3.4.1.tar.gz New: ---- tldextract-3.4.4.tar.gz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ python-tldextract.spec ++++++ --- /var/tmp/diff_new_pack.tkAnGX/_old 2023-05-21 19:09:52.594835250 +0200 +++ /var/tmp/diff_new_pack.tkAnGX/_new 2023-05-21 19:09:52.598835272 +0200 @@ -18,7 +18,7 @@ %define oldpython python Name: python-tldextract -Version: 3.4.1 +Version: 3.4.4 Release: 0 Summary: Python module to separate the TLD of a URL License: BSD-3-Clause ++++++ tldextract-3.4.1.tar.gz -> tldextract-3.4.4.tar.gz ++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/CHANGELOG.md new/tldextract-3.4.4/CHANGELOG.md --- old/tldextract-3.4.1/CHANGELOG.md 2023-04-27 01:26:27.000000000 +0200 +++ new/tldextract-3.4.4/CHANGELOG.md 2023-05-20 02:30:51.000000000 +0200 @@ -3,6 +3,26 @@ After upgrading, update your cache file by deleting it or via `tldextract --update`. +## 3.4.4 (2023-05-19) + +* Bugfixes + * Honor private domains flag on `self`, not only when passed to `__call__` ([#289](https://github.com/john-kurkowski/tldextract/issues/289)) + +## 3.4.3 (2023-05-18) + +* Bugfixes + * Speed up 10-15% over all inputs + * Refactor `suffix_index()` to use a trie ([#285](https://github.com/john-kurkowski/tldextract/issues/285)) +* Docs + * Adopt PEP257 doc style + +## 3.4.2 (2023-05-16) + +* Bugfixes + * Speed up 10-40% on "average" inputs, and even more on pathological inputs, like long subdomains + * Optimize `suffix_index()`: search from right to left ([#283](https://github.com/john-kurkowski/tldextract/issues/283)) + * Optimize netloc extraction: switch from regex to if/else ([#284](https://github.com/john-kurkowski/tldextract/issues/284)) + ## 3.4.1 (2023-04-26) * Bugfixes diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/PKG-INFO new/tldextract-3.4.4/PKG-INFO --- old/tldextract-3.4.1/PKG-INFO 2023-04-27 01:31:34.373349000 +0200 +++ new/tldextract-3.4.4/PKG-INFO 2023-05-20 02:33:31.880953800 +0200 @@ -1,6 +1,6 @@ Metadata-Version: 2.1 Name: tldextract -Version: 3.4.1 +Version: 3.4.4 Summary: Accurately separates a URL's subdomain, domain, and public suffix, using the Public Suffix List (PSL). By default, this includes the public ICANN TLDs and their exceptions. 
You can optionally support the Public Suffix List's private domains as well. Home-page: https://github.com/john-kurkowski/tldextract Author: John Kurkowski @@ -20,8 +20,9 @@ Description-Content-Type: text/markdown License-File: LICENSE - `tldextract` accurately separates a URL's subdomain, domain, and public suffix, -using the Public Suffix List (PSL). +`tldextract` accurately separates a URL's subdomain, domain, and public suffix. + +It does this via the Public Suffix List (PSL). >>> import tldextract >>> tldextract.extract('http://forums.news.cnn.com/') diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/setup.py new/tldextract-3.4.4/setup.py --- old/tldextract-3.4.1/setup.py 2023-04-27 01:28:16.000000000 +0200 +++ new/tldextract-3.4.4/setup.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,5 +1,6 @@ -""" `tldextract` accurately separates a URL's subdomain, domain, and public suffix, -using the Public Suffix List (PSL). +"""`tldextract` accurately separates a URL's subdomain, domain, and public suffix. + +It does this via the Public Suffix List (PSL). >>> import tldextract >>> tldextract.extract('http://forums.news.cnn.com/') diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tests/__init__.py new/tldextract-3.4.4/tests/__init__.py --- old/tldextract-3.4.1/tests/__init__.py 2023-01-12 02:07:59.000000000 +0100 +++ new/tldextract-3.4.4/tests/__init__.py 2023-05-20 02:25:26.000000000 +0200 @@ -0,0 +1 @@ +"""Package tests.""" diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tests/conftest.py new/tldextract-3.4.4/tests/conftest.py --- old/tldextract-3.4.1/tests/conftest.py 2023-01-12 02:07:59.000000000 +0100 +++ new/tldextract-3.4.4/tests/conftest.py 2023-05-20 02:25:26.000000000 +0200 @@ -3,13 +3,16 @@ import logging import pytest + import tldextract.cache @pytest.fixture(autouse=True) def reset_log_level(): - """Automatically reset log level verbosity between tests. Generally want - test output the Unix way: silence is golden.""" + """Automatically reset log level verbosity between tests. + + Generally want test output the Unix way: silence is golden. + """ tldextract.cache._DID_LOG_UNABLE_TO_CACHE = ( # pylint: disable=protected-access False ) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tests/main_test.py new/tldextract-3.4.4/tests/main_test.py --- old/tldextract-3.4.1/tests/main_test.py 2023-04-10 03:08:44.000000000 +0200 +++ new/tldextract-3.4.4/tests/main_test.py 2023-05-20 02:30:02.000000000 +0200 @@ -7,6 +7,7 @@ import pytest import responses + import tldextract import tldextract.suffix_list from tldextract.cache import DiskCache @@ -36,9 +37,11 @@ extract_using_fallback_to_snapshot_no_cache, ), ) -> None: - """Test helper to compare all the expected and actual attributes and - properties of an extraction. Runs the same comparison across several - permutations of tldextract instance configurations.""" + """Test helper to compare all expected and actual attributes of an extraction. + + Runs the same comparison across several permutations of tldextract instance + configurations. 
+ """ ( expected_fqdn, expected_subdomain, @@ -84,6 +87,14 @@ def test_suffix(): assert_extract("com", ("", "", "", "com")) assert_extract("co.uk", ("", "", "", "co.uk")) + assert_extract("example.ck", ("", "", "", "example.ck")) + assert_extract("www.example.ck", ("www.example.ck", "", "www", "example.ck")) + assert_extract( + "sub.www.example.ck", ("sub.www.example.ck", "sub", "www", "example.ck") + ) + assert_extract("www.ck", ("www.ck", "", "www", "ck")) + assert_extract("nes.buskerud.no", ("", "", "", "nes.buskerud.no")) + assert_extract("buskerud.no", ("buskerud.no", "", "buskerud", "no")) def test_local_host(): @@ -187,9 +198,7 @@ def test_idna_2008(): - """Python supports IDNA 2003. - The IDNA library adds 2008 support for characters like Ã. - """ + """Python supports IDNA 2003. The IDNA library adds 2008 support for characters like Ã.""" assert_extract( "xn--gieen46ers-73a.de", ("xn--gieen46ers-73a.de", "", "xn--gieen46ers-73a", "de"), @@ -205,6 +214,13 @@ def test_scheme(): + assert_extract("//", ("", "", "", "")) + assert_extract("://", ("", "", "", "")) + assert_extract("://example.com", ("", "", "", "")) + assert_extract("a+-.://example.com", ("example.com", "", "example", "com")) + assert_extract("a#//example.com", ("", "", "a", "")) + assert_extract("a@://example.com", ("", "", "", "")) + assert_extract("#//example.com", ("", "", "", "")) assert_extract( "https://mail.google.com/mail", ("mail.google.com", "mail", "google", "com") ) @@ -272,10 +288,29 @@ # ('www.net.cn', 'www', 'net', 'cn')) +def test_no_1st_level_tld(): + assert_extract("za", ("", "", "za", "")) + assert_extract("example.za", ("", "example", "za", "")) + assert_extract("co.za", ("", "", "", "co.za")) + assert_extract("example.co.za", ("example.co.za", "", "example", "co.za")) + assert_extract( + "sub.example.co.za", ("sub.example.co.za", "sub", "example", "co.za") + ) + + def test_dns_root_label(): assert_extract( "http://www.example.com./", ("www.example.com", "www", "example", "com") ) + assert_extract( + "http://www.example.com\u3002/", ("www.example.com", "www", "example", "com") + ) + assert_extract( + "http://www.example.com\uff0e/", ("www.example.com", "www", "example", "com") + ) + assert_extract( + "http://www.example.com\uff61/", ("www.example.com", "www", "example", "com") + ) def test_private_domains(): @@ -317,7 +352,6 @@ def test_cache_permission(mocker, monkeypatch, tmpdir): """Emit a warning once that this can't cache the latest PSL.""" - warning = mocker.patch.object(logging.getLogger("tldextract.cache"), "warning") def no_permission_makedirs(*args, **kwargs): @@ -350,6 +384,17 @@ tldextract.suffix_list.find_first_response(cache, [server], 5) +def test_include_psl_private_domain_attr(): + extract_private = tldextract.TLDExtract(include_psl_private_domains=True) + extract_public = tldextract.TLDExtract(include_psl_private_domains=False) + assert extract_private("foo.uk.com") == ExtractResult( + subdomain="", domain="foo", suffix="uk.com" + ) + assert extract_public("foo.uk.com") == ExtractResult( + subdomain="foo", domain="uk", suffix="com" + ) + + def test_tlds_property(): extract_private = tldextract.TLDExtract( cache_dir=None, suffix_list_urls=(), include_psl_private_domains=True @@ -367,3 +412,32 @@ assert tldextract.extract( "foo.blogspot.com", include_psl_private_domains=True ) == ExtractResult(subdomain="", domain="foo", suffix="blogspot.com") + assert tldextract.extract( + "s3.ap-south-1.amazonaws.com", include_psl_private_domains=True + ) == ExtractResult(subdomain="", domain="", 
suffix="s3.ap-south-1.amazonaws.com") + assert tldextract.extract( + "the-quick-brown-fox.ap-south-1.amazonaws.com", include_psl_private_domains=True + ) == ExtractResult( + subdomain="the-quick-brown-fox.ap-south-1", domain="amazonaws", suffix="com" + ) + assert tldextract.extract( + "ap-south-1.amazonaws.com", include_psl_private_domains=True + ) == ExtractResult(subdomain="ap-south-1", domain="amazonaws", suffix="com") + assert tldextract.extract( + "amazonaws.com", include_psl_private_domains=True + ) == ExtractResult(subdomain="", domain="amazonaws", suffix="com") + assert tldextract.extract( + "s3.cn-north-1.amazonaws.com.cn", include_psl_private_domains=True + ) == ExtractResult(subdomain="", domain="", suffix="s3.cn-north-1.amazonaws.com.cn") + assert tldextract.extract( + "the-quick-brown-fox.cn-north-1.amazonaws.com.cn", + include_psl_private_domains=True, + ) == ExtractResult( + subdomain="the-quick-brown-fox.cn-north-1", domain="amazonaws", suffix="com.cn" + ) + assert tldextract.extract( + "cn-north-1.amazonaws.com.cn", include_psl_private_domains=True + ) == ExtractResult(subdomain="cn-north-1", domain="amazonaws", suffix="com.cn") + assert tldextract.extract( + "amazonaws.com.cn", include_psl_private_domains=True + ) == ExtractResult(subdomain="", domain="amazonaws", suffix="com.cn") diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tests/test_cache.py new/tldextract-3.4.4/tests/test_cache.py --- old/tldextract-3.4.1/tests/test_cache.py 2023-01-12 02:07:59.000000000 +0100 +++ new/tldextract-3.4.4/tests/test_cache.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,4 +1,4 @@ -"""Test the caching functionality""" +"""Test the caching functionality.""" import os.path import sys import types diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tests/test_parallel.py new/tldextract-3.4.4/tests/test_parallel.py --- old/tldextract-3.4.1/tests/test_parallel.py 2023-01-12 02:07:59.000000000 +0100 +++ new/tldextract-3.4.4/tests/test_parallel.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,15 +1,16 @@ -"""Test ability to run in parallel with shared cache""" +"""Test ability to run in parallel with shared cache.""" import os import os.path from multiprocessing import Pool import responses + from tldextract import TLDExtract from tldextract.tldextract import PUBLIC_SUFFIX_LIST_URLS def test_multiprocessing_makes_one_request(tmpdir): - """Ensure there aren't duplicate download requests""" + """Ensure there aren't duplicate download requests.""" process_count = 3 with Pool(processes=process_count) as pool: http_request_counts = pool.map(_run_extractor, [str(tmpdir)] * process_count) @@ -18,7 +19,7 @@ @responses.activate def _run_extractor(cache_dir): - """run the extractor""" + """Run the extractor.""" responses.add(responses.GET, PUBLIC_SUFFIX_LIST_URLS[0], status=208, body="uk.co") extract = TLDExtract(cache_dir=cache_dir) @@ -28,7 +29,7 @@ @responses.activate def test_cache_cleared_by_other_process(tmpdir, monkeypatch): - """Simulate a file being deleted after we check for existence but before we try to delete it""" + """Simulate a file being deleted after we check for existence but before we try to delete it.""" responses.add(responses.GET, PUBLIC_SUFFIX_LIST_URLS[0], status=208, body="uk.com") cache_dir = str(tmpdir) @@ -37,7 +38,7 @@ orig_unlink = os.unlink def evil_unlink(filename): - """Simulates someone delete the file right before we try to""" + """Simulate someone deletes 
the file right before we try to.""" if filename.startswith(cache_dir): orig_unlink(filename) orig_unlink(filename) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tests/test_trie.py new/tldextract-3.4.4/tests/test_trie.py --- old/tldextract-3.4.1/tests/test_trie.py 1970-01-01 01:00:00.000000000 +0100 +++ new/tldextract-3.4.4/tests/test_trie.py 2023-05-20 02:25:26.000000000 +0200 @@ -0,0 +1,53 @@ +"""Trie tests.""" +from itertools import permutations + +from tldextract.tldextract import Trie + + +def test_nested_dict() -> None: + original_keys_sequence = [ + ["a"], + ["a", "d"], + ["a", "b"], + ["a", "b", "c"], + ["c"], + ["c", "b"], + ["d", "f"], + ] + for keys_sequence in permutations(original_keys_sequence): + trie = Trie() + for keys in keys_sequence: + trie.add_suffix(keys) + # check each nested value + # Top level c + assert "c" in trie.matches + top_c = trie.matches["c"] + assert len(top_c.matches) == 1 + assert "b" in top_c.matches + assert top_c.end + # Top level a + assert "a" in trie.matches + top_a = trie.matches["a"] + assert len(top_a.matches) == 2 + # a -> d + assert "d" in top_a.matches + a_to_d = top_a.matches["d"] + assert not a_to_d.matches + # a -> b + assert "b" in top_a.matches + a_to_b = top_a.matches["b"] + assert a_to_b.end + assert len(a_to_b.matches) == 1 + # a -> b -> c + assert "c" in a_to_b.matches + a_to_b_to_c = a_to_b.matches["c"] + assert not a_to_b_to_c.matches + assert top_a.end + # d -> f + assert "d" in trie.matches + top_d = trie.matches["d"] + assert not top_d.end + assert "f" in top_d.matches + d_to_f = top_d.matches["f"] + assert d_to_f.end + assert not d_to_f.matches diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract/_version.py new/tldextract-3.4.4/tldextract/_version.py --- old/tldextract-3.4.1/tldextract/_version.py 2023-04-27 01:31:34.000000000 +0200 +++ new/tldextract-3.4.4/tldextract/_version.py 2023-05-20 02:33:31.000000000 +0200 @@ -1,4 +1,4 @@ # file generated by setuptools_scm # don't change, don't track in version control -__version__ = version = '3.4.1' -__version_tuple__ = version_tuple = (3, 4, 1) +__version__ = version = '3.4.4' +__version_tuple__ = version_tuple = (3, 4, 4) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract/cache.py new/tldextract-3.4.4/tldextract/cache.py --- old/tldextract-3.4.1/tldextract/cache.py 2023-04-10 03:08:44.000000000 +0200 +++ new/tldextract-3.4.4/tldextract/cache.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,4 +1,4 @@ -"""Helpers """ +"""Helpers.""" import errno import hashlib import json @@ -30,7 +30,7 @@ def get_pkg_unique_identifier() -> str: """ - Generate an identifier unique to the python version, tldextract version, and python instance + Generate an identifier unique to the python version, tldextract version, and python instance. This will prevent interference between virtualenvs and issues that might arise when installing a new version of tldextract @@ -61,7 +61,7 @@ def get_cache_dir() -> str: """ - Get a cache dir that we have permission to write to + Get a cache dir that we have permission to write to. 
Try to follow the XDG standard, but if that doesn't work fallback to the package directory http://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html @@ -86,7 +86,7 @@ class DiskCache: - """Disk _cache that only works for jsonable values""" + """Disk _cache that only works for jsonable values.""" def __init__(self, cache_dir: Optional[str], lock_timeout: int = 20): self.enabled = bool(cache_dir) @@ -115,7 +115,7 @@ def set( self, namespace: str, key: Union[str, Dict[str, Hashable]], value: object ) -> None: - """Set a value in the disk cache""" + """Set a value in the disk cache.""" if not self.enabled: return @@ -142,7 +142,7 @@ _DID_LOG_UNABLE_TO_CACHE = True def clear(self) -> None: - """Clear the disk cache""" + """Clear the disk cache.""" for root, _, files in os.walk(self.cache_dir): for filename in files: if filename.endswith(self.file_ext) or filename.endswith( @@ -175,7 +175,7 @@ kwargs: Dict[str, Hashable], hashed_argnames: Iterable[str], ) -> T: - """Get a url but cache the response""" + """Get a url but cache the response.""" if not self.enabled: return func(**kwargs) @@ -215,7 +215,7 @@ def cached_fetch_url( self, session: requests.Session, url: str, timeout: Union[float, int, None] ) -> str: - """Get a url but cache the response""" + """Get a url but cache the response.""" return self.run_and_cache( func=_fetch_url, namespace="urls", @@ -241,7 +241,7 @@ def _make_dir(filename: str) -> None: - """Make a directory if it doesn't already exist""" + """Make a directory if it doesn't already exist.""" if not os.path.exists(os.path.dirname(filename)): try: os.makedirs(os.path.dirname(filename)) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract/cli.py new/tldextract-3.4.4/tldextract/cli.py --- old/tldextract-3.4.1/tldextract/cli.py 2023-04-10 03:15:19.000000000 +0200 +++ new/tldextract-3.4.4/tldextract/cli.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,4 +1,4 @@ -"""tldextract CLI""" +"""tldextract CLI.""" import argparse @@ -12,7 +12,7 @@ def main() -> None: - """tldextract CLI main command.""" + """Tldextract CLI main command.""" logging.basicConfig() parser = argparse.ArgumentParser( diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract/remote.py new/tldextract-3.4.4/tldextract/remote.py --- old/tldextract-3.4.1/tldextract/remote.py 2023-01-12 02:07:59.000000000 +0100 +++ new/tldextract-3.4.4/tldextract/remote.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,36 +1,51 @@ -"tldextract helpers for testing and fetching remote resources." +"""tldextract helpers for testing and fetching remote resources.""" import re import socket from urllib.parse import scheme_chars IP_RE = re.compile( - # pylint: disable-next=line-too-long - r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$" + r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)" + r"{3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$" ) -SCHEME_RE = re.compile(r"^([" + scheme_chars + "]+:)?//") +scheme_chars_set = set(scheme_chars) def lenient_netloc(url: str) -> str: - """Extract the netloc of a URL-like string, similar to the netloc attribute - returned by urllib.parse.{urlparse,urlsplit}, but extract more leniently, - without raising errors.""" + """Extract the netloc of a URL-like string. + Similar to the netloc attribute returned by + urllib.parse.{urlparse,urlsplit}, but extract more leniently, without + raising errors. 
+ """ return ( - SCHEME_RE.sub("", url) + _schemeless_url(url) .partition("/")[0] .partition("?")[0] .partition("#")[0] - .split("@")[-1] + .rpartition("@")[-1] .partition(":")[0] .strip() - .rstrip(".") + .rstrip(".\u3002\uff0e\uff61") ) +def _schemeless_url(url: str) -> str: + double_slashes_start = url.find("//") + if double_slashes_start == 0: + return url[2:] + if ( + double_slashes_start < 2 + or not url[double_slashes_start - 1] == ":" + or set(url[: double_slashes_start - 1]) - scheme_chars_set + ): + return url + return url[double_slashes_start + 2 :] + + def looks_like_ip(maybe_ip: str) -> bool: - """Does the given str look like an IP address?""" + """Check whether the given str looks like an IP address.""" if not maybe_ip[0].isdigit(): return False diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract/suffix_list.py new/tldextract-3.4.4/tldextract/suffix_list.py --- old/tldextract-3.4.1/tldextract/suffix_list.py 2023-01-12 02:48:06.000000000 +0100 +++ new/tldextract-3.4.4/tldextract/suffix_list.py 2023-05-20 02:25:26.000000000 +0200 @@ -1,4 +1,4 @@ -"tldextract helpers for testing and fetching remote resources." +"""tldextract helpers for testing and fetching remote resources.""" import logging import pkgutil @@ -17,8 +17,11 @@ class SuffixListNotFound(LookupError): - """A recoverable error while looking up a suffix list. Recoverable because - you can specify backups, or use this library's bundled snapshot.""" + """A recoverable error while looking up a suffix list. + + Recoverable because you can specify backups, or use this library's bundled + snapshot. + """ def find_first_response( @@ -26,9 +29,7 @@ urls: Sequence[str], cache_fetch_timeout: Union[float, int, None] = None, ) -> str: - """Decode the first successfully fetched URL, from UTF-8 encoding to - Python unicode. - """ + """Decode the first successfully fetched URL, from UTF-8 encoding to Python unicode.""" with requests.Session() as session: session.mount("file://", FileAdapter()) @@ -46,8 +47,7 @@ def extract_tlds_from_suffix_list(suffix_list_text: str) -> Tuple[List[str], List[str]]: - """Parse the raw suffix list text for its different designations of - suffixes.""" + """Parse the raw suffix list text for its different designations of suffixes.""" public_text, _, private_text = suffix_list_text.partition( PUBLIC_PRIVATE_SUFFIX_SEPARATOR ) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract/tldextract.py new/tldextract-3.4.4/tldextract/tldextract.py --- old/tldextract-3.4.1/tldextract/tldextract.py 2023-01-12 02:48:06.000000000 +0100 +++ new/tldextract-3.4.4/tldextract/tldextract.py 2023-05-20 02:30:02.000000000 +0200 @@ -1,5 +1,6 @@ -""" `tldextract` accurately separates a URL's subdomain, domain, and public suffix, -using the Public Suffix List (PSL). +"""`tldextract` accurately separates a URL's subdomain, domain, and public suffix. + +It does this via the Public Suffix List (PSL). 
>>> import tldextract @@ -48,12 +49,22 @@ '127.0.0.1' """ +from __future__ import annotations + import logging import os -import re -from functools import wraps -from typing import FrozenSet, List, NamedTuple, Optional, Sequence, Union import urllib.parse +from functools import wraps +from typing import ( + Collection, + Dict, + FrozenSet, + List, + NamedTuple, + Optional, + Sequence, + Union, +) import idna @@ -71,8 +82,6 @@ "https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat", ) -_UNICODE_DOTS_RE = re.compile("[\u002e\u3002\uff0e\uff61]") - class ExtractResult(NamedTuple): """namedtuple of a URL's subdomain, domain, and suffix.""" @@ -91,8 +100,8 @@ >>> extract('http://localhost:8080').registered_domain '' """ - if self.domain and self.suffix: - return self.domain + "." + self.suffix + if self.suffix and self.domain: + return f"{self.domain}.{self.suffix}" return "" @property @@ -105,7 +114,7 @@ >>> extract('http://localhost:8080').fqdn '' """ - if self.domain and self.suffix: + if self.suffix and self.domain: # Disable bogus lint error (https://github.com/PyCQA/pylint/issues/2568) # pylint: disable-next=not-an-iterable return ".".join(i for i in self if i) @@ -114,7 +123,7 @@ @property def ipv4(self) -> str: """ - Returns the ipv4 if that is what the presented domain/url is + Returns the ipv4 if that is what the presented domain/url is. >>> extract('http://127.0.0.1/path/to/file').ipv4 '127.0.0.1' @@ -129,8 +138,7 @@ class TLDExtract: - """A callable for extracting, subdomain, domain, and suffix components from - a URL.""" + """A callable for extracting, subdomain, domain, and suffix components from a URL.""" # TODO: Agreed with Pylint: too-many-arguments def __init__( # pylint: disable=too-many-arguments @@ -142,9 +150,7 @@ extra_suffixes: Sequence[str] = (), cache_fetch_timeout: Union[str, float, None] = CACHE_TIMEOUT, ) -> None: - """ - Constructs a callable for extracting subdomain, domain, and suffix - components from a URL. + """Construct a callable for extracting subdomain, domain, and suffix components from a URL. Upon calling it, it first checks for a JSON in `cache_dir`. By default, the `cache_dir` will live in the tldextract directory. You can disable @@ -207,17 +213,17 @@ self._cache = DiskCache(cache_dir) def __call__( - self, url: str, include_psl_private_domains: Optional[bool] = None + self, url: str, include_psl_private_domains: bool | None = None ) -> ExtractResult: """Alias for `extract_str`.""" return self.extract_str(url, include_psl_private_domains) def extract_str( - self, url: str, include_psl_private_domains: Optional[bool] = None + self, url: str, include_psl_private_domains: bool | None = None ) -> ExtractResult: - """ - Takes a string URL and splits it into its subdomain, domain, and - suffix (effective TLD, gTLD, ccTLD, etc.) components. + """Take a string URL and splits it into its subdomain, domain, and suffix components. + + I.e. its effective TLD, gTLD, ccTLD, etc. components. >>> extractor = TLDExtract() >>> extractor.extract_str('http://forums.news.cnn.com/') @@ -232,10 +238,10 @@ url: Union[urllib.parse.ParseResult, urllib.parse.SplitResult], include_psl_private_domains: Optional[bool] = None, ) -> ExtractResult: - """ - Takes the output of urllib.parse URL parsing methods and further splits - the parsed URL into its subdomain, domain, and suffix (effective TLD, - gTLD, ccTLD, etc.) components. + """Take the output of urllib.parse URL parsing methods and further splits the parsed URL. 
+ + Splits the parsed URL into its subdomain, domain, and suffix + components, i.e. its effective TLD, gTLD, ccTLD, etc. components. This method is like `extract_str` but faster, as the string's domain name has already been parsed. @@ -251,18 +257,22 @@ def _extract_netloc( self, netloc: str, include_psl_private_domains: Optional[bool] ) -> ExtractResult: - labels = _UNICODE_DOTS_RE.split(netloc) + labels = ( + netloc.replace("\u3002", "\u002e") + .replace("\uff0e", "\u002e") + .replace("\uff61", "\u002e") + .split(".") + ) - translations = [_decode_punycode(label) for label in labels] suffix_index = self._get_tld_extractor().suffix_index( - translations, include_psl_private_domains=include_psl_private_domains + labels, include_psl_private_domains=include_psl_private_domains ) - suffix = ".".join(labels[suffix_index:]) - if not suffix and netloc and looks_like_ip(netloc): + if suffix_index == len(labels) and netloc and looks_like_ip(netloc): return ExtractResult("", netloc, "") - subdomain = ".".join(labels[: suffix_index - 1]) if suffix_index else "" + suffix = ".".join(labels[suffix_index:]) if suffix_index != len(labels) else "" + subdomain = ".".join(labels[: suffix_index - 1]) if suffix_index >= 2 else "" domain = labels[suffix_index - 1] if suffix_index else "" return ExtractResult(subdomain, domain, suffix) @@ -276,22 +286,23 @@ @property def tlds(self) -> List[str]: """ - Returns the list of tld's used by default + Returns the list of tld's used by default. This will vary based on `include_psl_private_domains` and `extra_suffixes` """ return list(self._get_tld_extractor().tlds()) - def _get_tld_extractor(self) -> "_PublicSuffixListTLDExtractor": - """Get or compute this object's TLDExtractor. Looks up the TLDExtractor - in roughly the following order, based on the settings passed to - __init__: + def _get_tld_extractor(self) -> _PublicSuffixListTLDExtractor: + """Get or compute this object's TLDExtractor. + + Looks up the TLDExtractor in roughly the following order, based on the + settings passed to __init__: 1. Memoized on `self` 2. Local system _cache file 3. Remote PSL, over HTTP - 4. Bundled PSL snapshot file""" - + 4. Bundled PSL snapshot file + """ if self._extractor: return self._extractor @@ -317,6 +328,37 @@ TLD_EXTRACTOR = TLDExtract() +class Trie: + """Trie for storing eTLDs with their labels in reverse-order.""" + + def __init__(self, matches: Optional[Dict] = None, end: bool = False) -> None: + self.matches = matches if matches else {} + self.end = end + + @staticmethod + def create(suffixes: Collection[str]) -> Trie: + """Create a Trie from a list of suffixes and return its root node.""" + root_node = Trie() + + for suffix in suffixes: + suffix_labels = suffix.split(".") + suffix_labels.reverse() + root_node.add_suffix(suffix_labels) + + return root_node + + def add_suffix(self, labels: List[str]) -> None: + """Append a suffix's labels to this Trie node.""" + node = self + + for label in labels: + if label not in node.matches: + node.matches[label] = Trie() + node = node.matches[label] + + node.end = True + + @wraps(TLD_EXTRACTOR.__call__) def extract( # pylint: disable=missing-function-docstring url: str, include_psl_private_domains: Optional[bool] = False @@ -331,9 +373,7 @@ class _PublicSuffixListTLDExtractor: - """Wrapper around this project's main algo for PSL - lookups. 
- """ + """Wrapper around this project's main algo for PSL lookups.""" def __init__( self, @@ -348,6 +388,8 @@ self.private_tlds = private_tlds self.tlds_incl_private = frozenset(public_tlds + private_tlds + extra_tlds) self.tlds_excl_private = frozenset(public_tlds + extra_tlds) + self.tlds_incl_private_trie = Trie.create(self.tlds_incl_private) + self.tlds_excl_private_trie = Trie.create(self.tlds_excl_private) def tlds( self, include_psl_private_domains: Optional[bool] = None @@ -363,27 +405,41 @@ ) def suffix_index( - self, lower_spl: List[str], include_psl_private_domains: Optional[bool] = None + self, spl: List[str], include_psl_private_domains: Optional[bool] = None ) -> int: - """Returns the index of the first suffix label. - Returns len(spl) if no suffix is found + """Return the index of the first suffix label. + + Returns len(spl) if no suffix is found. """ - tlds = self.tlds(include_psl_private_domains) - length = len(lower_spl) - for i in range(length): - maybe_tld = ".".join(lower_spl[i:]) - exception_tld = "!" + maybe_tld - if exception_tld in tlds: - return i + 1 - - if maybe_tld in tlds: - return i - - wildcard_tld = "*." + ".".join(lower_spl[i + 1 :]) - if wildcard_tld in tlds: - return i + if include_psl_private_domains is None: + include_psl_private_domains = self.include_psl_private_domains + + node = ( + self.tlds_incl_private_trie + if include_psl_private_domains + else self.tlds_excl_private_trie + ) + i = len(spl) + j = i + for label in reversed(spl): + decoded_label = _decode_punycode(label) + if decoded_label in node.matches: + j -= 1 + if node.matches[decoded_label].end: + i = j + node = node.matches[decoded_label] + continue + + is_wildcard = "*" in node.matches + if is_wildcard: + is_wildcard_exception = "!" + decoded_label in node.matches + if is_wildcard_exception: + return j + return j - 1 + + break - return length + return i def _decode_punycode(label: str) -> str: diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract.egg-info/PKG-INFO new/tldextract-3.4.4/tldextract.egg-info/PKG-INFO --- old/tldextract-3.4.1/tldextract.egg-info/PKG-INFO 2023-04-27 01:31:34.000000000 +0200 +++ new/tldextract-3.4.4/tldextract.egg-info/PKG-INFO 2023-05-20 02:33:31.000000000 +0200 @@ -1,6 +1,6 @@ Metadata-Version: 2.1 Name: tldextract -Version: 3.4.1 +Version: 3.4.4 Summary: Accurately separates a URL's subdomain, domain, and public suffix, using the Public Suffix List (PSL). By default, this includes the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well. Home-page: https://github.com/john-kurkowski/tldextract Author: John Kurkowski @@ -20,8 +20,9 @@ Description-Content-Type: text/markdown License-File: LICENSE - `tldextract` accurately separates a URL's subdomain, domain, and public suffix, -using the Public Suffix List (PSL). +`tldextract` accurately separates a URL's subdomain, domain, and public suffix. + +It does this via the Public Suffix List (PSL). 
>>> import tldextract >>> tldextract.extract('http://forums.news.cnn.com/') diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tldextract-3.4.1/tldextract.egg-info/SOURCES.txt new/tldextract-3.4.4/tldextract.egg-info/SOURCES.txt --- old/tldextract-3.4.1/tldextract.egg-info/SOURCES.txt 2023-04-27 01:31:34.000000000 +0200 +++ new/tldextract-3.4.4/tldextract.egg-info/SOURCES.txt 2023-05-20 02:33:31.000000000 +0200 @@ -18,6 +18,7 @@ tests/main_test.py tests/test_cache.py tests/test_parallel.py +tests/test_trie.py tests/fixtures/fake_suffix_list_fixture.dat tldextract/.tld_set_snapshot tldextract/__init__.py
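
Some context on the upstream changes above, with short runnable Python sketches. These are illustrations against the diff, not part of the package.

On the 3.4.2 netloc change: tldextract/remote.py now detects a scheme by scanning for "//" and checking that it is immediately preceded by ":" with only characters from urllib.parse's scheme_chars before it (see _schemeless_url in the diff), instead of applying SCHEME_RE. The new test_scheme cases pin the behavior; a few of them restated, assuming tldextract 3.4.2 or later is installed (the first call may download and cache the PSL):

import tldextract
from tldextract.tldextract import ExtractResult

# "a", "+", "-", "." are all valid scheme characters, so "a+-.://" parses
# as a scheme and is stripped before extraction.
assert tldextract.extract("a+-.://example.com").domain == "example"

# "@" is not a scheme character, so "a@://" is not treated as a scheme
# and no hostname is recovered.
assert tldextract.extract("a@://example.com") == ExtractResult("", "", "")

# A scheme-relative URL (leading "//") is still handled; _schemeless_url
# strips the leading "//" when it starts the string.
assert tldextract.extract("//example.com/path").domain == "example"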
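On the root-label change: lenient_netloc() now rstrips U+3002, U+FF0E, and U+FF61 alongside the ASCII dot, so a trailing Unicode dot variant is treated as the DNS root label, matching the new test_dns_root_label cases. For example, assuming tldextract 3.4.2 or later:

import tldextract

# All four dot variants now work as a trailing root label.
for dot in ("\u002e", "\u3002", "\uff0e", "\uff61"):
    result = tldextract.extract(f"http://www.example.com{dot}/")
    assert result.fqdn == "www.example.com"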
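On the 3.4.3 trie refactor: the new Trie class in the diff stores each suffix with its labels reversed, so suffix_index() can walk a hostname's labels right to left and remember the deepest node marked as a suffix end, instead of joining and hashing candidate suffix strings at every label. A minimal self-contained sketch of that lookup, for illustration only (tldextract's real version also decodes punycode and handles wildcard and exception rules):

class SuffixTrie:
    """Suffixes keyed by reversed labels, e.g. "co.uk" -> uk -> co."""

    def __init__(self):
        self.matches = {}  # label -> SuffixTrie
        self.end = False   # path from the root spells a complete suffix

    def add_suffix(self, suffix):
        node = self
        for label in reversed(suffix.split(".")):
            node = node.matches.setdefault(label, SuffixTrie())
        node.end = True


def suffix_index(labels, trie):
    """Index of the first suffix label, or len(labels) if none match."""
    node = trie
    best = len(labels)
    for i in range(len(labels) - 1, -1, -1):
        node = node.matches.get(labels[i])
        if node is None:
            break
        if node.end:
            best = i
    return best


trie = SuffixTrie()
for suffix in ("com", "uk", "co.uk"):
    trie.add_suffix(suffix)

labels = "forums.news.cnn.com".split(".")
i = suffix_index(labels, trie)
assert (".".join(labels[:i]), ".".join(labels[i:])) == ("forums.news.cnn", "com")
assert suffix_index("www.example.co.uk".split("."), trie) == 2  # "co.uk" beats "uk"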
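The wildcard and exception rules ("*" and "!" entries in the PSL) that the trie lookup folds in are what the new .ck cases in test_suffix exercise: "*.ck" makes every second-level .ck name a suffix, except for the "!www.ck" exception. Assuming tldextract 3.4.3 or later with a PSL that still carries these .ck rules:

import tldextract

# Wildcard "*.ck": example.ck is itself a public suffix, so "www" becomes
# the registrable domain under it.
assert tldextract.extract("www.example.ck").suffix == "example.ck"
assert tldextract.extract("www.example.ck").domain == "www"

# Exception "!www.ck": www.ck itself is registrable, with suffix "ck".
assert tldextract.extract("www.ck").domain == "www"
assert tldextract.extract("www.ck").suffix == "ck"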
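On the 3.4.4 fix: include_psl_private_domains passed to the TLDExtract constructor is now honored when the extractor is called without the per-call argument, exactly as the new test_include_psl_private_domain_attr asserts. Assuming tldextract 3.4.4:

import tldextract

extract_private = tldextract.TLDExtract(include_psl_private_domains=True)
extract_public = tldextract.TLDExtract(include_psl_private_domains=False)

# uk.com is a PSL *private* suffix; only the private-domains extractor
# treats it as one. Before 3.4.4 the constructor flag was ignored here.
assert extract_private("foo.uk.com").suffix == "uk.com"
assert extract_public("foo.uk.com").suffix == "com"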