Script 'mail_helper' called by obssrc
Hello community,
here is the log from the commit of package python-beautifulsoup4 for
openSUSE:Factory checked in at 2025-12-29 15:15:51
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
and /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.1928 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-beautifulsoup4"
Mon Dec 29 15:15:51 2025 rev:46 rq:1324671 version:4.14.3
Changes:
--------
--- /work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes 2025-10-15 12:44:37.334994871 +0200
+++ /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.1928/python-beautifulsoup4.changes 2025-12-29 15:16:06.221284579 +0100
@@ -1,0 +2,21 @@
+Mon Dec 29 09:58:48 UTC 2025 - Markéta Machová <[email protected]>
+
+- update to 4.14.3
+ * When using one of the lxml tree builders, you can pass in
+ huge_tree=True to disable lxml's security restrictions and process
+ files that include huge text nodes. ("huge" means more than
+ 10,000,000 bytes of text in a single node). Without this, lxml may
+ silently stop processing the file after encountering a huge text
+ node.
+ * The html.parser tree builder processes numeric character entities
+ using the algorithm described in the HTML spec. If this means
+ replacing some other character with REPLACEMENT CHARACTER, it will
+ set BeautifulSoup.contains_replacement_characters.
+ * Added a general test of the html.parser tree builder's ability to
+ turn any parsing exception from html.parser into a
+ ParserRejectedMarkup exception. This makes it possible to remove
+ version-dependent tests that depended on the existence of specific
+ bugs in html.parser.
+- Add htmlparser.patch to fix behaviour with cpython interpreters
+
+-------------------------------------------------------------------
Old:
----
beautifulsoup4-4.14.2.tar.gz
New:
----
beautifulsoup4-4.14.3.tar.gz
htmlparser.patch
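For context on the changelog entry above: the numeric-character-entity handling added in 4.14.3 follows the HTML spec's "numeric character reference end state" algorithm. A minimal standalone sketch of that logic is below; the function name `resolve_ncr` and the abbreviated remap table are illustrative only, not part of the bs4 API (the real implementation is `UnicodeDammit.numeric_character_reference`, shown in the dammit.py hunk further down).

```python
# Standalone sketch of the HTML spec's "numeric character reference
# end state" algorithm, mirroring the logic bs4 4.14.3 adds to
# bs4/dammit.py. resolve_ncr is an illustrative name, not bs4 API.

# C1-control code points that the spec remaps as if they were
# Windows-1252 bytes (abbreviated; the spec table has more rows).
WINDOWS_1252_REMAP = {
    0x80: "\u20ac",  # EURO SIGN
    0x91: "\u2018",  # LEFT SINGLE QUOTATION MARK
    0x92: "\u2019",  # RIGHT SINGLE QUOTATION MARK
    0x93: "\u201c",  # LEFT DOUBLE QUOTATION MARK
    0x94: "\u201d",  # RIGHT DOUBLE QUOTATION MARK
    0x9F: "\u0178",  # LATIN CAPITAL LETTER Y WITH DIAERESIS
}

REPLACEMENT = "\N{REPLACEMENT CHARACTER}"  # U+FFFD

def resolve_ncr(numeric: int) -> tuple:
    """Return (character, replaced) for one numeric character reference."""
    # Null, negative, out-of-range, and surrogate references all
    # resolve to U+FFFD, and we record that a replacement happened.
    if numeric <= 0x00 or numeric > 0x10FFFF or 0xD800 <= numeric <= 0xDFFF:
        return REPLACEMENT, True
    # C1-control references were probably meant as Windows-1252; remap.
    if numeric in WINDOWS_1252_REMAP:
        return WINDOWS_1252_REMAP[numeric], False
    # Everything else, including noncharacters, resolves as-is.
    return chr(numeric), False
```

Note the last case: a reference to U+FFFD itself resolves "cleanly" (`replaced` is False), because nothing was substituted — it *is* the character. That is why `contains_replacement_characters` is only set when a replacement was actually performed.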
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.CcxKUl/_old 2025-12-29 15:16:07.021317440 +0100
+++ /var/tmp/diff_new_pack.CcxKUl/_new 2025-12-29 15:16:07.021317440 +0100
@@ -18,12 +18,14 @@
%{?sle15_python_module_pythons}
Name: python-beautifulsoup4
-Version: 4.14.2
+Version: 4.14.3
Release: 0
Summary:        HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
License: MIT
URL: https://www.crummy.com/software/BeautifulSoup/
Source:         https://files.pythonhosted.org/packages/source/b/beautifulsoup4/beautifulsoup4-%{version}.tar.gz
+# PATCH-FIX-UPSTREAM 55f655ffb7ef03bdd1df0f013743831fe54e3c7a Change the html.parser tree builder's code for handling numeric character references
+Patch0: htmlparser.patch
BuildRequires: %{python_module base >= 3.7}
BuildRequires: %{python_module hatchling}
BuildRequires: %{python_module pip}
++++++ beautifulsoup4-4.14.2.tar.gz -> beautifulsoup4-4.14.3.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/CHANGELOG
new/beautifulsoup4-4.14.3/CHANGELOG
--- old/beautifulsoup4-4.14.2/CHANGELOG 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/CHANGELOG 2020-02-02 01:00:00.000000000 +0100
@@ -1,3 +1,30 @@
+= 4.14.3 (20251130)
+
+* When using one of the lxml tree builders, you can pass in
+ huge_tree=True to disable lxml's security restrictions and process
+ files that include huge text nodes. ("huge" means more than
+ 10,000,000 bytes of text in a single node). Without this, lxml may
+ silently stop processing the file after encountering a huge text
+ node. [bug=2072424]
+
+* The html.parser tree builder processes numeric character entities
+ using the algorithm described in the HTML spec. If this means
+ replacing some other character with REPLACEMENT CHARACTER, it will
+ set BeautifulSoup.contains_replacement_characters. [bug=2126753]
+
+ The other tree builders rely on the underlying parser to do this
+ sort of replacement. That means that Beautiful Soup never sees the
+ original character reference, so it doesn't know whether
+ REPLACEMENT_CHARACTER was the original content; therefore
+ the html.parser tree builder will set contains_replacement_characters in
+ situations where the other tree builders won't.
+
+* Added a general test of the html.parser tree builder's ability to
+ turn any parsing exception from html.parser into a
+ ParserRejectedMarkup exception. This makes it possible to remove
+ version-dependent tests that depended on the existence of specific
+ bugs in html.parser. [bug=2121335,2121335]
+
= 4.14.2 (20250929)
* Making ResultSet inherit from MutableSequence still resulted in too many
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/PKG-INFO
new/beautifulsoup4-4.14.3/PKG-INFO
--- old/beautifulsoup4-4.14.2/PKG-INFO 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/PKG-INFO 2020-02-02 01:00:00.000000000 +0100
@@ -1,6 +1,6 @@
Metadata-Version: 2.4
Name: beautifulsoup4
-Version: 4.14.2
+Version: 4.14.3
Summary: Screen-scraping library
Project-URL: Download, https://www.crummy.com/software/BeautifulSoup/bs4/download/
Project-URL: Homepage, https://www.crummy.com/software/BeautifulSoup/bs4/
@@ -19,7 +19,7 @@
Classifier: Topic :: Text Processing :: Markup :: SGML
Classifier: Topic :: Text Processing :: Markup :: XML
Requires-Python: >=3.7.0
-Requires-Dist: soupsieve>1.2
+Requires-Dist: soupsieve>=1.6.1
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: cchardet
Requires-Dist: cchardet; extra == 'cchardet'
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/__init__.py
new/beautifulsoup4-4.14.3/bs4/__init__.py
--- old/beautifulsoup4-4.14.2/bs4/__init__.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/__init__.py 2020-02-02 01:00:00.000000000 +0100
@@ -15,7 +15,7 @@
"""
__author__ = "Leonard Richardson ([email protected])"
-__version__ = "4.14.2"
+__version__ = "4.14.3"
__copyright__ = "Copyright (c) 2004-2025 Leonard Richardson"
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/builder/_htmlparser.py
new/beautifulsoup4-4.14.3/bs4/builder/_htmlparser.py
--- old/beautifulsoup4-4.14.2/bs4/builder/_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/builder/_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
@@ -234,6 +234,7 @@
# HTMLParser. (http://bugs.python.org/issue13633) The bug has
# been fixed, but removing this code still makes some
# Beautiful Soup tests fail. This needs investigation.
+ real_name:int
if name.startswith("x"):
real_name = int(name.lstrip("x"), 16)
elif name.startswith("X"):
@@ -241,26 +242,9 @@
else:
real_name = int(name)
- data = None
- if real_name < 256:
- # HTML numeric entities are supposed to reference Unicode
- # code points, but sometimes they reference code points in
- # some other encoding (ahem, Windows-1252). E.g. “
- # instead of É for LEFT DOUBLE QUOTATION MARK. This
- # code tries to detect this situation and compensate.
- for encoding in (self.soup.original_encoding, "windows-1252"):
- if not encoding:
- continue
- try:
- data = bytearray([real_name]).decode(encoding)
- except UnicodeDecodeError:
- pass
- if not data:
- try:
- data = chr(real_name)
- except (ValueError, OverflowError):
- pass
- data = data or "\N{REPLACEMENT CHARACTER}"
+ data, replacement_added = UnicodeDammit.numeric_character_reference(real_name)
+ if replacement_added:
+ self.soup.contains_replacement_characters = True
self.handle_data(data)
def handle_entityref(self, name: str) -> None:
@@ -446,7 +430,11 @@
dammit.contains_replacement_characters,
)
- def feed(self, markup: _RawMarkup) -> None:
+ def feed(self, markup: _RawMarkup, _parser_class:type[BeautifulSoupHTMLParser] =BeautifulSoupHTMLParser) -> None:
+ """
+ :param markup: The markup to feed into the parser.
+ :param _parser_class: An HTMLParser subclass to use. This is only intended for use in unit tests.
+ """
args, kwargs = self.parser_args
# HTMLParser.feed will only handle str, but
@@ -461,7 +449,7 @@
# before calling feed(), so we can assume self.soup
# is set.
assert self.soup is not None
- parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
+ parser = _parser_class(self.soup, *args, **kwargs)
try:
parser.feed(markup)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/builder/_lxml.py
new/beautifulsoup4-4.14.3/bs4/builder/_lxml.py
--- old/beautifulsoup4-4.14.2/bs4/builder/_lxml.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/builder/_lxml.py 2020-02-02 01:00:00.000000000 +0100
@@ -81,6 +81,12 @@
is_xml: bool = True
+ #: Set this to true (probably by passing huge_tree=True into the
+ #: BeautifulSoup constructor) to enable the lxml feature "disable security
+ #: restrictions and support very deep trees and very long text
+ #: content".
+ huge_tree: bool
+
processing_instruction_class: Type[ProcessingInstruction]
NAME: str = "lxml-xml"
@@ -153,7 +159,7 @@
"""
if self._default_parser is not None:
return self._default_parser
- return self.DEFAULT_PARSER_CLASS(target=self, recover=True, encoding=encoding)
+ return self.DEFAULT_PARSER_CLASS(target=self, recover=True, huge_tree=self.huge_tree, encoding=encoding)
def parser_for(self, encoding: Optional[_Encoding]) -> _LXMLParser:
"""Instantiate an appropriate parser for the given encoding.
@@ -166,14 +172,15 @@
if callable(parser):
# Instantiate the parser with default arguments
- parser = parser(target=self, recover=True, encoding=encoding)
+ parser = parser(target=self, recover=True, huge_tree=self.huge_tree, encoding=encoding)
return parser
def __init__(
- self,
- parser: Optional[etree.XMLParser] = None,
- empty_element_tags: Optional[Set[str]] = None,
- **kwargs: Any,
+ self,
+ parser: Optional[etree.XMLParser] = None,
+ empty_element_tags: Optional[Set[str]] = None,
+ huge_tree: bool = False,
+ **kwargs: Any,
):
# TODO: Issue a warning if parser is present but not a
# callable, since that means there's no way to create new
@@ -189,6 +196,8 @@
if "attribute_dict_class" not in kwargs:
kwargs["attribute_dict_class"] = XMLAttributeDict
+ self.huge_tree = huge_tree
+
super(LXMLTreeBuilderForXML, self).__init__(**kwargs)
def _getNsTag(self, tag: str) -> Tuple[Optional[str], str]:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/dammit.py
new/beautifulsoup4-4.14.3/bs4/dammit.py
--- old/beautifulsoup4-4.14.2/bs4/dammit.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/dammit.py 2020-02-02 01:00:00.000000000 +0100
@@ -1192,7 +1192,9 @@
}
#: A map used when removing rogue Windows-1252/ISO-8859-1
- #: characters in otherwise UTF-8 documents.
+ #: characters in otherwise UTF-8 documents. Also used when a
+ #: numeric character entity has been incorrectly encoded using the
+ #: character's Windows-1252 encoding.
#:
#: Note that \\x81, \\x8d, \\x8f, \\x90, and \\x9d are undefined in
#: Windows-1252.
@@ -1321,8 +1323,38 @@
0xFC: b"\xc3\xbc", # ü
0xFD: b"\xc3\xbd", # ý
0xFE: b"\xc3\xbe", # þ
+ 0xFF: b"\xc3\xbf", # ÿ
}
+ #: :meta private:
+ # Note that this isn't all Unicode noncharacters, just the noncontiguous ones that need to be listed.
+ #
+ # "A noncharacter is a code point that is in the range
+ # U+FDD0 to U+FDEF, inclusive, or U+FFFE, U+FFFF, U+1FFFE,
+ # U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE,
+ # U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE,
+ # U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
+ # U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE,
+ # U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE,
+ # or U+10FFFF."
+ ENUMERATED_NONCHARACTERS: Set[int] = set([0xfffe, 0xffff,
+ 0x1fffe, 0x1ffff,
+ 0x2fffe, 0x2ffff,
+ 0x3fffe, 0x3ffff,
+ 0x4fffe, 0x4ffff,
+ 0x5fffe, 0x5ffff,
+ 0x6fffe, 0x6ffff,
+ 0x7fffe, 0x7ffff,
+ 0x8fffe, 0x8ffff,
+ 0x9fffe, 0x9ffff,
+ 0xafffe, 0xaffff,
+ 0xbfffe, 0xbffff,
+ 0xcfffe, 0xcffff,
+ 0xdfffe, 0xdffff,
+ 0xefffe, 0xeffff,
+ 0xffffe, 0xfffff,
+ 0x10fffe, 0x10ffff])
+
#: :meta private:
MULTIBYTE_MARKERS_AND_SIZES: List[Tuple[int, int, int]] = [
(0xC2, 0xDF, 2), # 2-byte characters start with a byte C2-DF
@@ -1337,6 +1369,82 @@
LAST_MULTIBYTE_MARKER: int = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
@classmethod
+ def numeric_character_reference(cls, numeric:int) -> Tuple[str, bool]:
+ """This (mostly) implements the algorithm described in "Numeric character
+ reference end state" from the HTML spec:
+ https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
+
+ The algorithm is designed to convert numeric character references like "&#9731;"
+ to Unicode characters like "☃".
+
+ :return: A 2-tuple (character, replaced). `character` is the Unicode
+ character corresponding to the numeric reference and `replaced` is
+ whether or not an unresolvable character was replaced with REPLACEMENT
+ CHARACTER.
+ """
+ replacement = "\ufffd"
+
+ if numeric == 0x00:
+ # "If the number is 0x00, then this is a
+ # null-character-reference parse error. Set the character
+ # reference code to 0xFFFD."
+ return replacement, True
+
+ if numeric > 0x10ffff:
+ # "If the number is greater than 0x10FFFF, then this is a
+ # character-reference-outside-unicode-range parse
+ # error. Set the character reference code to 0xFFFD."
+ return replacement, True
+
+ if numeric >= 0xd800 and numeric <= 0xdfff:
+ # "If the number is a surrogate, then this is a
+ # surrogate-character-reference parse error. Set the
+ # character reference code to 0xFFFD."
+ return replacement, True
+
+ if (numeric >= 0xfdd0 and numeric <= 0xfdef) or numeric in cls.ENUMERATED_NONCHARACTERS:
+ # "If the number is a noncharacter, then this is a
+ # noncharacter-character-reference parse error."
+ #
+ # "The parser resolves such character references as-is."
+ #
+ # I'm not sure what "as-is" means but I think it means that we act
+ # like there was no error condition.
+ return chr(numeric), False
+
+ # "If the number is 0x0D, or a control that's not ASCII whitespace,
+ # then this is a control-character-reference parse error."
+ #
+ # "A control is a C0 control or a code point in the range
+ # U+007F DELETE to U+009F APPLICATION PROGRAM COMMAND,
+ # inclusive."
+ #
+ # "A C0 control is a code point in the range U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive."
+ #
+ # "The parser resolves such character references as-is except C1 control references that are replaced."
+
+ # First, let's replace the control references that can be replaced.
+ if numeric >= 0x80 and numeric <= 0x9f and numeric in cls.WINDOWS_1252_TO_UTF8:
+ # "If the number is one of the numbers in the first column of the
+ # following table, then find the row with that number in the first
+ # column, and set the character reference code to the number in the
+ # second column of that row."
+ #
+ # This is an attempt to catch characters that were encoded to numeric
+ # entities using their Windows-1252 encodings rather than their UTF-8
+ # encodings.
+ return cls.WINDOWS_1252_TO_UTF8[numeric].decode("utf8"), False
+
+ # Now all that's left are references that should be resolved as-is. This
+ # is also the default path for non-weird character references.
+ try:
+ return chr(numeric), False
+ except (ValueError, OverflowError):
+ # This shouldn't happen, since these cases should have been handled
+ # above, but if it does, return REPLACEMENT CHARACTER
+ return replacement, True
+
+ @classmethod
def detwingle(
cls,
in_bytes: bytes,
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/__init__.py
new/beautifulsoup4-4.14.3/bs4/tests/__init__.py
--- old/beautifulsoup4-4.14.2/bs4/tests/__init__.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/__init__.py 2020-02-02 01:00:00.000000000 +0100
@@ -1125,6 +1125,15 @@
soup = self.soup(BAD_DOCUMENT)
self.linkage_validator(soup)
+ @pytest.mark.skipif("sys.version_info < (3, 8)")
+ def test_surrogate_in_character_reference(self):
+ # These character references are invalid and should be replaced with REPLACEMENT CHARACTER.
+ soup = self.soup("<html><body>��</body></html>")
+ assert soup.body.contents == ['��']
+
+ # Only with html.parser (q.v.) do we actually see the invalid character references on the way in. Since
+ # other builders don't see them we don't know the REPLACEMENT CHARACTER wasn't originally there.
+ assert soup.contains_replacement_characters == False
class XMLTreeBuilderSmokeTest(TreeBuilderSmokeTest):
def test_pickle_and_unpickle_identity(self):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/beautifulsoup4-4.14.2/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase
new/beautifulsoup4-4.14.3/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase
--- old/beautifulsoup4-4.14.2/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase 1970-01-01 01:00:00.000000000 +0100
@@ -1,2 +0,0 @@
-
-<![
\ No newline at end of file
Binary files
old/beautifulsoup4-4.14.2/bs4/tests/fuzz/crash-ffbdfa8a2b26f13537b68d3794b0478a4090ee4a.testcase
and
new/beautifulsoup4-4.14.3/bs4/tests/fuzz/crash-ffbdfa8a2b26f13537b68d3794b0478a4090ee4a.testcase
differ
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_dammit.py
new/beautifulsoup4-4.14.3/bs4/tests/test_dammit.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_dammit.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_dammit.py 2020-02-02 01:00:00.000000000 +0100
@@ -442,3 +442,71 @@
def test_ambiguous_ampersands_escaped(self, markup, expect):
assert self.sub.substitute_html(markup) == expect
assert self.sub.substitute_html5_raw(markup) == expect
+
+class TestNumericCharacterReference:
+
+ # Numeric character references that trigger replacement with REPLACEMENT CHARACTER.
+ @pytest.mark.parametrize(
+ "numeric", [0x00, # Null
+ 0x11ffff, # Out of range
+ -0x11, # Out of range
+ 0xd800, # Surrogate
+ 0xda12, # Surrogate
+ 0xdfff, # Surrogate
+ ]
+ )
+ def test_bad_numeric_entities(self, numeric):
+ assert UnicodeDammit.numeric_character_reference(numeric) == ("\N{REPLACEMENT CHARACTER}", True)
+
+ # Numeric character references that are treated as having been mistakenly encoded from Windows-1252.
+ @pytest.mark.parametrize(
+ "numeric,character", [
+ (0x80, "€"),
+ (0x8C, "Œ"),
+ (0x91, "‘"),
+ (0x92, "’"),
+ (0x93, "“"),
+ (0x94, "”"),
+ (0x97, "—"),
+ (0x9F, "Ÿ"),
+ ]
+ )
+ def test_windows_1252_numeric_entities(self, numeric, character):
+ assert UnicodeDammit.numeric_character_reference(numeric) == (character, False)
+
+ # Numeric references that are resolved to the corresponding Unicode characters, even though
+ # some of them also trigger parser error conditions.
+ @pytest.mark.parametrize(
+ "numeric,character", [
+ # Noncharacters
+ (0xfdd0, "�"),
+ (0xfdd9, "�"),
+ (0xfdef, "�"),
+ (0x2fffe, "�"),
+ (0xaffff, "�"),
+ (0x10ffff, "�"),
+
+ # Characters that are the same in Windows-1252 and UTF-8.
+ (0xFE, "þ"),
+ (0xFF, "ÿ"),
+
+ # ASCII whitespace
+ (0x09, "\t"),
+ (0x0a, "\n"),
+ (0x0c, "\x0c"),
+ (0x0d, "\r"),
+ (0x20, " "),
+
+ # The numeric entity for REPLACEMENT CHARACTER is converted
+ # correctly but the boolean part of the return value is false,
+ # because REPLACEMENT CHARACTER wasn't used to replace some other
+ # character--it _is_ the other character.
+ (0xfffd, "�"),
+
+ # Miscellaneous spot checks.
+ (0x61, "a"),
+ (0x2603, "☃"),
+ ]
+ )
+ def test_normal_numeric_entities(self, numeric, character):
+ assert UnicodeDammit.numeric_character_reference(numeric) == (character, False)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_fuzz.py
new/beautifulsoup4-4.14.3/bs4/tests/test_fuzz.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_fuzz.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_fuzz.py 2020-02-02 01:00:00.000000000 +0100
@@ -66,23 +66,6 @@
return
soup.prettify()
- # This class of error has been fixed by catching a less helpful
- # exception from html.parser and raising ParserRejectedMarkup
- # instead.
- @pytest.mark.parametrize(
- "filename",
- [
- "clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912",
- "crash-ffbdfa8a2b26f13537b68d3794b0478a4090ee4a",
- ],
- )
- # Fixed in https://github.com/python/cpython/issues/77057
- @pytest.mark.skipif("sys.version_info >= (3, 13)")
- def test_rejected_markup(self, filename):
- markup = self.__markup(filename)
- with pytest.raises(ParserRejectedMarkup):
- BeautifulSoup(markup, "html.parser")
-
# This class of error has to do with very deeply nested documents
# which overflow the Python call stack when the tree is converted
# to a string. This is an issue with Beautiful Soup which was fixed
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_htmlparser.py
new/beautifulsoup4-4.14.3/bs4/tests/test_htmlparser.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
@@ -8,6 +8,7 @@
BeautifulSoupHTMLParser,
HTMLParserTreeBuilder,
)
+from bs4 import BeautifulSoup
from bs4.exceptions import ParserRejectedMarkup
from typing import Any
from . import HTMLTreeBuilderSmokeTest
@@ -16,28 +17,19 @@
class TestHTMLParserTreeBuilder(HTMLTreeBuilderSmokeTest):
default_builder = HTMLParserTreeBuilder
- # Fixed in https://github.com/python/cpython/issues/77057
- @pytest.mark.skipif("sys.version_info >= (3, 13)")
- def test_rejected_input(self):
- # Python's html.parser will occasionally reject markup,
- # especially when there is a problem with the initial DOCTYPE
- # declaration. Different versions of Python sound the alarm in
- # different ways, but Beautiful Soup consistently raises
- # errors as ParserRejectedMarkup exceptions.
- bad_markup = [
- # https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=28873
- # https://github.com/guidovranken/python-library-fuzzers/blob/master/corp-html/519e5b4269a01185a0d5e76295251921da2f0700
- # https://github.com/python/cpython/issues/81928
- b"\n<![\xff\xfe\xfe\xcd\x00",
- # https://github.com/guidovranken/python-library-fuzzers/blob/master/corp-html/de32aa55785be29bbc72a1a8e06b00611fb3d9f8
- # https://github.com/python/cpython/issues/78661
- #
- b"<![n\x00",
- b"<![UNKNOWN[]]>",
- ]
- for markup in bad_markup:
- with pytest.raises(ParserRejectedMarkup):
- self.soup(markup)
+ def test_feed_raises_correct_exception_on_rejected_input(self):
+ # Mock BeautifulSoupHTMLParser so it raises an AssertionError and verify that this is
+ # turned into a ParserRejectedMarkup.
+ #
+ # This replaces a test that relied on bugs in html.parser which have been fixed.
+ class Mock(BeautifulSoupHTMLParser):
+ def feed(self, markup):
+ raise AssertionError("all markup is bad!")
+
+ with pytest.raises(ParserRejectedMarkup):
+ builder = HTMLParserTreeBuilder()
+ builder.soup = BeautifulSoup()
+ builder.feed("any markup", Mock)
def test_namespaced_system_doctype(self):
# html.parser can't handle namespaced doctypes, so skip this one.
@@ -48,7 +40,7 @@
pass
def test_builder_is_pickled(self):
- """Unlike most tree builders, HTMLParserTreeBuilder and will
+ """Unlike most tree builders, HTMLParserTreeBuilder can be pickled and will
be restored after pickling.
"""
tree = self.soup("<a><b>foo</a>")
@@ -161,3 +153,12 @@
markup = "<p>a &nosuchentity; b</p>"
soup = self.soup(markup)
assert "<p>a &nosuchentity b</p>" == soup.p.decode()
+
+ def test_surrogate_in_character_reference(self):
+ # These character references are invalid and should be replaced with REPLACEMENT CHARACTER.
+ soup = self.soup("<html><body>��</body></html>")
+ assert soup.body.contents == ['��']
+
+ # Since we do the replacement ourselves, we can set contains_replacement_characters appropriately.
+ # lxml and html5lib do the replacement so all we ever see is REPLACEMENT CHARACTER.
+ assert soup.contains_replacement_characters == True
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_lxml.py
new/beautifulsoup4-4.14.3/bs4/tests/test_lxml.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_lxml.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_lxml.py 2020-02-02 01:00:00.000000000 +0100
@@ -199,3 +199,31 @@
assert "some markup" == unpickled.a.string
assert unpickled.builder != soup.builder
assert isinstance(unpickled.builder, self.default_builder)
+
+ def test_huge_tree(self):
+ # Verify that a tree with very large text nodes can be completely parsed
+ # if huge_tree=True.
+ def doc(size):
+ points = 'A'*size
+ input_svg = f'''<?xml version="1.0" encoding="utf-8"?>
+<svg xmlns="http://www.w3.org/2000/svg">
+ <g id="one"/>
+ <g id="two">
+ <polygon points="{points}" id="p1"/>
+ </g>
+ <g id="three"/>
+ <g id="four">
+ <polygon points="{points}" id="p2"/>
+ <polygon id="p3" points="AAA"/>
+ </g>
+ <g id="five"/>
+</svg>
+'''
+ return input_svg
+
+ d = doc(10000000) # libxml2 XML_MAX_TEXT_LENGTH
+ soup = self.soup(d, huge_tree=True)
+
+ # This would be 4 with huge_tree=False, but there's no need to
+ # lock in a test for that, since it's undesirable behavior.
+ assert len(soup.find_all("g")) == 5
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/doc/index.rst
new/beautifulsoup4-4.14.3/doc/index.rst
--- old/beautifulsoup4-4.14.2/doc/index.rst 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/doc/index.rst 2020-02-02 01:00:00.000000000 +0100
@@ -16,7 +16,7 @@
how to use it, how to make it do what you want, and what to do when it
violates your expectations.
-This document covers Beautiful Soup version 4.14.2. The examples in
+This document covers Beautiful Soup version 4.14.3. The examples in
this documentation were written for Python 3.8.
You might be looking for the documentation for `Beautiful Soup 3
@@ -891,8 +891,8 @@
gives you the entire ancestry of an element, including the element
itself::
- for parent in link.parents:
- print(parent.name)
+ for tag in link.self_and_parents:
+ print(tag.name)
# link
# p
# body
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/pyproject.toml
new/beautifulsoup4-4.14.3/pyproject.toml
--- old/beautifulsoup4-4.14.2/pyproject.toml 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/pyproject.toml 2020-02-02 01:00:00.000000000 +0100
@@ -31,7 +31,7 @@
]
dependencies = [
"typing-extensions>=4.0.0",
- "soupsieve >1.2",
+ "soupsieve>=1.6.1",
]
[project.optional-dependencies]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/tox.ini
new/beautifulsoup4-4.14.3/tox.ini
--- old/beautifulsoup4-4.14.2/tox.ini 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/tox.ini 2020-02-02 01:00:00.000000000 +0100
@@ -19,7 +19,7 @@
commands = pytest {tty:--color=yes} {posargs}
[testenv:docs]
-basepython = python3
+basepython = python3.13
description = build the documentation
deps = sphinx
commands = sphinx-build -b html doc doc/build/html
++++++ htmlparser.patch ++++++
>From 55f655ffb7ef03bdd1df0f013743831fe54e3c7a Mon Sep 17 00:00:00 2001
From: Leonard Richardson <[email protected]>
Date: Mon, 8 Dec 2025 19:34:16 -0500
Subject: * Change the html.parser tree builder's code for handling numeric
character references, to avoid a crash when using Python versions that
include the fix to Python issue https://bugs.python.org/issue13633 (e.g.
Python 3.11.13). [bug=2134393]
---
CHANGELOG | 5 +++
bs4/builder/_htmlparser.py | 78 ++++++++++++++++++++++++++++++++++++--------
bs4/tests/test_htmlparser.py | 17 ++++++++++
3 files changed, 86 insertions(+), 14 deletions(-)
diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
index 165a3d8..ead800f 100644
--- a/bs4/builder/_htmlparser.py
+++ b/bs4/builder/_htmlparser.py
@@ -10,6 +10,7 @@ __all__ = [
]
from html.parser import HTMLParser
+import re
from typing import (
Any,
@@ -223,6 +224,64 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
"""Handle some textual data that shows up between tags."""
self.soup.handle_data(data)
+ _DECIMAL_REFERENCE_WITH_FOLLOWING_DATA = re.compile("^([0-9]+)(.*)")
+ _HEX_REFERENCE_WITH_FOLLOWING_DATA = re.compile("^([0-9a-f]+)(.*)")
+
+ @classmethod
+ def _dereference_numeric_character_reference(cls, name:str) -> Tuple[str, bool, str]:
+ """Convert a numeric character reference into an actual character.
+
+ :param name: The number of the character reference, as
+ obtained by html.parser
+
+ :return: A 3-tuple (dereferenced, replacement_added,
+ extra_data). `dereferenced` is the dereferenced character
+ reference, or the empty string if there was no
+ reference. `replacement_added` is True if the reference
+ could only be dereferenced by replacing content with U+FFFD
+ REPLACEMENT CHARACTER. `extra_data` is a portion of data
+ following the character reference, which was deemed to be
+ normal data and not part of the reference at all.
+ """
+ dereferenced:str = ""
+ replacement_added:bool = False
+ extra_data:str = ""
+
+ base:int = 10
+ reg = cls._DECIMAL_REFERENCE_WITH_FOLLOWING_DATA
+ if name.startswith("x") or name.startswith("X"):
+ # Hex reference
+ name = name[1:]
+ base = 16
+ reg = cls._HEX_REFERENCE_WITH_FOLLOWING_DATA
+
+ real_name:Optional[int] = None
+ try:
+ real_name = int(name, base)
+ except ValueError:
+ # This is either bad data that starts with what looks like
+ # a numeric character reference, or a real numeric
+ # reference that wasn't terminated by a semicolon.
+ #
+ # The fix to https://bugs.python.org/issue13633 made it
+ # our responsibility to handle the extra data.
+ #
+ # To preserve the old behavior, we extract the numeric
+ # portion of the incoming "reference" and treat that as a
+ # numeric reference. All subsequent data will be processed
+ # as string data.
+ match = reg.search(name)
+ if match is not None:
+ real_name = int(match.groups()[0], base)
+ extra_data = match.groups()[1]
+
+ if real_name is None:
+ dereferenced = ""
+ extra_data = name
+ else:
+ dereferenced, replacement_added = UnicodeDammit.numeric_character_reference(real_name)
+ return dereferenced, replacement_added, extra_data
+
def handle_charref(self, name: str) -> None:
"""Handle a numeric character reference by converting it to the
corresponding Unicode character and treating it as textual
@@ -230,22 +289,13 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
:param name: Character number, possibly in hexadecimal.
"""
- # TODO: This was originally a workaround for a bug in
- # HTMLParser. (http://bugs.python.org/issue13633) The bug has
- # been fixed, but removing this code still makes some
- # Beautiful Soup tests fail. This needs investigation.
- real_name:int
- if name.startswith("x"):
- real_name = int(name.lstrip("x"), 16)
- elif name.startswith("X"):
- real_name = int(name.lstrip("X"), 16)
- else:
- real_name = int(name)
-
- data, replacement_added = UnicodeDammit.numeric_character_reference(real_name)
+ dereferenced, replacement_added, extra_data = self._dereference_numeric_character_reference(name)
if replacement_added:
self.soup.contains_replacement_characters = True
- self.handle_data(data)
+ if dereferenced is not None:
+ self.handle_data(dereferenced)
+ if extra_data is not None:
+ self.handle_data(extra_data)
def handle_entityref(self, name: str) -> None:
"""Handle a named entity reference by converting it to the
diff --git a/bs4/tests/test_htmlparser.py b/bs4/tests/test_htmlparser.py
index 0086a9d..cb85b53 100644
--- a/bs4/tests/test_htmlparser.py
+++ b/bs4/tests/test_htmlparser.py
@@ -162,3 +162,20 @@ class TestHTMLParserTreeBuilder(HTMLTreeBuilderSmokeTest):
# Since we do the replacement ourselves, we can set contains_replacement_characters appropriately.
# lxml and html5lib do the replacement so all we ever see is REPLACEMENT CHARACTER.
assert soup.contains_replacement_characters == True
+
+class TestBeautifulSoupHTMLParser:
+ def test_dereference_numeric_character_reference(self):
+ m = BeautifulSoupHTMLParser._dereference_numeric_character_reference
+ assert m("64") == ("@", False, "")
+ assert m("x64") == ("d", False, "")
+ assert m("X64") == ("d", False, "")
+ assert m("64andsomeextra") == ("@", False, "andsomeextra")
+ assert m("") == ("", False, "")
+ assert m("00whee") == ("�", True, "whee")
+ assert m("xfffdthatsit") == ("�", False, "thatsit")
+ assert m("xabcdplussomeextra") == ("ꯍ", False, "plussomeextra")
+ assert m("obviouslynotnumeric") == ("", False, "obviouslynotnumeric")
+
+ # These are almost certainly wrong but at least it doesn't crash.
+ assert m("xabcdandsomeextra") == ("\U000abcda", False, "ndsomeextra")
+ assert m("xffffffffffffffffffffffbeep") == ("�", True, "p")
--
cgit v1.2.3
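The core of htmlparser.patch is the splitting step: once the fix to Python issue13633 lands, html.parser may hand handle_charref a payload like "64andsomeextra", so the tree builder must peel off the numeric prefix and emit the remainder as ordinary data. A standalone sketch of that idea, under the assumption of well-behaved code points (the real method additionally routes the number through UnicodeDammit.numeric_character_reference, which also covers out-of-range values; `split_charref` is an illustrative name, not bs4 API):

```python
import re

# Sketch of the regex-splitting idea from htmlparser.patch: extract the
# leading decimal or hex digits of a charref payload and treat whatever
# follows as plain text data. split_charref is an illustrative name.
_DECIMAL = re.compile("^([0-9]+)(.*)")
_HEX = re.compile("^([0-9a-f]+)(.*)")

def split_charref(name: str) -> tuple:
    """Return (character, extra_data) for a charref payload from html.parser."""
    base, reg = 10, _DECIMAL
    if name[:1] in ("x", "X"):
        # Hex reference: drop the x/X marker and switch base and regex.
        name, base, reg = name[1:], 16, _HEX
    match = reg.search(name)
    if match is None:
        # No numeric prefix at all: the whole payload is ordinary data.
        return "", name
    number, extra = match.groups()
    return chr(int(number, base)), extra
```

Unlike the patched method, this sketch would raise on code points above 0x10FFFF instead of substituting REPLACEMENT CHARACTER; the test cases at the end of the patch ("xffffffffffffffffffffffbeep") exercise exactly that escape hatch.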