Script 'mail_helper' called by obssrc
Hello community,
here is the log from the commit of package python-beautifulsoup4 for
openSUSE:Factory checked in at 2025-12-29 15:15:51
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
and /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.1928 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-beautifulsoup4"
Mon Dec 29 15:15:51 2025 rev:46 rq:1324671 version:4.14.3
Changes:
--------
--- /work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes 2025-10-15 12:44:37.334994871 +0200
+++ /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.1928/python-beautifulsoup4.changes 2025-12-29 15:16:06.221284579 +0100
@@ -1,0 +2,21 @@
+Mon Dec 29 09:58:48 UTC 2025 - Markéta Machová <[email protected]>
+
+- update to 4.14.3
+ * When using one of the lxml tree builders, you can pass in
+ huge_tree=True to disable lxml's security restrictions and process
+ files that include huge text nodes. ("huge" means more than
+ 10,000,000 bytes of text in a single node). Without this, lxml may
+ silently stop processing the file after encountering a huge text
+ node.
+ * The html.parser tree builder processes numeric character entities
+ using the algorithm described in the HTML spec. If this means
+ replacing some other character with REPLACEMENT CHARACTER, it will
+ set BeautifulSoup.contains_replacement_characters.
+ * Added a general test of the html.parser tree builder's ability to
+ turn any parsing exception from html.parser into a
+ ParserRejectedMarkup exception. This makes it possible to remove
+ version-dependent tests that depended on the existence of specific
+ bugs in html.parser.
+- Add htmlparser.patch to fix behaviour with cpython interpreters
+
+-------------------------------------------------------------------
Old:
----
beautifulsoup4-4.14.2.tar.gz
New:
----
beautifulsoup4-4.14.3.tar.gz
htmlparser.patch
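For context on the changelog entry above: the numeric-character-entity handling added in 4.14.3 follows the HTML spec's "numeric character reference end state" algorithm. A minimal standalone sketch of that logic is below; the function name `resolve_ncr` and the abbreviated remap table are illustrative only, not part of the bs4 API (the real implementation is `UnicodeDammit.numeric_character_reference`, shown in the dammit.py hunk further down).

```python
# Standalone sketch of the HTML spec's "numeric character reference
# end state" algorithm, mirroring the logic bs4 4.14.3 adds to
# bs4/dammit.py. resolve_ncr is an illustrative name, not bs4 API.

# C1-control code points that the spec remaps as if they were
# Windows-1252 bytes (abbreviated; the spec table has more rows).
WINDOWS_1252_REMAP = {
    0x80: "\u20ac",  # EURO SIGN
    0x91: "\u2018",  # LEFT SINGLE QUOTATION MARK
    0x92: "\u2019",  # RIGHT SINGLE QUOTATION MARK
    0x93: "\u201c",  # LEFT DOUBLE QUOTATION MARK
    0x94: "\u201d",  # RIGHT DOUBLE QUOTATION MARK
    0x9F: "\u0178",  # LATIN CAPITAL LETTER Y WITH DIAERESIS
}

REPLACEMENT = "\N{REPLACEMENT CHARACTER}"  # U+FFFD

def resolve_ncr(numeric: int) -> tuple:
    """Return (character, replaced) for one numeric character reference."""
    # Null, negative, out-of-range, and surrogate references all
    # resolve to U+FFFD, and we record that a replacement happened.
    if numeric <= 0x00 or numeric > 0x10FFFF or 0xD800 <= numeric <= 0xDFFF:
        return REPLACEMENT, True
    # C1-control references were probably meant as Windows-1252; remap.
    if numeric in WINDOWS_1252_REMAP:
        return WINDOWS_1252_REMAP[numeric], False
    # Everything else, including noncharacters, resolves as-is.
    return chr(numeric), False
```

Note the last case: a reference to U+FFFD itself resolves "cleanly" (`replaced` is False), because nothing was substituted — it *is* the character. That is why `contains_replacement_characters` is only set when a replacement was actually performed.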
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.CcxKUl/_old 2025-12-29 15:16:07.021317440 +0100
+++ /var/tmp/diff_new_pack.CcxKUl/_new 2025-12-29 15:16:07.021317440 +0100
@@ -18,12 +18,14 @@
%{?sle15_python_module_pythons}
Name: python-beautifulsoup4
-Version: 4.14.2
+Version: 4.14.3
Release: 0
Summary:        HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
License: MIT
URL: https://www.crummy.com/software/BeautifulSoup/
Source:         https://files.pythonhosted.org/packages/source/b/beautifulsoup4/beautifulsoup4-%{version}.tar.gz
+# PATCH-FIX-UPSTREAM 55f655ffb7ef03bdd1df0f013743831fe54e3c7a Change the html.parser tree builder's code for handling numeric character references
+Patch0: htmlparser.patch
BuildRequires: %{python_module base >= 3.7}
BuildRequires: %{python_module hatchling}
BuildRequires: %{python_module pip}
++++++ beautifulsoup4-4.14.2.tar.gz -> beautifulsoup4-4.14.3.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/CHANGELOG
new/beautifulsoup4-4.14.3/CHANGELOG
--- old/beautifulsoup4-4.14.2/CHANGELOG 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/CHANGELOG 2020-02-02 01:00:00.000000000 +0100
@@ -1,3 +1,30 @@
+= 4.14.3 (20251130)
+
+* When using one of the lxml tree builders, you can pass in
+ huge_tree=True to disable lxml's security restrictions and process
+ files that include huge text nodes. ("huge" means more than
+ 10,000,000 bytes of text in a single node). Without this, lxml may
+ silently stop processing the file after encountering a huge text
+ node. [bug=2072424]
+
+* The html.parser tree builder processes numeric character entities
+ using the algorithm described in the HTML spec. If this means
+ replacing some other character with REPLACEMENT CHARACTER, it will
+ set BeautifulSoup.contains_replacement_characters. [bug=2126753]
+
+ The other tree builders rely on the underlying parser to do this
+ sort of replacement. That means that Beautiful Soup never sees the
+ original character reference, so it doesn't know whether
+ REPLACEMENT_CHARACTER was the original content; therefore
+ the html.parser tree builder will set contains_replacement_characters in
+ situations where the other tree builders won't.
+
+* Added a general test of the html.parser tree builder's ability to
+ turn any parsing exception from html.parser into a
+ ParserRejectedMarkup exception. This makes it possible to remove
+ version-dependent tests that depended on the existence of specific
+ bugs in html.parser. [bug=2121335,2121335]
+
= 4.14.2 (20250929)
* Making ResultSet inherit from MutableSequence still resulted in too many
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/PKG-INFO
new/beautifulsoup4-4.14.3/PKG-INFO
--- old/beautifulsoup4-4.14.2/PKG-INFO 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/PKG-INFO 2020-02-02 01:00:00.000000000 +0100
@@ -1,6 +1,6 @@
Metadata-Version: 2.4
Name: beautifulsoup4
-Version: 4.14.2
+Version: 4.14.3
Summary: Screen-scraping library
Project-URL: Download, https://www.crummy.com/software/BeautifulSoup/bs4/download/
Project-URL: Homepage, https://www.crummy.com/software/BeautifulSoup/bs4/
@@ -19,7 +19,7 @@
Classifier: Topic :: Text Processing :: Markup :: SGML
Classifier: Topic :: Text Processing :: Markup :: XML
Requires-Python: >=3.7.0
-Requires-Dist: soupsieve>1.2
+Requires-Dist: soupsieve>=1.6.1
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: cchardet
Requires-Dist: cchardet; extra == 'cchardet'
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/__init__.py
new/beautifulsoup4-4.14.3/bs4/__init__.py
--- old/beautifulsoup4-4.14.2/bs4/__init__.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/__init__.py 2020-02-02 01:00:00.000000000 +0100
@@ -15,7 +15,7 @@
"""
__author__ = "Leonard Richardson ([email protected])"
-__version__ = "4.14.2"
+__version__ = "4.14.3"
__copyright__ = "Copyright (c) 2004-2025 Leonard Richardson"
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/builder/_htmlparser.py
new/beautifulsoup4-4.14.3/bs4/builder/_htmlparser.py
--- old/beautifulsoup4-4.14.2/bs4/builder/_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/builder/_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
@@ -234,6 +234,7 @@
# HTMLParser. (http://bugs.python.org/issue13633) The bug has
# been fixed, but removing this code still makes some
# Beautiful Soup tests fail. This needs investigation.
+ real_name:int
if name.startswith("x"):
real_name = int(name.lstrip("x"), 16)
elif name.startswith("X"):
@@ -241,26 +242,9 @@
else:
real_name = int(name)
- data = None
- if real_name < 256:
- # HTML numeric entities are supposed to reference Unicode
- # code points, but sometimes they reference code points in
- # some other encoding (ahem, Windows-1252). E.g. “
- # instead of É for LEFT DOUBLE QUOTATION MARK. This
- # code tries to detect this situation and compensate.
- for encoding in (self.soup.original_encoding, "windows-1252"):
- if not encoding:
- continue
- try:
- data = bytearray([real_name]).decode(encoding)
- except UnicodeDecodeError:
- pass
- if not data:
- try:
- data = chr(real_name)
- except (ValueError, OverflowError):
- pass
- data = data or "\N{REPLACEMENT CHARACTER}"
+ data, replacement_added = UnicodeDammit.numeric_character_reference(real_name)
+ if replacement_added:
+ self.soup.contains_replacement_characters = True
self.handle_data(data)
def handle_entityref(self, name: str) -> None:
@@ -446,7 +430,11 @@
dammit.contains_replacement_characters,
)
- def feed(self, markup: _RawMarkup) -> None:
+ def feed(self, markup: _RawMarkup, _parser_class:type[BeautifulSoupHTMLParser] =BeautifulSoupHTMLParser) -> None:
+ """
+ :param markup: The markup to feed into the parser.
+ :param _parser_class: An HTMLParser subclass to use. This is only intended for use in unit tests.
+ """
args, kwargs = self.parser_args
# HTMLParser.feed will only handle str, but
@@ -461,7 +449,7 @@
# before calling feed(), so we can assume self.soup
# is set.
assert self.soup is not None
- parser = BeautifulSoupHTMLParser(self.soup, *args, **kwargs)
+ parser = _parser_class(self.soup, *args, **kwargs)
try:
parser.feed(markup)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/builder/_lxml.py
new/beautifulsoup4-4.14.3/bs4/builder/_lxml.py
--- old/beautifulsoup4-4.14.2/bs4/builder/_lxml.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/builder/_lxml.py 2020-02-02 01:00:00.000000000 +0100
@@ -81,6 +81,12 @@
is_xml: bool = True
+ #: Set this to true (probably by passing huge_tree=True into the
+ #: BeautifulSoup constructor) to enable the lxml feature "disable security
+ #: restrictions and support very deep trees and very long text
+ #: content".
+ huge_tree: bool
+
processing_instruction_class: Type[ProcessingInstruction]
NAME: str = "lxml-xml"
@@ -153,7 +159,7 @@
"""
if self._default_parser is not None:
return self._default_parser
- return self.DEFAULT_PARSER_CLASS(target=self, recover=True, encoding=encoding)
+ return self.DEFAULT_PARSER_CLASS(target=self, recover=True, huge_tree=self.huge_tree, encoding=encoding)
def parser_for(self, encoding: Optional[_Encoding]) -> _LXMLParser:
"""Instantiate an appropriate parser for the given encoding.
@@ -166,14 +172,15 @@
if callable(parser):
# Instantiate the parser with default arguments
- parser = parser(target=self, recover=True, encoding=encoding)
+ parser = parser(target=self, recover=True, huge_tree=self.huge_tree, encoding=encoding)
return parser
def __init__(
- self,
- parser: Optional[etree.XMLParser] = None,
- empty_element_tags: Optional[Set[str]] = None,
- **kwargs: Any,
+ self,
+ parser: Optional[etree.XMLParser] = None,
+ empty_element_tags: Optional[Set[str]] = None,
+ huge_tree: bool = False,
+ **kwargs: Any,
):
# TODO: Issue a warning if parser is present but not a
# callable, since that means there's no way to create new
@@ -189,6 +196,8 @@
if "attribute_dict_class" not in kwargs:
kwargs["attribute_dict_class"] = XMLAttributeDict
+ self.huge_tree = huge_tree
+
super(LXMLTreeBuilderForXML, self).__init__(**kwargs)
def _getNsTag(self, tag: str) -> Tuple[Optional[str], str]:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/dammit.py
new/beautifulsoup4-4.14.3/bs4/dammit.py
--- old/beautifulsoup4-4.14.2/bs4/dammit.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/dammit.py 2020-02-02 01:00:00.000000000 +0100
@@ -1192,7 +1192,9 @@
}
#: A map used when removing rogue Windows-1252/ISO-8859-1
- #: characters in otherwise UTF-8 documents.
+ #: characters in otherwise UTF-8 documents. Also used when a
+ #: numeric character entity has been incorrectly encoded using the
+ #: character's Windows-1252 encoding.
#:
#: Note that \\x81, \\x8d, \\x8f, \\x90, and \\x9d are undefined in
#: Windows-1252.
@@ -1321,8 +1323,38 @@
0xFC: b"\xc3\xbc", # ü
0xFD: b"\xc3\xbd", # ý
0xFE: b"\xc3\xbe", # þ
+ 0xFF: b"\xc3\xbf", # ÿ
}
+ #: :meta private:
+ # Note that this isn't all Unicode noncharacters, just the noncontiguous ones that need to be listed.
+ #
+ # "A noncharacter is a code point that is in the range
+ # U+FDD0 to U+FDEF, inclusive, or U+FFFE, U+FFFF, U+1FFFE,
+ # U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE,
+ # U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE,
+ # U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
+ # U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE,
+ # U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE,
+ # or U+10FFFF."
+ ENUMERATED_NONCHARACTERS: Set[int] = set([0xfffe, 0xffff,
+ 0x1fffe, 0x1ffff,
+ 0x2fffe, 0x2ffff,
+ 0x3fffe, 0x3ffff,
+ 0x4fffe, 0x4ffff,
+ 0x5fffe, 0x5ffff,
+ 0x6fffe, 0x6ffff,
+ 0x7fffe, 0x7ffff,
+ 0x8fffe, 0x8ffff,
+ 0x9fffe, 0x9ffff,
+ 0xafffe, 0xaffff,
+ 0xbfffe, 0xbffff,
+ 0xcfffe, 0xcffff,
+ 0xdfffe, 0xdffff,
+ 0xefffe, 0xeffff,
+ 0xffffe, 0xfffff,
+ 0x10fffe, 0x10ffff])
+
#: :meta private:
MULTIBYTE_MARKERS_AND_SIZES: List[Tuple[int, int, int]] = [
(0xC2, 0xDF, 2), # 2-byte characters start with a byte C2-DF
@@ -1337,6 +1369,82 @@
LAST_MULTIBYTE_MARKER: int = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
@classmethod
+ def numeric_character_reference(cls, numeric:int) -> Tuple[str, bool]:
+ """This (mostly) implements the algorithm described in "Numeric character
+ reference end state" from the HTML spec:
+ https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
+
+ The algorithm is designed to convert numeric character references like "&#9731;"
+ to Unicode characters like "☃".
+
+ :return: A 2-tuple (character, replaced). `character` is the Unicode
+ character corresponding to the numeric reference and `replaced` is
+ whether or not an unresolvable character was replaced with REPLACEMENT
+ CHARACTER.
+ """
+ replacement = "\ufffd"
+
+ if numeric == 0x00:
+ # "If the number is 0x00, then this is a
+ # null-character-reference parse error. Set the character
+ # reference code to 0xFFFD."
+ return replacement, True
+
+ if numeric > 0x10ffff:
+ # "If the number is greater than 0x10FFFF, then this is a
+ # character-reference-outside-unicode-range parse
+ # error. Set the character reference code to 0xFFFD."
+ return replacement, True
+
+ if numeric >= 0xd800 and numeric <= 0xdfff:
+ # "If the number is a surrogate, then this is a
+ # surrogate-character-reference parse error. Set the
+ # character reference code to 0xFFFD."
+ return replacement, True
+
+ if (numeric >= 0xfdd0 and numeric <= 0xfdef) or numeric in cls.ENUMERATED_NONCHARACTERS:
+ # "If the number is a noncharacter, then this is a
+ # noncharacter-character-reference parse error."
+ #
+ # "The parser resolves such character references as-is."
+ #
+ # I'm not sure what "as-is" means but I think it means that we act
+ # like there was no error condition.
+ return chr(numeric), False
+
+ # "If the number is 0x0D, or a control that's not ASCII whitespace,
+ # then this is a control-character-reference parse error."
+ #
+ # "A control is a C0 control or a code point in the range
+ # U+007F DELETE to U+009F APPLICATION PROGRAM COMMAND,
+ # inclusive."
+ #
+ # "A C0 control is a code point in the range U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive."
+ #
+ # "The parser resolves such character references as-is except C1 control references that are replaced."
+
+ # First, let's replace the control references that can be replaced.
+ if numeric >= 0x80 and numeric <= 0x9f and numeric in cls.WINDOWS_1252_TO_UTF8:
+ # "If the number is one of the numbers in the first column of the
+ # following table, then find the row with that number in the first
+ # column, and set the character reference code to the number in the
+ # second column of that row."
+ #
+ # This is an attempt to catch characters that were encoded to numeric
+ # entities using their Windows-1252 encodings rather than their UTF-8
+ # encodings.
+ return cls.WINDOWS_1252_TO_UTF8[numeric].decode("utf8"), False
+
+ # Now all that's left are references that should be resolved as-is. This
+ # is also the default path for non-weird character references.
+ try:
+ return chr(numeric), False
+ except (ValueError, OverflowError):
+ # This shouldn't happen, since these cases should have been handled
+ # above, but if it does, return REPLACEMENT CHARACTER
+ return replacement, True
+
+ @classmethod
def detwingle(
cls,
in_bytes: bytes,
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/__init__.py
new/beautifulsoup4-4.14.3/bs4/tests/__init__.py
--- old/beautifulsoup4-4.14.2/bs4/tests/__init__.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/__init__.py 2020-02-02 01:00:00.000000000 +0100
@@ -1125,6 +1125,15 @@
soup = self.soup(BAD_DOCUMENT)
self.linkage_validator(soup)
+ @pytest.mark.skipif("sys.version_info < (3, 8)")
+ def test_surrogate_in_character_reference(self):
+ # These character references are invalid and should be replaced with REPLACEMENT CHARACTER.
+ soup = self.soup("<html><body>��</body></html>")
+ assert soup.body.contents == ['��']
+
+ # Only with html.parser (q.v.) do we actually see the invalid character references on the way in. Since
+ # other builders don't see them we don't know the REPLACEMENT CHARACTER wasn't originally there.
+ assert soup.contains_replacement_characters == False
class XMLTreeBuilderSmokeTest(TreeBuilderSmokeTest):
def test_pickle_and_unpickle_identity(self):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/beautifulsoup4-4.14.2/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase
new/beautifulsoup4-4.14.3/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase
--- old/beautifulsoup4-4.14.2/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/fuzz/clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912.testcase 1970-01-01 01:00:00.000000000 +0100
@@ -1,2 +0,0 @@
-
-<![
\ No newline at end of file
Binary files
old/beautifulsoup4-4.14.2/bs4/tests/fuzz/crash-ffbdfa8a2b26f13537b68d3794b0478a4090ee4a.testcase
and
new/beautifulsoup4-4.14.3/bs4/tests/fuzz/crash-ffbdfa8a2b26f13537b68d3794b0478a4090ee4a.testcase
differ
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_dammit.py
new/beautifulsoup4-4.14.3/bs4/tests/test_dammit.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_dammit.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_dammit.py 2020-02-02 01:00:00.000000000 +0100
@@ -442,3 +442,71 @@
def test_ambiguous_ampersands_escaped(self, markup, expect):
assert self.sub.substitute_html(markup) == expect
assert self.sub.substitute_html5_raw(markup) == expect
+
+class TestNumericCharacterReference:
+
+ # Numeric character references that trigger replacement with REPLACEMENT CHARACTER.
+ @pytest.mark.parametrize(
+ "numeric", [0x00, # Null
+ 0x11ffff, # Out of range
+ -0x11, # Out of range
+ 0xd800, # Surrogate
+ 0xda12, # Surrogate
+ 0xdfff, # Surrogate
+ ]
+ )
+ def test_bad_numeric_entities(self, numeric):
+ assert UnicodeDammit.numeric_character_reference(numeric) == ("\N{REPLACEMENT CHARACTER}", True)
+
+ # Numeric character references that are treated as having been mistakenly encoded from Windows-1252.
+ @pytest.mark.parametrize(
+ "numeric,character", [
+ (0x80, "€"),
+ (0x8C, "Œ"),
+ (0x91, "‘"),
+ (0x92, "’"),
+ (0x93, "“"),
+ (0x94, "”"),
+ (0x97, "—"),
+ (0x9F, "Ÿ"),
+ ]
+ )
+ def test_windows_1252_numeric_entities(self, numeric, character):
+ assert UnicodeDammit.numeric_character_reference(numeric) == (character, False)
+
+ # Numeric references that are resolved to the corresponding Unicode characters, even though
+ # some of them also trigger parser error conditions.
+ @pytest.mark.parametrize(
+ "numeric,character", [
+ # Noncharacters
+ (0xfdd0, "�"),
+ (0xfdd9, "�"),
+ (0xfdef, "�"),
+ (0x2fffe, "�"),
+ (0xaffff, "�"),
+ (0x10ffff, "�"),
+
+ # Characters that are the same in Windows-1252 and UTF-8.
+ (0xFE, "þ"),
+ (0xFF, "ÿ"),
+
+ # ASCII whitespace
+ (0x09, "\t"),
+ (0x0a, "\n"),
+ (0x0c, "\x0c"),
+ (0x0d, "\r"),
+ (0x20, " "),
+
+ # The numeric entity for REPLACEMENT CHARACTER is converted
+ # correctly but the boolean part of the return value is false,
+ # because REPLACEMENT CHARACTER wasn't used to replace some other
+ # character--it _is_ the other character.
+ (0xfffd, "�"),
+
+ # Miscellaneous spot checks.
+ (0x61, "a"),
+ (0x2603, "☃"),
+ ]
+ )
+ def test_normal_numeric_entities(self, numeric, character):
+ assert UnicodeDammit.numeric_character_reference(numeric) == (character, False)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_fuzz.py
new/beautifulsoup4-4.14.3/bs4/tests/test_fuzz.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_fuzz.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_fuzz.py 2020-02-02 01:00:00.000000000 +0100
@@ -66,23 +66,6 @@
return
soup.prettify()
- # This class of error has been fixed by catching a less helpful
- # exception from html.parser and raising ParserRejectedMarkup
- # instead.
- @pytest.mark.parametrize(
- "filename",
- [
- "clusterfuzz-testcase-minimized-bs4_fuzzer-5703933063462912",
- "crash-ffbdfa8a2b26f13537b68d3794b0478a4090ee4a",
- ],
- )
- # Fixed in https://github.com/python/cpython/issues/77057
- @pytest.mark.skipif("sys.version_info >= (3, 13)")
- def test_rejected_markup(self, filename):
- markup = self.__markup(filename)
- with pytest.raises(ParserRejectedMarkup):
- BeautifulSoup(markup, "html.parser")
-
# This class of error has to do with very deeply nested documents
# which overflow the Python call stack when the tree is converted
# to a string. This is an issue with Beautiful Soup which was fixed
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_htmlparser.py
new/beautifulsoup4-4.14.3/bs4/tests/test_htmlparser.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_htmlparser.py 2020-02-02 01:00:00.000000000 +0100
@@ -8,6 +8,7 @@
BeautifulSoupHTMLParser,
HTMLParserTreeBuilder,
)
+from bs4 import BeautifulSoup
from bs4.exceptions import ParserRejectedMarkup
from typing import Any
from . import HTMLTreeBuilderSmokeTest
@@ -16,28 +17,19 @@
class TestHTMLParserTreeBuilder(HTMLTreeBuilderSmokeTest):
default_builder = HTMLParserTreeBuilder
- # Fixed in https://github.com/python/cpython/issues/77057
- @pytest.mark.skipif("sys.version_info >= (3, 13)")
- def test_rejected_input(self):
- # Python's html.parser will occasionally reject markup,
- # especially when there is a problem with the initial DOCTYPE
- # declaration. Different versions of Python sound the alarm in
- # different ways, but Beautiful Soup consistently raises
- # errors as ParserRejectedMarkup exceptions.
- bad_markup = [
- # https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=28873
- # https://github.com/guidovranken/python-library-fuzzers/blob/master/corp-html/519e5b4269a01185a0d5e76295251921da2f0700
- # https://github.com/python/cpython/issues/81928
- b"\n<![\xff\xfe\xfe\xcd\x00",
- # https://github.com/guidovranken/python-library-fuzzers/blob/master/corp-html/de32aa55785be29bbc72a1a8e06b00611fb3d9f8
- # https://github.com/python/cpython/issues/78661
- #
- b"<![n\x00",
- b"<![UNKNOWN[]]>",
- ]
- for markup in bad_markup:
- with pytest.raises(ParserRejectedMarkup):
- self.soup(markup)
+ def test_feed_raises_correct_exception_on_rejected_input(self):
+ # Mock BeautifulSoupHTMLParser so it raises an AssertionError and verify that this is
+ # turned into a ParserRejectedMarkup.
+ #
+ # This replaces a test that relied on bugs in html.parser which have been fixed.
+ class Mock(BeautifulSoupHTMLParser):
+ def feed(self, markup):
+ raise AssertionError("all markup is bad!")
+
+ with pytest.raises(ParserRejectedMarkup):
+ builder = HTMLParserTreeBuilder()
+ builder.soup = BeautifulSoup()
+ builder.feed("any markup", Mock)
def test_namespaced_system_doctype(self):
# html.parser can't handle namespaced doctypes, so skip this one.
@@ -48,7 +40,7 @@
pass
def test_builder_is_pickled(self):
- """Unlike most tree builders, HTMLParserTreeBuilder and will
+ """Unlike most tree builders, HTMLParserTreeBuilder can be pickled and will
be restored after pickling.
"""
tree = self.soup("<a><b>foo</a>")
@@ -161,3 +153,12 @@
markup = "<p>a &nosuchentity; b</p>"
soup = self.soup(markup)
assert "<p>a &nosuchentity b</p>" == soup.p.decode()
+
+ def test_surrogate_in_character_reference(self):
+ # These character references are invalid and should be replaced with REPLACEMENT CHARACTER.
+ soup = self.soup("<html><body>��</body></html>")
+ assert soup.body.contents == ['��']
+
+ # Since we do the replacement ourselves, we can set contains_replacement_characters appropriately.
+ # lxml and html5lib do the replacement so all we ever see is REPLACEMENT CHARACTER.
+ assert soup.contains_replacement_characters == True
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/bs4/tests/test_lxml.py
new/beautifulsoup4-4.14.3/bs4/tests/test_lxml.py
--- old/beautifulsoup4-4.14.2/bs4/tests/test_lxml.py 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/bs4/tests/test_lxml.py 2020-02-02 01:00:00.000000000 +0100
@@ -199,3 +199,31 @@
assert "some markup" == unpickled.a.string
assert unpickled.builder != soup.builder
assert isinstance(unpickled.builder, self.default_builder)
+
+ def test_huge_tree(self):
+ # Verify that a tree with very large text nodes can be completely parsed
+ # if huge_tree=True.
+ def doc(size):
+ points = 'A'*size
+ input_svg = f'''<?xml version="1.0" encoding="utf-8"?>
+<svg xmlns="http://www.w3.org/2000/svg">
+ <g id="one"/>
+ <g id="two">
+ <polygon points="{points}" id="p1"/>
+ </g>
+ <g id="three"/>
+ <g id="four">
+ <polygon points="{points}" id="p2"/>
+ <polygon id="p3" points="AAA"/>
+ </g>
+ <g id="five"/>
+</svg>
+'''
+ return input_svg
+
+ d = doc(10000000) # libxml2 XML_MAX_TEXT_LENGTH
+ soup = self.soup(d, huge_tree=True)
+
+ # This would be 4 with huge_tree=False, but there's no need to
+ # lock in a test for that, since it's undesirable behavior.
+ assert len(soup.find_all("g")) == 5
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/doc/index.rst
new/beautifulsoup4-4.14.3/doc/index.rst
--- old/beautifulsoup4-4.14.2/doc/index.rst 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/doc/index.rst 2020-02-02 01:00:00.000000000 +0100
@@ -16,7 +16,7 @@
how to use it, how to make it do what you want, and what to do when it
violates your expectations.
-This document covers Beautiful Soup version 4.14.2. The examples in
+This document covers Beautiful Soup version 4.14.3. The examples in
this documentation were written for Python 3.8.
You might be looking for the documentation for `Beautiful Soup 3
@@ -891,8 +891,8 @@
gives you the entire ancestry of an element, including the element
itself::
- for parent in link.parents:
- print(parent.name)
+ for tag in link.self_and_parents:
+ print(tag.name)
# link
# p
# body
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/pyproject.toml
new/beautifulsoup4-4.14.3/pyproject.toml
--- old/beautifulsoup4-4.14.2/pyproject.toml 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/pyproject.toml 2020-02-02 01:00:00.000000000 +0100
@@ -31,7 +31,7 @@
]
dependencies = [
"typing-extensions>=4.0.0",
- "soupsieve >1.2",
+ "soupsieve>=1.6.1",
]
[project.optional-dependencies]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/beautifulsoup4-4.14.2/tox.ini
new/beautifulsoup4-4.14.3/tox.ini
--- old/beautifulsoup4-4.14.2/tox.ini 2020-02-02 01:00:00.000000000 +0100
+++ new/beautifulsoup4-4.14.3/tox.ini 2020-02-02 01:00:00.000000000 +0100
@@ -19,7 +19,7 @@
commands = pytest {tty:--color=yes} {posargs}
[testenv:docs]
-basepython = python3
+basepython = python3.13
description = build the documentation
deps = sphinx
commands = sphinx-build -b html doc doc/build/html
++++++ htmlparser.patch ++++++
>From 55f655ffb7ef03bdd1df0f013743831fe54e3c7a Mon Sep 17 00:00:00 2001
From: Leonard Richardson <[email protected]>
Date: Mon, 8 Dec 2025 19:34:16 -0500
Subject: * Change the html.parser tree builder's code for handling numeric
character references, to avoid a crash when using Python versions that
include the fix to Python issue https://bugs.python.org/issue13633 (e.g.
Python 3.11.13). [bug=2134393]
---
CHANGELOG | 5 +++
bs4/builder/_htmlparser.py | 78 ++++++++++++++++++++++++++++++++++++--------
bs4/tests/test_htmlparser.py | 17 ++++++++++
3 files changed, 86 insertions(+), 14 deletions(-)
diff --git a/bs4/builder/_htmlparser.py b/bs4/builder/_htmlparser.py
index 165a3d8..ead800f 100644
--- a/bs4/builder/_htmlparser.py
+++ b/bs4/builder/_htmlparser.py
@@ -10,6 +10,7 @@ __all__ = [
]
from html.parser import HTMLParser
+import re
from typing import (
Any,
@@ -223,6 +224,64 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
"""Handle some textual data that shows up between tags."""
self.soup.handle_data(data)
+ _DECIMAL_REFERENCE_WITH_FOLLOWING_DATA = re.compile("^([0-9]+)(.*)")
+ _HEX_REFERENCE_WITH_FOLLOWING_DATA = re.compile("^([0-9a-f]+)(.*)")
+
+ @classmethod
+ def _dereference_numeric_character_reference(cls, name:str) -> Tuple[str, bool, str]:
+ """Convert a numeric character reference into an actual character.
+
+ :param name: The number of the character reference, as
+ obtained by html.parser
+
+ :return: A 3-tuple (dereferenced, replacement_added,
+ extra_data). `dereferenced` is the dereferenced character
+ reference, or the empty string if there was no
+ reference. `replacement_added` is True if the reference
+ could only be dereferenced by replacing content with U+FFFD
+ REPLACEMENT CHARACTER. `extra_data` is a portion of data
+ following the character reference, which was deemed to be
+ normal data and not part of the reference at all.
+ """
+ dereferenced:str = ""
+ replacement_added:bool = False
+ extra_data:str = ""
+
+ base:int = 10
+ reg = cls._DECIMAL_REFERENCE_WITH_FOLLOWING_DATA
+ if name.startswith("x") or name.startswith("X"):
+ # Hex reference
+ name = name[1:]
+ base = 16
+ reg = cls._HEX_REFERENCE_WITH_FOLLOWING_DATA
+
+ real_name:Optional[int] = None
+ try:
+ real_name = int(name, base)
+ except ValueError:
+ # This is either bad data that starts with what looks like
+ # a numeric character reference, or a real numeric
+ # reference that wasn't terminated by a semicolon.
+ #
+ # The fix to https://bugs.python.org/issue13633 made it
+ # our responsibility to handle the extra data.
+ #
+ # To preserve the old behavior, we extract the numeric
+ # portion of the incoming "reference" and treat that as a
+ # numeric reference. All subsequent data will be processed
+ # as string data.
+ match = reg.search(name)
+ if match is not None:
+ real_name = int(match.groups()[0], base)
+ extra_data = match.groups()[1]
+
+ if real_name is None:
+ dereferenced = ""
+ extra_data = name
+ else:
+ dereferenced, replacement_added = UnicodeDammit.numeric_character_reference(real_name)
+ return dereferenced, replacement_added, extra_data
+
def handle_charref(self, name: str) -> None:
"""Handle a numeric character reference by converting it to the
corresponding Unicode character and treating it as textual
@@ -230,22 +289,13 @@ class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
:param name: Character number, possibly in hexadecimal.
"""
- # TODO: This was originally a workaround for a bug in
- # HTMLParser. (http://bugs.python.org/issue13633) The bug has
- # been fixed, but removing this code still makes some
- # Beautiful Soup tests fail. This needs investigation.
- real_name:int
- if name.startswith("x"):
- real_name = int(name.lstrip("x"), 16)
- elif name.startswith("X"):
- real_name = int(name.lstrip("X"), 16)
- else:
- real_name = int(name)
-
- data, replacement_added = UnicodeDammit.numeric_character_reference(real_name)
+ dereferenced, replacement_added, extra_data = self._dereference_numeric_character_reference(name)
if replacement_added:
self.soup.contains_replacement_characters = True
- self.handle_data(data)
+ if dereferenced is not None:
+ self.handle_data(dereferenced)
+ if extra_data is not None:
+ self.handle_data(extra_data)
def handle_entityref(self, name: str) -> None:
"""Handle a named entity reference by converting it to the
diff --git a/bs4/tests/test_htmlparser.py b/bs4/tests/test_htmlparser.py
index 0086a9d..cb85b53 100644
--- a/bs4/tests/test_htmlparser.py
+++ b/bs4/tests/test_htmlparser.py
@@ -162,3 +162,20 @@ class TestHTMLParserTreeBuilder(HTMLTreeBuilderSmokeTest):
# Since we do the replacement ourselves, we can set contains_replacement_characters appropriately.
# lxml and html5lib do the replacement so all we ever see is REPLACEMENT CHARACTER.
assert soup.contains_replacement_characters == True
+
+class TestBeautifulSoupHTMLParser:
+ def test_dereference_numeric_character_reference(self):
+ m = BeautifulSoupHTMLParser._dereference_numeric_character_reference
+ assert m("64") == ("@", False, "")
+ assert m("x64") == ("d", False, "")
+ assert m("X64") == ("d", False, "")
+ assert m("64andsomeextra") == ("@", False, "andsomeextra")
+ assert m("") == ("", False, "")
+ assert m("00whee") == ("�", True, "whee")
+ assert m("xfffdthatsit") == ("�", False, "thatsit")
+ assert m("xabcdplussomeextra") == ("ꯍ", False, "plussomeextra")
+ assert m("obviouslynotnumeric") == ("", False, "obviouslynotnumeric")
+
+ # These are almost certainly wrong but at least it doesn't crash.
+ assert m("xabcdandsomeextra") == ("\U000abcda", False, "ndsomeextra")
+ assert m("xffffffffffffffffffffffbeep") == ("�", True, "p")
--
cgit v1.2.3
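The core of htmlparser.patch is the splitting step: once the fix to Python issue13633 lands, html.parser may hand handle_charref a payload like "64andsomeextra", so the tree builder must peel off the numeric prefix and emit the remainder as ordinary data. A standalone sketch of that idea, under the assumption of well-behaved code points (the real method additionally routes the number through UnicodeDammit.numeric_character_reference, which also covers out-of-range values; `split_charref` is an illustrative name, not bs4 API):

```python
import re

# Sketch of the regex-splitting idea from htmlparser.patch: extract the
# leading decimal or hex digits of a charref payload and treat whatever
# follows as plain text data. split_charref is an illustrative name.
_DECIMAL = re.compile("^([0-9]+)(.*)")
_HEX = re.compile("^([0-9a-f]+)(.*)")

def split_charref(name: str) -> tuple:
    """Return (character, extra_data) for a charref payload from html.parser."""
    base, reg = 10, _DECIMAL
    if name[:1] in ("x", "X"):
        # Hex reference: drop the x/X marker and switch base and regex.
        name, base, reg = name[1:], 16, _HEX
    match = reg.search(name)
    if match is None:
        # No numeric prefix at all: the whole payload is ordinary data.
        return "", name
    number, extra = match.groups()
    return chr(int(number, base)), extra
```

Unlike the patched method, this sketch would raise on code points above 0x10FFFF instead of substituting REPLACEMENT CHARACTER; the test cases at the end of the patch ("xffffffffffffffffffffffbeep") exercise exactly that escape hatch.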