[issue43560] Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks

Larry Trammell Fri, 19 Mar 2021 11:39:35 -0700


New submission from Larry Trammell <ridge...@nwi.net>:


Issue 43483 was posted as a "bug" but retracted.  Though the problem is real, 
it is tricky to declare an UNSPECIFIED behavior to be a bug.  See that issue 
page for more discussion and a test case.  A brief overview is repeated here.

SCENARIO - XML PARSING LOSES DATA (or not)

The parsing attempts to capture text consisting of very short strings, each 
enclosed in its own pair of tags. A typical content line reads something like 
this: 

   <p>Colchuck</p>

The parser implements a scheme presented at various tutorial Web sites, using 
two member functions. 

   # Note the name of the most recently opened tag
   # (startElement is the callback name that xml.sax actually invokes)
   def startElement(self, tagname, attrs):
       self.CurrentTag = tagname

   # Record the content from each "p" tag when encountered
   def characters(self, content):
       if self.CurrentTag == "p":
           self.name = content

   ...

   > print(parser.name)
   "Colchuck" 

But then, after successfully extracting content from perhaps hundreds of 
thousands of XML tag sets in this way, the parsing suddenly "drops" a few 
characters of content. 

   > print(parser.name)
   "lchuck" 

While this problem was observed with a SAX parser, it can affect expat parsers 
as well.  It affects 32-bit and 64-bit implementations the same, over several 
major releases of the Python 3 system.  
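
For the expat case, the same pattern can be sketched directly against 
xml.parsers.expat.  This is only an illustration: the function name 
collect_p_text and the sample input are made up for the example.  The 
buffer_text attribute is a documented pyexpat option that asks the parser to 
avoid multiple CharacterDataHandler calls "whenever possible"; as far as I can 
tell it reduces splitting rather than ruling it out, so the handler below still 
accumulates pieces until the end tag.

```python
import xml.parsers.expat


def collect_p_text(pieces):
    """Feed XML text piece by piece; return the full text of each <p> element."""
    texts = []
    parts = None  # None means "not currently inside a <p> element"

    def start(name, attrs):
        nonlocal parts
        if name == "p":
            parts = []

    def chardata(data):
        # May be called more than once for one run of text.
        if parts is not None:
            parts.append(data)

    def end(name):
        nonlocal parts
        if name == "p" and parts is not None:
            texts.append("".join(parts))
            parts = None

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # ask pyexpat to coalesce text where it can
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chardata
    parser.EndElementHandler = end
    for piece in pieces:
        parser.Parse(piece, False)
    parser.Parse("", True)  # signal end of input
    return texts


if __name__ == "__main__":
    # The input is deliberately split in the middle of the text content.
    print(collect_p_text(["<r><p>Co", "lchuck</p></r>"]))
```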

SPECIFIED BEHAVIOR (or not) 

The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python 
Standard Library (and many prior versions) states:

-----------
ContentHandler.characters(content) -- The Parser will call this method to 
report each chunk of character data.  SAX parsers may return all contiguous 
character data in a single chunk, or they may split it into several chunks...
-----------

If it happens that the content is delivered in two chunks instead of one, the 
characters() method shown above overwrites the first part of the text with the 
second part, and some content seems lost.  This completely explains the 
observed behavior.  
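
For comparison, here is a minimal sketch of a handler that tolerates split 
chunks under the documented contract: it accumulates every characters() call 
and only records the text when the end tag arrives.  The class name 
NameCollector, the helper collect_names, and the sample input are assumptions 
made for this example, not part of the original test case.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Collect the full text of every <p> element, however it is chunked."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._parts = None  # None means "not currently inside a <p> element"

    def startElement(self, tagname, attrs):
        if tagname == "p":
            self._parts = []

    def characters(self, content):
        # Append rather than overwrite, so a second chunk cannot
        # replace the first one.
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, tagname):
        if tagname == "p" and self._parts is not None:
            self.names.append("".join(self._parts))
            self._parts = None


def collect_names(pieces):
    """Feed the document piece by piece and return the collected names."""
    handler = NameCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return handler.names


if __name__ == "__main__":
    # Split the input inside the text content on purpose.
    print(collect_names(["<r><p>Co", "lchuck</p><p>Stuart</p></r>"]))
```

This costs an extra endElement() method and a list per element, which is the 
overhead the tutorial-style handler avoids at the price of occasional data 
loss.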

EXPECTED BEHAVIOR (or not)

Even though the behavior is unspecified, users can have certain expectations 
about what a reasonable parser should do.  Among these:

  -- EFFICIENCY: the parser should do simple things simply, and complicated 
things as simply as possible
  -- CONSISTENCY: the parser behavior should be repeatable and dependable

The design can be considered "poor" if thorough testing cannot identify what 
the actual behaviors are going to be, because those behaviors are rare and 
unpredictable.

The obvious "simple thing," from the user perspective, is that the parser 
should return each tiny text string as one tiny text chunk.  In fact, this is 
precisely what it does... 99.999% of the time.  But then, suddenly, it doesn't. 
 

One hypothesis is that when the parsing scan of raw input text reaches the end 
of a large internal text buffer, it is easier from the implementer's 
perspective to flush any text remaining in the old buffer prior to fetching a 
new one, even if that produces a fragmented chunk with only a couple of 
characters.  
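
One way to probe this hypothesis from the outside is to feed a document to the 
incremental parser in two pieces and log what characters() receives for every 
possible split point.  This is a rough sketch only: ChunkLogger and 
observe_chunks are names invented here, the feed() boundary merely stands in 
for the internal buffer boundary, and whether a given split actually produces 
two chunks may depend on the Python and expat versions in use.

```python
import xml.sax


class ChunkLogger(xml.sax.ContentHandler):
    """Record every chunk exactly as the parser delivers it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def observe_chunks(document, split_at):
    """Parse `document` in two feed() calls, split at index `split_at`."""
    handler = ChunkLogger()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.feed(document[:split_at])
    parser.feed(document[split_at:])
    parser.close()
    return handler.chunks


if __name__ == "__main__":
    doc = "<r><p>Colchuck</p></r>"
    for split_at in range(1, len(doc)):
        # Each line shows the split position and the chunks delivered.
        print(split_at, observe_chunks(doc, split_at))
```

Whatever the chunking turns out to be, joining the chunks should always give 
back the original text; only the number of pieces is unspecified.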

IMPROVEMENTS REQUIRED

Review the code to determine whether the text buffer scenario is in fact the 
primary cause of inconsistent behavior. Modify the data handling to defer 
delivery of content fragments that are small, carrying over a small amount of 
previously scanned text so that small contiguous text chunks are recombined 
rather than reported as multiple fragments. If the length of the content text 
to carry over is greater than some configurable 
xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it 
as a fragment.  
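
The deferral rule described above can be sketched at the handler level, 
without touching the parser itself.  Everything named here is hypothetical: 
ChunkCoalescer, Recorder, and the contiguous_chunk_length parameter are 
inventions for this example, and xml.sax.handler.ContiguousChunkLength does not 
exist in the standard library.  The wrapper holds back character data until 
some other ContentHandler event arrives, or until the held text exceeds the 
limit, and then passes it downstream as one chunk.

```python
import xml.sax


class ChunkCoalescer:
    """Wrap a ContentHandler and recombine small contiguous text chunks."""

    def __init__(self, downstream, contiguous_chunk_length=1024):
        self._downstream = downstream
        self._limit = contiguous_chunk_length  # hypothetical setting
        self._pending = []
        self._pending_len = 0

    def characters(self, content):
        # Defer delivery; give up and deliver a fragment only when the
        # carried-over text grows beyond the configured limit.
        self._pending.append(content)
        self._pending_len += len(content)
        if self._pending_len > self._limit:
            self._flush()

    def _flush(self):
        if self._pending:
            self._downstream.characters("".join(self._pending))
            self._pending = []
            self._pending_len = 0

    def __getattr__(self, name):
        # Reached only for names not defined on this class, i.e. every
        # ContentHandler event other than characters().
        if name.startswith("_"):
            raise AttributeError(name)
        target = getattr(self._downstream, name)
        if not callable(target):
            return target

        def forward(*args, **kwargs):
            self._flush()  # any other event ends the contiguous text run
            return target(*args, **kwargs)

        return forward


class Recorder(xml.sax.ContentHandler):
    """Downstream handler that just records the chunks it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_pieces(pieces, limit=1024):
    """Parse the pieces through a ChunkCoalescer; return downstream chunks."""
    recorder = Recorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(ChunkCoalescer(recorder, limit))
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return recorder.chunks


if __name__ == "__main__":
    print(parse_pieces(["<r><p>Co", "lchuck</p></r>"]))
```

A change inside the parser could presumably avoid the extra call layer, but 
the observable rule would be the same: short runs arrive whole, and only runs 
longer than the limit may arrive as fragments.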

DOCUMENTING THE IMPROVEMENTS 

Strictly speaking:  none required.  Undefined behaviors are undefined, whether 
consistent or otherwise.  But after the improvements are implemented, it would 
be helpful to modify documentation to expose the new performance guarantees, 
making users more aware of the possible hazards.  For example, a new 
description in the "xml.sax.handler" page might read as follows: 

-----------
ContentHandler.characters(content) -- The Parser will call this method to 
report chunks of character data.  In general, character data may be reported as 
a single chunk or as a sequence of chunks; but character data sequences with 
fewer than xml.sax.handler.ContiguousChunkLength characters, when 
uninterrupted by any other xml.sax.handler.ContentHandler event, are guaranteed 
to be delivered as a single chunk...  
-----------

----------
components: XML
messages: 389108
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Modify SAX/expat parsing to avoid fragmentation of already-tiny content 
chunks
type: enhancement
versions: Python 3.7, Python 3.8, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43560>
_______________________________________
