[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Ilya Basin (Jira) Sat, 14 Feb 2026 00:33:12 -0800


    [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058573#comment-18058573
 ]


Ilya Basin commented on XALANJ-2419:
------------------------------------

Apparently version 2.7.3 on Maven Central was built from the git tag 
xalan-j_2_7_3-rc10. I built this tag locally with OpenJDK 8. The build printed 
warnings such as "ClassNotFoundException: com.sun.image.codec.jpeg.JPEGCodec", 
but the jars were produced. I compared the class files and they are identical. 
The fixing commit 9e67d121c547dc82157b7852fd7b23eb9f260071 also merges cleanly, 
and my XML output is now correct.

To bump the version to 2.7.3.1 I had to edit build.xml:
{code:java}
-  <property name="version"                value="${version.VERSION}_${version.RELEASE}_${version.DEVELOPER}${version.MINOR}"/><!-- GUMP: version # of dist file -->
-  <property name="impl.version"           value="${version.VERSION}.${version.RELEASE}.${version.DEVELOPER}${version.MINOR}"/><!-- Used in jar task for filtering MANIFEST.MF file -->
+  <property name="version.MINOR2"         value=".1"/>
+  <property name="version"                value="${version.VERSION}_${version.RELEASE}_${version.DEVELOPER}${version.MINOR}${version.MINOR2}"/><!-- GUMP: version # of dist file -->
+  <property name="impl.version"           value="${version.VERSION}.${version.RELEASE}.${version.DEVELOPER}${version.MINOR}${version.MINOR2}"/><!-- Used in jar task for filtering MANIFEST.MF file -->
{code}

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Assignee: Joe Kesselman
>            Priority: Major
>             Fix For: The Latest Development Code
>
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
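
To illustrate the behavior the quoted description asks for, here is a minimal 
sketch of a fallback that reads whole code points instead of single UTF-16 
units. It is not the code from the fixing commit or from the attached patches. 
The class SurrogateAwareEscaper and the method escapeUnencodable are 
hypothetical names, and the choice to throw on an unpaired surrogate is an 
assumption. The sketch only covers the encoding fallback, not markup escaping 
of characters such as '<' or '&'.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration only; not part of Xalan.
final class SurrogateAwareEscaper {

    // Copies characters the target charset can encode and writes one
    // numeric character reference (NCR) per code point for the rest.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // codePointAt combines a valid high/low surrogate pair into
            // a single scalar value such as 0x1F600.
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            if (length == 1 && Character.isSurrogate((char) codePoint)) {
                // Assumption: an unpaired surrogate is rejected, because
                // it can never be written as a valid NCR.
                throw new IllegalArgumentException(
                        "Unpaired surrogate at index " + i);
            }
            String unit = text.substring(i, i + length);
            if (encoder.canEncode(unit)) {
                out.append(unit);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

With US-ASCII as the target, the string "a\uD83D\uDE00" (U+1F600) becomes 
"a&#128512;", a single NCR for the code point. Escaping each UTF-16 unit on 
its own would instead give "&#55357;&#56832;", which is the ill-formed pair 
described above. With UTF-8 as the target the string passes through unchanged.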



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

