[
https://issues.apache.org/jira/browse/CALCITE-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18090269#comment-18090269
]
Piotr Karwasz commented on CALCITE-7609:
----------------------------------------
[~zabetak],
Thanks for opening this discussion.
A quick update first: [~ggregory] has created a [commons-xml
repository|https://github.com/apache/commons-xml], so development of the
library will continue under Commons.
There are real pros and cons to weigh when deciding whether to replace 10-20
lines of code with an external dependency. They are exactly the ones I weighed
when writing the library.
XML in Java is unlike other formats in two respects:
* It is *pluggable*: the classpath can influence which implementation is
selected. Since Java 9 you can opt out via the {{newDefaultInstance()}}
methods, but most of our libraries will keep Java 8 compatibility until its EOL
in 2030.
* An implementation ships with the JDK. For a long time it *trailed* the
standalone Xerces and Xalan projects, but those are now largely dormant, so the
JDK implementations are often the better choice today. The JDK still lags
third-party options for {{TransformerFactory}} (Saxon) and {{XMLInputFactory}}
(Woodstox).
So there are three ways to approach XML processing:
* use the pluggable factories;
* pin to the JDK-provided factories (by requiring Java 9+ or via reflection);
* use an external library.
h2. Per-project XML helpers
Each project writing its own helpers to build secure factories is the current
_status quo_.
h3. Pros
* the hardening code has been around for decades;
* it is only a dozen lines;
* no external dependency.
h3. Cons
* it relies on features that not every pluggable implementation supports;
* hardening may be needed not only on the JAXP factory, but also on the objects
it creates. For example:
** configuring {{SAXParserFactory}} is not enough; each {{XMLReader}} it
produces must be configured too;
** {{TransformerFactory.newTransformer(Source)}} and
{{Transformer.transform(Source, Result)}} can fall back to a default parser
when the {{Source}} carries none (e.g. a {{StreamSource}});
* the hardening code usually has no tests;
* the Java 9 "default" factories largely solve the problems above, with the
notable exception of {{XMLInputFactory}}, as Tika learned the hard way recently;
* the Java 9 "default" factories on Android differ from those on the JVM and do
not support the same features.
h2. Commons XML library
The library aims to be easy to use, so it wraps factories when a hardening step
must be deferred.
h3. Pros
* hardening is guaranteed, or the helper throws on an unsupported
implementation;
* hardening is tested;
* hardening lives only at factory-creation level: every object the factory
produces is safe to use unconfigured.
h3. Cons
* it is an *external* dependency;
* even though we intend to make individual hardening recipes easier to shade
separately, the library will be around 40-50 KiB for "a few lines of code".
h2. Preventing regressions
Whichever option we pick, we can keep the raw bootstraps out of the codebase
with [forbidden-apis|https://github.com/policeman-tools/forbidden-apis], which
Calcite already runs. A small signatures file banning
{{DocumentBuilderFactory.newInstance()}}, {{TransformerFactory.newInstance()}},
{{XPathFactory.newInstance()}} and the rest of the
{{newInstance}}/{{newFactory}} family would fail the build on any new direct
use, leaving the helper (or {{commons-xml}}) as the only sanctioned entry point.
> Use Copernik XML Factory for safely parsing XML documents
> ---------------------------------------------------------
>
> Key: CALCITE-7609
> URL: https://issues.apache.org/jira/browse/CALCITE-7609
> Project: Calcite
> Issue Type: Task
> Components: core
> Reporter: Stamatis Zampetakis
> Priority: Major
>
> There are a few places in the project where we manipulate XML documents
> (notably inside the \{{XmlFunctions}} class). The default settings for the
> various XML APIs provide weak security guarantees leading to vulnerabilities
> like the one we had to address in CALCITE-5263.
> The
> [copernik-xml-factory|https://github.com/copernik-eu/copernik-xml-factory]
> library provides various APIs returning factory instances with tighter
> security guarantees and could be used instead of calls to the raw APIs.
> The goal of this ticket is to investigate if its worth introducing the
> copernik library to the project for hardening the security around XML parsing.
> If we decide to move forward with this change we should also find a way to
> prevent the usage of the raw APIs (and always pass through copernik)
> otherwise sooner or later we will end up in the same situation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)