Hi,
I found a "feature" related to the SHA-1 message digest that is stored in
XmlDocumentProperties when parsing an InputStream together with the
LOAD_STRIP_WHITESPACE option. The digest seems to be calculated over the
unstripped XML, while the parsed result is the stripped XML.
This might be related to the usage of DigestInputStream in the method "parse (
InputStream jiois, SchemaType type, XmlOptions options )" in the class
SchemaTypeLoaderBase, because the message digest is updated automatically as
bytes are read from the DigestInputStream, regardless of whether a read byte is
stripped afterwards.
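For illustration, here is a minimal JDK-only sketch of the mechanism I suspect.
It does not use XMLBeans at all: the class DigestStreamSketch, its methods and
the whitespace-dropping loop are made up for this example and only stand in for
whatever the parser does with the bytes after reading them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class DigestStreamSketch {

    static MessageDigest newSha1() {
        try {
            return MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Digest of the bytes exactly as given. */
    static byte[] digestOf(byte[] data) {
        return newSha1().digest(data);
    }

    /**
     * Reads data through a DigestInputStream while a hypothetical consumer
     * drops whitespace bytes into 'kept'. Returns the digest the stream saw.
     */
    static byte[] digestSeenByStream(byte[] data, ByteArrayOutputStream kept) {
        MessageDigest md = newSha1();
        try (DigestInputStream in =
                 new DigestInputStream(new ByteArrayInputStream(data), md)) {
            int b;
            while ((b = in.read()) != -1) {
                // The digest was already updated by read(); dropping the
                // byte here has no effect on it.
                if (!Character.isWhitespace(b)) {
                    kept.write(b);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return md.digest();
    }

    public static void main(String[] args) {
        byte[] raw = "<doc>\n  <e1 />\n</doc>\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] seen = digestSeenByStream(raw, kept);

        // Matches the digest of the raw, unstripped bytes
        System.out.println(Arrays.equals(seen, digestOf(raw)));
        // Does not match the digest of what the consumer kept
        System.out.println(Arrays.equals(seen, digestOf(kept.toByteArray())));
    }
}
```

If XMLBeans strips the whitespace only after the bytes have passed through the
DigestInputStream, this would explain the result I see.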
Other stripping XmlOptions might have this "feature" as well, although I haven't
verified it.
The sample below shows the behaviour: Digest 1 and Digest 2 are equal, while
Digest 3 differs. As I see it, Digest 2 and Digest 3 should be equal, and both
should differ from Digest 1.
String input = ""
+ "<!DOCTYPE doc [<!ATTLIST e9 attr CDATA \"default\">]>\n"
+ "<!-- Comment 2 --><doc>\n"
+ " <e1 />\n"
+ " <e2 ></e2>\n"
+ " <e3 name = \"elem3\" id=\"elem3\" />\n"
+ " <e4 name=\"elem4\" id=\"elem4\" ></e4>\n"
+ " <e5 a:attr=\"out\" b:attr=\"sorted\" attr2=\"all\" attr=\"I'm\"\n"
+ " xmlns:b=\"http://www.ietf.org\"\n"
+ " xmlns:a=\"http://www.w3.org\"\n"
+ " xmlns=\"http://example.org\"/>\n"
+ " <e6 xmlns=\"\" xmlns:a=\"http://www.w3.org\">\n"
+ " <e7 xmlns=\"http://www.ietf.org\">\n"
+ " <e8 xmlns=\"\" xmlns:a=\"http://www.w3.org\">\n"
+ " <e9 xmlns=\"\" xmlns:a=\"http://www.ietf.org\"/>\n"
+ " <text>©</text>\n"
+ " </e8>\n"
+ " </e7>\n"
+ " </e6>\n"
+ "</doc><!-- Comment 3 -->\n";
// Calculate digest over original message
try {
    MessageDigest md = MessageDigest.getInstance("SHA1");
    DigestInputStream in = new DigestInputStream(
        new ByteArrayInputStream( input.getBytes() ), md);
    byte[] buffer = new byte[8192];
    while (in.read(buffer) != -1) ; // reading alone updates the digest
    byte[] raw = md.digest();
    // Digest of original XML, including whitespace
    System.out.println( "Digest 1: " + new String( raw ) );
} catch( Exception e ) {
    e.printStackTrace();
    System.exit( -1 );
}
// Parse XML with whitespace stripping and message digest options set
XmlOptions options = new XmlOptions();
options.setLoadStripWhitespace();
options.setLoadMessageDigest();
XmlObject xo = null;
try {
    xo = XmlObject.Factory.parse(
        new ByteArrayInputStream( input.getBytes() ), options );
} catch ( XmlException e ) {
    e.printStackTrace();
    System.exit(-1);
} catch( IOException e ) {
    e.printStackTrace();
    System.exit(-1);
}
// Digest of parsed XML
System.out.println( "Digest 2: " + new String(
    xo.documentProperties().getMessageDigest() ) );
// Calculate digest over parsed XML
try {
    MessageDigest md = MessageDigest.getInstance("SHA1");
    DigestInputStream in = new DigestInputStream( xo.newInputStream(), md);
    byte[] buffer = new byte[8192];
    while (in.read(buffer) != -1) ; // reading alone updates the digest
    byte[] raw = md.digest();
    // Digest of parsed XML, excluding whitespace
    System.out.println( "Digest 3: " + new String( raw ) );
} catch( Exception e ) {
    e.printStackTrace();
    System.exit( -1 );
}
An obvious workaround is to calculate the message digest manually after
parsing. However, from a performance perspective it is better to have the
digest calculated during parsing, since otherwise you have to run over the XML
twice.
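A rough sketch of that workaround, using only the JDK: the class
PostParseDigestSketch and its helper methods are names I made up, and the
ByteArrayInputStream in main() is just a stand-in for the stream that
xo.newInputStream() returns in the sample above. The digest is printed as hex
so that it can be compared reliably.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class PostParseDigestSketch {

    /** Reads the stream to its end and returns the SHA-1 digest of its bytes. */
    static byte[] sha1(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n); // only the bytes actually read
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lower-case hex rendering of a digest. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the serialized, already stripped document; with
        // XMLBeans this would be xo.newInputStream() from the sample.
        InputStream serialized = new ByteArrayInputStream(
            "<doc><e1/></doc>".getBytes(StandardCharsets.UTF_8));
        System.out.println("Digest 3: " + toHex(sha1(serialized)));
    }
}
```

This is the second pass over the XML that I would like to avoid.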
What do you think: is this intended behaviour or not?
Cheers
>> Sami Mäkelä
Heimore Group