[ https://issues.apache.org/jira/browse/TIKA-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767283#comment-17767283 ]
Thorsten Heit commented on TIKA-4137:
-------------------------------------

I debugged the failing method {{StackTraceTest#testEmptyParser()}} in Eclipse, turned on logging in CXF, and slightly changed the log4j configuration XML so that more debug data was logged. With this I was able to narrow down the part of the code where the error is thrown:

When {{tika-core/org.apache.tika.sax.XHTMLContentHandler#lazyEndHead(...)}} is called, the following snippet is executed (lines 183ff):

{code:java}
} else {
    // TIKA-725: Prefer <title></title> over <title/>
    super.characters(new char[0], 0, 0);
}
super.endElement(XHTML, "title", "title");
{code}

The call to {{super.endElement()}} produces a SAXException that is caught in {{tika-core/org.apache.tika.parser.CompositeParser#parse(...)}}, line 306. According to the debugger the exception contains the following (the German message translates to "An invalid XML character (Unicode: 0x0) was found in the character data of the node."):

{noformat}
org.apache.tika.sax.TaggedSAXException: Ungültiges XML-Zeichen (Unicode: 0x0) wurde in den Zeichendaten des Knotens gefunden.
org.apache.tika.sax.TaggedSAXException: Ungültiges XML-Zeichen (Unicode: 0x0) wurde in den Zeichendaten des Knotens gefunden.
org.xml.sax.SAXException: Ungültiges XML-Zeichen (Unicode: 0x0) wurde in den Zeichendaten des Knotens gefunden.
{noformat}

I have checked the release notes of Java 18-20, but haven't seen any hints of changes to the XML / XSLT code in the JDK that could cause this change in behaviour.
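To compare JDK versions outside of Tika, a minimal sketch along the following lines might help. It replays the same zero-length {{characters(new char[0], 0, 0)}} call followed by {{endElement()}} against the JDK's built-in serializer and reports whether a SAXException is raised. It is an assumption that Tika's server output goes through this serializer in the same way; the class name {{EmptyTitleRepro}} and the method {{serializeEmptyTitle()}} are made up for illustration, and the sketch may well not reproduce the failure.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical standalone check, not part of Tika.
public class EmptyTitleRepro {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Returns "OK: <xml>" on success or "SAXException: <message>" on failure. */
    static String serializeEmptyTitle() {
        TransformerHandler handler;
        try {
            // Assumption: the default TransformerFactory is the JDK's built-in one.
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            handler = factory.newTransformerHandler();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));
        try {
            handler.startDocument();
            handler.startPrefixMapping("", XHTML);
            handler.startElement(XHTML, "title", "title", new AttributesImpl());
            // Same call as in lazyEndHead(): empty array, offset 0, length 0.
            handler.characters(new char[0], 0, 0);
            handler.endElement(XHTML, "title", "title");
            handler.endPrefixMapping("");
            handler.endDocument();
            return "OK: " + out;
        } catch (SAXException e) {
            return "SAXException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version=" + System.getProperty("java.version"));
        System.out.println(serializeEmptyTitle());
    }
}
```

Running {{main}} once under Java 17 and once under Java 21 and comparing the two lines of output would show whether the zero-length call alone is enough to trigger the exception, or whether something else in the handler chain is involved.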
As far as I understand, the problem is the {{new char[0]}} in the {{lazyEndHead(...)}} snippet: a null character (0x0) seems to be processed even though the array is empty and the length is 0. It ends up deep in the {{java.xml}} module in {{com.sun.org.apache.xml.internal.serializer.ToStream#accumDefaultEscape()}}, where the following snippet throws the exception:

{code:java}
if (!isVer11 && XMLChar.isInvalid(ch)) {
    throw new org.xml.sax.SAXException(Utils.messages.createMessage(
            MsgKey.ER_WF_INVALID_CHARACTER_IN_TEXT,
            new Object[]{Integer.toHexString(ch)}));
}
{code}

> Building current Tika main branch fails under Java 20/21
> --------------------------------------------------------
>
>                 Key: TIKA-4137
>                 URL: https://issues.apache.org/jira/browse/TIKA-4137
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.0.0-BETA
>            Reporter: Thorsten Heit
>            Priority: Major
>         Attachments: org.apache.tika.server.core.StackTraceOffTest.txt,
>                      org.apache.tika.server.core.StackTraceTest.txt,
>                      org.apache.tika.server.core.TikaResourceFetcherTest.txt,
>                      org.apache.tika.server.core.TikaResourceTest.txt
>
>
> When I execute "mvn verify" on the current main branch using Java 11 or Java 17, the build completes. With Java 20 and 21 the same command fails because now a couple of JUnit tests in tika-server-core fail:
> {noformat}
> (...)
> [INFO] Running org.apache.tika.server.core.StackTraceTest
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.034 s <<< FAILURE! -- in org.apache.tika.server.core.StackTraceTest
> [ERROR] org.apache.tika.server.core.StackTraceTest.testEmptyParser -- Time elapsed: 0.007 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: bad type: /tika ==> expected: <200> but was: <500>
>     at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>     at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:559)
>     at org.apache.tika.server.core.StackTraceTest.testEmptyParser(StackTraceTest.java:132)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
> WARN [main] 21:28:26,651 org.apache.tika.pipes.PipesServer received -1 from client; shutting down
> ERROR [main] 21:28:26,652 org.apache.tika.pipes.PipesServer exiting: 1
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR] StackTraceOffTest.testEmptyParser:137 bad type: /tika ==> expected: <200> but was: <500>
> [ERROR] StackTraceTest.testEmptyParser:132 bad type: /tika ==> expected: <200> but was: <500>
> [ERROR] TikaResourceFetcherTest.testHeader:101->CXFTestBase.assertContains:66 hello world not found in:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
>
> <head>
>
> <meta name="my-key" content="parsers-value"/>
>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.mock.MockParser"/>
>
> <meta name="author" content="Nikolai Lobachevsky"/>
>
> <meta name="X-TIKA:sourcePath" content="mock/hello_world.xml"/>
> ==> expected: <true> but was: <false>
> [ERROR] TikaResourceFetcherTest.testQueryPart:109->CXFTestBase.assertContains:66 hello world not found in:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
>
> <head>
>
> <meta name="my-key" content="parsers-value"/>
>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.mock.MockParser"/>
>
> <meta name="author" content="Nikolai Lobachevsky"/>
>
> <meta name="X-TIKA:sourcePath" content="mock/hello_world.xml"/>
> ==> expected: <true> but was: <false>
> [ERROR] TikaResourceTest.testHeaders:91->CXFTestBase.assertContains:66 <meta name="mymeta" content="first,second,third"/> not found in:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
>
> <head>
>
> <meta name="my-key" content="parsers-value"/>
>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.mock.MockParser"/>
>
> <meta name="author" content="Nikolai Lobachevsky"/>
>
> <meta name="X-TIKA:digest:SHA1" content="R5FG5V2U44YXOZTMKGVNTTSPGLF2JH ==> expected: <true> but was: <false>
> [ERROR] TikaResourceTest.testNoWriteLimitOnStreamingWrite:187->CXFTestBase.assertContains:66 separation.</p> not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="my-key" content="parsers-value"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.mock.MockParser"/>
> <meta name="author" content="Nikolai Lobachevsky"/>
> <meta name="X-TIKA:digest:SHA1" content="AQWEMUMSJVFZWYGM4TKXRTQ5Q436X4DN"/>
> <meta name="Content-Length" content="1562"/>
> <meta name="X ==> expected: <true> but was: <false>
> [INFO]
> [ERROR] Tests run: 75, Failures: 6, Errors: 0, Skipped: 7
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)