[ 
https://issues.apache.org/jira/browse/TIKA-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767283#comment-17767283
 ] 

Thorsten Heit commented on TIKA-4137:
-------------------------------------

I debugged the failing method {{StackTraceTest#testEmptyParser()}} in Eclipse, 
turned on logging in CXF, and changed the log4j configuration XML slightly so 
that more debug data was logged. This let me narrow down the part of the code 
where the error is thrown:

When {{tika-core/org.apache.tika.parser.XHTMLContentHandler#lazyEndHead(...)}} 
is called, the following snippet is executed (lines 183ff):

{code:java}
            } else {
                // TIKA-725: Prefer <title></title> over <title/>
                super.characters(new char[0], 0, 0);
            }
            super.endElement(XHTML, "title", "title");
{code}

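The call to {{super.endElement()}} throws a SAXException that is caught in 
{{tika-core/org.apache.tika.parser.CompositeParser#parse(...)}}, line 306. 
According to the debugger, the exception contains the following (the German 
message translates to "An invalid XML character (Unicode: 0x0) was found in 
the character data of the node."):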

{noformat}
org.apache.tika.sax.TaggedSAXException: Ungültiges XML-Zeichen (Unicode: 0x0) 
wurde in den Zeichendaten des Knotens gefunden.
org.apache.tika.sax.TaggedSAXException: Ungültiges XML-Zeichen (Unicode: 0x0) 
wurde in den Zeichendaten des Knotens gefunden.
org.xml.sax.SAXException: Ungültiges XML-Zeichen (Unicode: 0x0) wurde in den 
Zeichendaten des Knotens gefunden.
{noformat}
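
One way to check whether the JDK serializer alone is involved would be to 
replay the same SAX events outside Tika. The following is only a minimal 
sketch: it assumes the JDK's default identity {{TransformerHandler}} is 
comparable to the serializer used by tika-server, which the issue does not 
confirm, and the class name {{EmptyTitleRepro}} is made up for illustration. 
Whether it reproduces the failure on Java 20/21 is not established here.

```java
import java.io.StringWriter;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika.
class EmptyTitleRepro {

    /**
     * Sends startElement / zero-length characters / endElement for a
     * "title" element to the JDK's default serializer. Returns the
     * serialized XML, or a string starting with "SAXException" if the
     * serializer rejects the events.
     */
    static String serializeEmptyTitle() {
        TransformerHandler handler;
        try {
            // Assumption: the default JDK factory supports SAX handlers.
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            handler = factory.newTransformerHandler();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));
        try {
            handler.startDocument();
            // No namespace is used here, to keep the sketch small.
            handler.startElement("", "title", "title", new AttributesImpl());
            // Same call shape as in lazyEndHead(): empty array, length 0.
            handler.characters(new char[0], 0, 0);
            handler.endElement("", "title", "title");
            handler.endDocument();
            return out.toString();
        } catch (SAXException e) {
            return "SAXException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.version"));
        System.out.println(serializeEmptyTitle());
    }
}
```

Running it under Java 17 and under Java 20/21 and comparing the two outputs 
would show whether the serializer behaves differently for this event sequence.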

I have checked the release notes of Java 18-20 but found no hints of changes 
in the JDK's XML / XSLT code that could cause this change in behaviour.

As far as I understand, the problem is the {{new char[0]}} in the code above: 
a null character seems to be processed, which finally ends up deep in the 
{{java.xml}} module, in the class 
{{com.sun.org.apache.xml.internal.serializer.ToStream#accumDefaultEscape()}}, 
where the following snippet throws the exception:

{code:java}
                if (!isVer11 && XMLChar.isInvalid(ch)) {
                    throw new org.xml.sax.SAXException(Utils.messages.createMessage(
                            MsgKey.ER_WF_INVALID_CHARACTER_IN_TEXT,
                            new Object[]{Integer.toHexString(ch)}));
{code}
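
For context, the check above rejects characters that XML 1.0 does not allow 
in character data. The sketch below restates the XML 1.0 {{Char}} production 
as a small predicate; it is not the JDK's {{XMLChar}} implementation, and the 
names {{Xml10CharCheck}} and {{isInvalidXml10Char}} are made up for 
illustration.

```java
// Hypothetical illustration of the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
class Xml10CharCheck {

    /** Returns true if the code point is NOT a legal XML 1.0 character. */
    static boolean isInvalidXml10Char(int c) {
        boolean valid = c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
        return !valid;
    }

    public static void main(String[] args) {
        // 0x0 is the value reported in the exception message.
        System.out.println(isInvalidXml10Char(0x0));
        // An ordinary letter is allowed.
        System.out.println(isInvalidXml10Char('a'));
        // The array passed in lazyEndHead() holds no characters at all.
        System.out.println(new char[0].length);
    }
}
```

Since {{new char[0]}} has length 0, this suggests that the 0x0 value in the 
message may originate somewhere other than the array contents, for example in 
serializer state; the issue does not establish which.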

> Building current Tika main branch fails under Java 20/21
> --------------------------------------------------------
>
>                 Key: TIKA-4137
>                 URL: https://issues.apache.org/jira/browse/TIKA-4137
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.0.0-BETA
>            Reporter: Thorsten Heit
>            Priority: Major
>         Attachments: org.apache.tika.server.core.StackTraceOffTest.txt, 
> org.apache.tika.server.core.StackTraceTest.txt, 
> org.apache.tika.server.core.TikaResourceFetcherTest.txt, 
> org.apache.tika.server.core.TikaResourceTest.txt
>
>
> When I execute "mvn verify" on the current main branch using Java 11 or Java 
> 17, the build completes. With Java 20 and 21 the same command fails because 
> now a couple of JUnit tests in tika-server-core fail:
> {noformat}
> (...)
> [INFO] Running org.apache.tika.server.core.StackTraceTest
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.034 s <<< FAILURE! -- in org.apache.tika.server.core.StackTraceTest
> [ERROR] org.apache.tika.server.core.StackTraceTest.testEmptyParser -- Time elapsed: 0.007 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: bad type: /tika ==> expected: <200> but was: <500>
>       at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>       at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>       at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>       at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>       at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:559)
>       at org.apache.tika.server.core.StackTraceTest.testEmptyParser(StackTraceTest.java:132)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
> WARN  [main] 21:28:26,651 org.apache.tika.pipes.PipesServer received -1 from client; shutting down
> ERROR [main] 21:28:26,652 org.apache.tika.pipes.PipesServer exiting: 1
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   StackTraceOffTest.testEmptyParser:137 bad type: /tika ==> expected: <200> but was: <500>
> [ERROR]   StackTraceTest.testEmptyParser:132 bad type: /tika ==> expected: <200> but was: <500>
> [ERROR]   
> TikaResourceFetcherTest.testHeader:101->CXFTestBase.assertContains:66 hello 
> world not found in:
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
>     
>     <head>
>         
>         <meta name="my-key" content="parsers-value"/>
>         
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.DefaultParser"/>
>         
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.mock.MockParser"/>
>         
>         <meta name="author" content="Nikolai Lobachevsky"/>
>         
>         <meta name="X-TIKA:sourcePath" content="mock/hello_world.xml"/>
>         ==> expected: <true> but was: <false>
> [ERROR]   
> TikaResourceFetcherTest.testQueryPart:109->CXFTestBase.assertContains:66 
> hello world not found in:
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
>     
>     <head>
>         
>         <meta name="my-key" content="parsers-value"/>
>         
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.DefaultParser"/>
>         
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.mock.MockParser"/>
>         
>         <meta name="author" content="Nikolai Lobachevsky"/>
>         
>         <meta name="X-TIKA:sourcePath" content="mock/hello_world.xml"/>
>         ==> expected: <true> but was: <false>
> [ERROR]   TikaResourceTest.testHeaders:91->CXFTestBase.assertContains:66 
> <meta name="mymeta" content="first,second,third"/> not found in:
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
>     
>     <head>
>         
>         <meta name="my-key" content="parsers-value"/>
>         
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.DefaultParser"/>
>         
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.mock.MockParser"/>
>         
>         <meta name="author" content="Nikolai Lobachevsky"/>
>         
>         <meta name="X-TIKA:digest:SHA1" 
> content="R5FG5V2U44YXOZTMKGVNTTSPGLF2JH ==> expected: <true> but was: <false>
> [ERROR]   
> TikaResourceTest.testNoWriteLimitOnStreamingWrite:187->CXFTestBase.assertContains:66
>  separation.</p> not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
>     <head>
>         <meta name="my-key" content="parsers-value"/>
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.DefaultParser"/>
>         <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.mock.MockParser"/>
>         <meta name="author" content="Nikolai Lobachevsky"/>
>         <meta name="X-TIKA:digest:SHA1" 
> content="AQWEMUMSJVFZWYGM4TKXRTQ5Q436X4DN"/>
>         <meta name="Content-Length" content="1562"/>
>         <meta name="X ==> expected: <true> but was: <false>
> [INFO] 
> [ERROR] Tests run: 75, Failures: 6, Errors: 0, Skipped: 7
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
