The cause of my problem was that the MIME type of the XML files was set to 
text/xml (inside the application).
With the MIME type set to application/xml, the XMLParser is called.

The alias in tika-mimetypes.xml Line 3200:

<mime-type type="application/xml">
  <alias type="text/xml"/>

does not seem to have any effect.

Nor does registering text/xml for the XMLParser in tika-config.xml.
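
Since the alias is not picked up, one possible workaround is to map the alias 
to the canonical type in the application before the MIME type is stored. The 
following is only a minimal sketch under that assumption: the class 
MimeTypeNormalizer and its alias table are hypothetical and not part of Tika 
or Jackrabbit, and the single alias entry simply mirrors the one quoted above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Tika or Jackrabbit class): maps alias MIME types
// to the canonical type that the parser is registered for.
final class MimeTypeNormalizer {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Mirrors the alias entry quoted from tika-mimetypes.xml.
        ALIASES.put("text/xml", "application/xml");
    }

    private MimeTypeNormalizer() {}

    /**
     * Returns the canonical type for a declared MIME type.
     * Parameters such as "; charset=UTF-8" are dropped and case is ignored.
     */
    static String normalize(String declared) {
        if (declared == null) {
            return null;
        }
        int semicolon = declared.indexOf(';');
        String base = (semicolon >= 0 ? declared.substring(0, semicolon) : declared)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Unknown types are passed through unchanged.
        return ALIASES.getOrDefault(base, base);
    }

    public static void main(String[] args) {
        System.out.println(normalize("text/xml; charset=UTF-8")); // application/xml
        System.out.println(normalize("application/pdf"));         // application/pdf
    }
}
```

The application would call normalize(...) on the type it is about to store, so 
that the indexer only ever sees application/xml for XML documents.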

Regards
Paul

-----Original Message-----
From: Paul Bernet [mailto:[email protected]] 
Sent: Wednesday, 17 July 2013 18:41
To: [email protected]
Subject: Problem with Indexing XML Docs with Tika in Jackrabbit 2.6.2

Hi,

I am migrating a Jackrabbit instance from 2.2.13 to 2.6.2 using:
jackrabbit-core
jackrabbit-jcr-commons
jackrabbit-jcr-rmi
For indexing I am using the tika-core module and parts of tika-parsers.
Because the full tika-parsers module causes problems (among others, 
aspectjrt-1.6.x.jar conflicts with my one-jar pkg meccano), I am trying to 
include only the parser classes and dependencies needed to index .pdf and 
.xml files. Indexing via the PDFParser works, but the DcXMLParser is not 
executed and no content ends up in the index.
When I configure the EmptyParser for the application/xml MIME type, the 
EmptyParser is not called either.

What confuses me is that the PDFParser config is read from 
tika-config.xml (I can prove that by falsifying the class name) and the parser 
is called at runtime.
The XMLParser config, however, is read as well, but the parser is not called 
at runtime.

tika-config.xml
...
<mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml" 
magic="false"/>
<parsers>
  <parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
    <mime>application/pdf</mime>
  </parser>

  <parser name="parse-dcxml" class="org.apache.tika.parser.xml.DcXMLParser">
    <mime>application/xml</mime>
    <mime>image/svg+xml</mime>
  </parser>

  <parser class="org.apache.tika.parser.DefaultParser"/>
  <parser class=" org.apache.tika.parser.EmptyParser ">
    <!--  <mime>application/xml</mime> -->
  </parser>
</parsers>
....

The XML files have the MIME type application/xml.
The other configuration file, 
/resources/META-INF/services/org.apache.tika.parser.Parser, is in a sub-jar of 
the one-jar pkg. Because that had no effect, I took it out of the jar and 
referenced it explicitly on the classpath at startup, but that had no effect 
either. Is this file needed for the parsers to work?
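
For context, a file under META-INF/services/ follows the standard Java 
service-provider format: one fully qualified class name per line, with # 
starting a comment, and java.util.ServiceLoader looks it up as a classpath 
resource named META-INF/services/<interface name>. The sketch below only 
illustrates how such a file's content is interpreted; ServiceFileSketch is a 
hypothetical name and this is not Tika's own loader code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the service-provider file format;
// not the loader that Tika or the JDK actually uses.
final class ServiceFileSketch {
    private ServiceFileSketch() {}

    /** Extracts provider class names, skipping blank lines and # comments. */
    static List<String> providerNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "# parsers to register\n"
                + "org.apache.tika.parser.pdf.PDFParser\n"
                + "org.apache.tika.parser.xml.DcXMLParser\n";
        System.out.println(providerNames(content));
    }
}
```

Only the classes listed in such a file can be discovered this way, so a 
trimmed-down parser set would need a correspondingly trimmed file.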

Thanks for any hints!
Paul
