I see this in org.apache.jackrabbit.core.query.lucene.NodeIndexer:
/**
 * Returns <code>true</code> if the provided type is among the types
 * supported by the Tika parser we are using.
 *
 * @param type the type to check.
 * @return whether the type is supported by the Tika parser we are
 *         using.
 */
protected boolean isSupportedMediaType(final String type) {
    if (supportedMediaTypes == null) {
        supportedMediaTypes = parser.getSupportedTypes(null);
    }
    return supportedMediaTypes.contains(MediaType.parse(type));
}
In my setup, supportedMediaTypes gets loaded with:
application/x-tar,
application/x-bzip,
application/x-bzip2,
image/x-icon,
image/vnd.wap.wbmp,
image/vnd.adobe.photoshop,
application/x-cpio,
image/x-xcf,
application/zip,
image/x-ms-bmp,
image/jpeg,
image/png,
application/x-gtar,
application/x-archive,
image/gif,
application/x-gzip
This way, the MIME types I have (txt, office, pdf) will never be extracted...
But where is this configured? I ask because the default
tika-config.xml is:
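To make the consequence concrete, here is a minimal sketch of the same membership check against the set listed above. It is not the Jackrabbit or Tika code: the class name SupportedTypeCheck is hypothetical, plain strings stand in for Tika's MediaType objects, and the trim/lower-case step is a simplification of my own.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the check in NodeIndexer; uses only the JDK.
class SupportedTypeCheck {

    // The types reported in this message, copied as plain strings.
    // Assumption: strings replace Tika's MediaType for illustration only.
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "application/x-tar", "application/x-bzip", "application/x-bzip2",
            "image/x-icon", "image/vnd.wap.wbmp", "image/vnd.adobe.photoshop",
            "application/x-cpio", "image/x-xcf", "application/zip",
            "image/x-ms-bmp", "image/jpeg", "image/png",
            "application/x-gtar", "application/x-archive", "image/gif",
            "application/x-gzip"));

    // Returns true only if the normalized type is in the set above.
    static boolean isSupportedMediaType(String type) {
        String normalized = type.trim().toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(normalized);
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/plain", "application/pdf", "application/msword", "image/png"
        };
        for (String t : samples) {
            // Only image/png is in the set; the other three print false.
            System.out.println(t + " -> " + isSupportedMediaType(t));
        }
    }
}
```

With this set, text/plain, application/pdf and application/msword all fail the check, which matches what I am seeing: only archives and images pass.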
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/x-archive</mime>
      <mime>application/x-bzip</mime>
      <mime>application/x-bzip2</mime>
      <mime>application/x-cpio</mime>
      <mime>application/x-gtar</mime>
      <mime>application/x-gzip</mime>
      <mime>application/x-tar</mime>
      <mime>application/zip</mime>
      <mime>image/bmp</mime>
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
      <mime>image/vnd.wap.wbmp</mime>
      <mime>image/x-icon</mime>
      <mime>image/x-psd</mime>
      <mime>image/x-xcf</mime>
    </parser>
  </parsers>
</properties>
I feel I am almost there.
The documentation is a bit lacking on this...
Best Regards
Helio
--
View this message in context:
http://jackrabbit.510166.n4.nabble.com/jackrabbit-2-6-0-Full-Text-Search-tp4658832p4659000.html