[ 
https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758867#comment-15758867
 ] 

David Pilato edited comment on TIKA-2208 at 12/18/16 1:50 PM:
--------------------------------------------------------------

So we now have a regression in the Elasticsearch tests.
We test that the Tika test files are parsed correctly, using a subset of 
https://github.com/apache/tika/tree/master/tika-parsers/src/test/resources/test-documents

Before we excluded {{x-tika-ooxml}}, we were able to parse the 
{{testPPT.potm}} file.
After applying the exclusion, the document comes back empty. Before the 
change, the following was extracted:

{code}
Attachment Test
Rajiv
This is a test file data with the same content as every other file being tested 
for tika content parsing. This has been developed by Rajiv Kumar Nistala.
Different words to test against
Quest
Hello
Watershed
Avalanche
Black Panther
Mystery
Banking
Investment
{code}

I think I'm just going to add the missing libraries, as I don't think I can 
exclude only the Visio content, right?







> Catch missing libraries
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract 
> text and metadata.
> We defined our list of Parsers:
> {code:java}
>     private static final Parser PARSERS[] = new Parser[] {
>         // documents
>         new org.apache.tika.parser.html.HtmlParser(),
>         new org.apache.tika.parser.rtf.RTFParser(),
>         new org.apache.tika.parser.pdf.PDFParser(),
>         new org.apache.tika.parser.txt.TXTParser(),
>         new org.apache.tika.parser.microsoft.OfficeParser(),
>         new org.apache.tika.parser.microsoft.OldExcelParser(),
>         new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>         new org.apache.tika.parser.odf.OpenDocumentParser(),
>         new org.apache.tika.parser.iwork.IWorkPackageParser(),
>         new org.apache.tika.parser.xml.DcXMLParser(),
>         new org.apache.tika.parser.epub.EpubParser(),
>     };
>     private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
>     private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds an unsupported document 
> (such as a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch that case and throw a {{TikaException}} 
> instead, so it can be handled as an Exception rather than as a Throwable?
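
The request above can be illustrated with a minimal sketch of a caller-side workaround. It deliberately avoids the Tika API so that it stands alone: {{SafeParse}}, {{call}} and {{MissingDependencyException}} are hypothetical names, and {{MissingDependencyException}} only stands in for the role {{TikaException}} would play. It is not how Tika itself handles this case.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Tika: converts a missing-class Error
// into a checked exception that ordinary "catch (Exception e)" code handles.
final class SafeParse {

    /** Hypothetical checked exception standing in for TikaException. */
    static class MissingDependencyException extends Exception {
        MissingDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs one extraction step. NoClassDefFoundError extends Error, so a
     * plain "catch (Exception e)" would not see it; here it is caught
     * explicitly and rethrown as a checked exception with the cause kept.
     */
    static <T> T call(Callable<T> step) throws Exception {
        try {
            return step.call();
        } catch (NoClassDefFoundError e) {
            throw new MissingDependencyException(
                "Optional parser dependency is missing: " + e.getMessage(), e);
        }
    }

    private SafeParse() {
        // static helper only
    }
}
```

A caller would wrap its extraction call in {{SafeParse.call(...)}}, passing the parse step as a lambda, and then handle {{MissingDependencyException}} alongside its other checked exceptions. Only {{NoClassDefFoundError}} is converted; other errors such as {{OutOfMemoryError}} still propagate unchanged.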



