[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754538#comment-15754538 ]
Tim Allison edited comment on TIKA-2208 at 12/16/16 2:24 PM:
-------------------------------------------------------------

So, I think that should be your solution for now, unless [~gagravarr] can think of any unintended consequences, or unless that is too broad for your use case, [~dadoonet].

However, there are two potential issues that we may want to address:

1) Even with the full Tika with all of its dependencies, I'm getting this:

{noformat}
java.lang.NoClassDefFoundError: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
    at com.microsoft.schemas.office.visio.x2012.main.impl.PageContentsTypeImpl.getConnects(Unknown Source)
    at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:89)
    at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:73)
    at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:94)
    at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:108)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
    at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
    at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:206)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:101)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:90)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:190)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:330)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:221)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:137)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:90)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:190)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:527)
    at org.apache.tika.parser.microsoft.OfficeParserTest.testBasic(OfficeParserTest.java:89)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.visio.x2012.main.ConnectsType
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 53 more
{noformat}

I think this means that we should add this test file to POI so that the appropriate classes are loaded into our slimmed down ooxml-schemas...right, Nick?

2) I found it annoying that we have to turn off the full super-type "x-tika-ooxml", when we might want to turn off only one subtype, e.g. "x-tika-visio-ooxml" or one subsubtype, e.g. "vnd.ms-visio.drawing". In other words, when I tried to exclude "vnd.ms-visio-drawing", our exclusion mechanism didn't work. Do we want to fix this?

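
To make point 2 concrete, here is a minimal sketch of an exclusion check that honours any level of the type hierarchy. It is not Tika's actual exclusion code: the class `TypeExclusion`, the `SUPERTYPE` map, and `isExcluded` are hypothetical names, and the `application/` prefixes and parent links are assumed from the super-type / subtype / subsubtype relationship described above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TypeExclusion {
    // Hypothetical parent links, assumed from the comment above:
    // subsubtype -> subtype -> super-type.
    static final Map<String, String> SUPERTYPE = new HashMap<>();
    static {
        SUPERTYPE.put("application/vnd.ms-visio.drawing", "application/x-tika-visio-ooxml");
        SUPERTYPE.put("application/x-tika-visio-ooxml", "application/x-tika-ooxml");
    }

    /** Returns true if the type itself or any of its ancestors is in the excluded set. */
    static boolean isExcluded(String type, Set<String> excluded) {
        for (String t = type; t != null; t = SUPERTYPE.get(t)) {
            if (excluded.contains(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> excluded = Collections.singleton("application/x-tika-visio-ooxml");
        // A Visio drawing is excluded through its parent type.
        System.out.println(isExcluded("application/vnd.ms-visio.drawing", excluded)); // true
        // The broad super-type itself is not excluded, so other OOXML formats still parse.
        System.out.println(isExcluded("application/x-tika-ooxml", excluded)); // false
    }
}
```

With a check of this shape, excluding only the Visio subtype would leave the other OOXML formats enabled. Whether this fits how parsers are selected for a given type would need to be confirmed against the real code.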
was (Author: talli...@mitre.org):
So, I think that should be your solution for now, unless [~gagravarr] can think of any unintended consequences, or unless that is too broad for your use case, [~dadoonet].

However, there are two potential issues that we may want to address:

1) Even with the full Tika with all of its dependencies, I'm getting this:

{noformat}
Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.visio.x2012.main.ConnectsType
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
{noformat}

I think this means that we should add this test file to POI so that the appropriate classes are loaded into our slimmed down ooxml-schemas...right, Nick?

2) I found it annoying that we have to turn off the full super-type "x-tika-ooxml", when we might want to turn off only one subtype, e.g. "x-tika-visio-ooxml" or one subsubtype, e.g. "vnd.ms-visio.drawing". In other words, when I tried to exclude "vnd.ms-visio-drawing", our exclusion mechanism didn't work. Do we want to fix this?

> Catch missing libraires
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract text and metadata.
> We defined our list of Parsers:
> {code:java}
>     private static final Parser[] PARSERS = new Parser[] {
>         // documents
>         new org.apache.tika.parser.html.HtmlParser(),
>         new org.apache.tika.parser.rtf.RTFParser(),
>         new org.apache.tika.parser.pdf.PDFParser(),
>         new org.apache.tika.parser.txt.TXTParser(),
>         new org.apache.tika.parser.microsoft.OfficeParser(),
>         new org.apache.tika.parser.microsoft.OldExcelParser(),
>         new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>         new org.apache.tika.parser.odf.OpenDocumentParser(),
>         new org.apache.tika.parser.iwork.IWorkPackageParser(),
>         new org.apache.tika.parser.xml.DcXMLParser(),
>         new org.apache.tika.parser.epub.EpubParser(),
>     };
>     private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
>     private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds another, unsupported document (like a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch such a case and throw a {{TikaException}} instead, so that it behaves as an Exception and not as a Throwable?
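
Until something like this is handled inside Tika, a caller could approximate the requested behaviour by wrapping the parse call. The following is only a sketch, written without Tika on the classpath so that it runs on its own: `SafeParse`, `call`, and `ParseFailedException` are hypothetical names, with `ParseFailedException` standing in for {{TikaException}}. In real use the callable would presumably wrap something like `TIKA_INSTANCE.parseToString(...)`.

```java
import java.util.concurrent.Callable;

public class SafeParse {
    /** Hypothetical stand-in for TikaException, so the sketch has no Tika dependency. */
    public static class ParseFailedException extends Exception {
        public ParseFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the given parse call and converts a NoClassDefFoundError
     * (an Error, not an Exception) into a checked exception.
     */
    public static <T> T call(Callable<T> parseCall) throws Exception {
        try {
            return parseCall.call();
        } catch (NoClassDefFoundError e) {
            throw new ParseFailedException("Missing optional dependency: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Normal case: the result passes through unchanged.
        System.out.println(call(() -> "extracted text"));

        // Failure case: simulate a parser hitting a class that is not on the classpath.
        try {
            call(() -> { throw new NoClassDefFoundError("com/example/Missing"); });
        } catch (ParseFailedException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Only {{NoClassDefFoundError}} is caught here on purpose; catching every {{Error}} would also swallow conditions such as {{OutOfMemoryError}}, which is probably not what a caller wants.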