[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath
[ https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000462#comment-15000462 ]

Nick Burch commented on TIKA-1791:
----------------------------------

Thanks for the explanation.

Next question - what happens if two calls to {{GeoParser}} use different NER paths? e.g.

{code}
GeoParser parser = new GeoParser();
ParseContext context = new ParseContext();
GeoParserConfig config = new GeoParserConfig();
context.set(GeoParserConfig.class, config);

// First parse: config points at one model path
config.setNERModelPath("/usr/bin");
parser.parse(inputA, metadata, handler, context);

// Second parse: same parser and context, different model path
config.setNERModelPath("/usr/local/bin");
parser.parse(inputB, metadata, handler, context);
{code}

Same parser each time, but different paths on the config. At first glance, it looks like your code would cause the second parse to use the config from the first?

> URI is not hierarchical exception when location model resource is inside a
> jar in classpath
> --------------------------------------------------------------------------
>
>                 Key: TIKA-1791
>                 URL: https://issues.apache.org/jira/browse/TIKA-1791
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: location model file is placed inside a fat Jar (with
> all the dependencies)
>            Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when location NER model resource is packaged
> inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>   at java.io.File.<init>(File.java:418)
>   at org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>   at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>   at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>   at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
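
The exception in the stack trace above can be reproduced without Tika. The following is a minimal sketch, not the fix applied in TIKA-1791: the class name `JarUriDemo`, the helper `tryAsFile`, and both paths are hypothetical and exist only for illustration. It shows that `new File(URI)` accepts a `file:` URI but rejects a `jar:` URI, because the latter is opaque rather than hierarchical.

```java
import java.io.File;
import java.net.URI;

public class JarUriDemo {

    /** Returns the exception message from new File(uri), or null if the conversion succeeds. */
    public static String tryAsFile(URI uri) {
        try {
            new File(uri);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical locations, for illustration only.
        URI onDisk = URI.create("file:/tmp/models/en-ner-location.bin");
        URI inJar = URI.create("jar:file:/tmp/app.jar!/models/en-ner-location.bin");

        // A resource on disk converts to a File without complaint.
        System.out.println("file: URI -> " + tryAsFile(onDisk));
        // A resource inside a jar has an opaque URI, so File rejects it.
        System.out.println("jar: URI  -> " + tryAsFile(inJar));
    }
}
```

When a resource may be packaged inside a jar, reading it as a stream (for example via `URL.openStream()` or `Class.getResourceAsStream`) sidesteps the `File` conversion entirely.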
[jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
[ https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001773#comment-15001773 ]

Vjeran Marcinko commented on TIKA-1788:
---------------------------------------

I don't know the James library at all, so I can't say whether this would negatively affect some other part of the parser, but...

The thing is that Tika's current RFC822Parser indirectly ends up with James' BasicBodyDescriptor instead of MaximalBodyDescriptor, because of the way RFC822Parser instantiates James' MimeStreamParser internally. Suppose the instantiation specified a DefaultBodyDescriptorBuilder:

{code}
MimeStreamParser parser = new MimeStreamParser(config, null, new DefaultBodyDescriptorBuilder());
{code}

This way, during James' parsing, a MaximalBodyDescriptor would be created, which recognizes the Content-Disposition field. It could then be used in Tika's MailContentHandler, say in the body(...) method, if we add:

{code}
public void body(BodyDescriptor body, InputStream is) throws MimeException, IOException {
    // use a different metadata object
    // in order to specify the mime type of the
    // sub part without damaging the main metadata
    Metadata submd = new Metadata();
    submd.set(Metadata.CONTENT_TYPE, body.getMimeType());
    submd.set(Metadata.CONTENT_ENCODING, body.getCharset());

    // MaximalBodyDescriptor exposes the Content-Disposition filename
    if (body instanceof MaximalBodyDescriptor) {
        MaximalBodyDescriptor maximalBodyDescriptor = (MaximalBodyDescriptor) body;
        String contentDispositionFilename = maximalBodyDescriptor.getContentDispositionFilename();
        if (contentDispositionFilename != null) {
            submd.set(Metadata.RESOURCE_NAME_KEY, contentDispositionFilename);
        }
    }
    ...
{code}

> message/rfc822 parser doesn't identify attachment filenames from
> Content-Disposition header
> ----------------------------------------------------------------
>
>                 Key: TIKA-1788
>                 URL: https://issues.apache.org/jira/browse/TIKA-1788
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Sergey Tsalkov
>         Attachments: grep_content_disposition.zip
>
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
>   filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!
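
The `filename*=utf-8''image001.jpg` form quoted in the issue is the RFC 2231 extended parameter syntax: a charset, an optional language, and a percent-encoded value, separated by single quotes. The sketch below decodes such a value by hand to show what the parser has to recover. It is a standalone illustration, not how James or Tika handle it; the class name `ExtendedFilenameDemo` and the method `decode` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class ExtendedFilenameDemo {

    /**
     * Decodes a value shaped like charset'language'percent-encoded-text.
     * Returns null when the value does not have that shape. An unknown charset
     * name or malformed hex digits surface as runtime exceptions.
     */
    public static String decode(String value) {
        int first = value.indexOf('\'');
        int second = first < 0 ? -1 : value.indexOf('\'', first + 1);
        if (first <= 0 || second < 0) {
            return null; // no charset or missing delimiters
        }
        Charset charset = Charset.forName(value.substring(0, first));
        String encoded = value.substring(second + 1);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // %XX is one raw byte in the declared charset
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                // remaining characters are expected to be plain ASCII
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // The value from the issue description.
        System.out.println(decode("utf-8''image001.jpg"));
        // A made-up value with percent-encoded bytes.
        System.out.println(decode("utf-8''na%C3%AFve%20file.txt"));
    }
}
```

`java.net.URLDecoder` is deliberately avoided here because it also turns `+` into a space, which is not part of this syntax.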
[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath
[ https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000975#comment-15000975 ]

Thamme Gowda N commented on TIKA-1791:
--------------------------------------

Thanks for pointing out the issue. I didn't anticipate changes to the configuration after the parser had started to run. It's now handled in `initialize()`:

{code}
if (this.modelUrl != null && this.modelUrl.equals(modelUrl)) {
    // previously initialized for the same URL
    return;
}
{code}

If Tika's environments are that dynamic (for example, the files the URLs point to are frequently updated or deleted), then the parser probably shouldn't keep state. However, as you can see, that is a tradeoff against performance. If this is the case, I can revert to the older way.
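
To make the effect of that guard concrete, here is a minimal sketch of the pattern under stated assumptions: `ModelCacheDemo`, `loadCount`, and the `url` helper are hypothetical stand-ins, and the counter replaces the real (expensive) model load. It is not the actual GeoParser code. Repeated calls with the same URL skip the reload, while a changed URL triggers one, which is the two-path scenario raised earlier in the thread.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class ModelCacheDemo {
    private URL modelUrl;   // URL the current state was built from
    private int loadCount;  // counts simulated model loads

    /** Reloads only when the requested URL differs from the one already loaded. */
    public void initialize(URL requested) {
        if (this.modelUrl != null && this.modelUrl.equals(requested)) {
            return; // previously initialized for the same URL
        }
        this.modelUrl = requested;
        loadCount++; // stand-in for the expensive model load
    }

    public int getLoadCount() {
        return loadCount;
    }

    public URL getModelUrl() {
        return modelUrl;
    }

    /** Illustration helper: builds a URL and converts the checked exception. */
    public static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ModelCacheDemo demo = new ModelCacheDemo();
        demo.initialize(url("file:/usr/bin"));
        demo.initialize(url("file:/usr/bin"));       // same URL, no reload
        demo.initialize(url("file:/usr/local/bin")); // different URL, reload
        System.out.println("loads: " + demo.getLoadCount());
    }
}
```

Note that the guard only detects a changed URL; it would not notice a file being replaced at the same URL, which is the dynamic-environment case mentioned above. Also, `URL.equals` may resolve host names for network URLs, so comparing `URI` values or `toExternalForm()` strings is a cheaper alternative when the URLs are not local files.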