[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-11 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000462#comment-15000462
 ] 

Nick Burch commented on TIKA-1791:
--

Thanks for the explanation

Next question: what happens if two calls to {{GeoParser}} use different NER 
paths? For example:
{code}
GeoParser parser = new GeoParser();

ParseContext context = new ParseContext();
GeoParserConfig config = new GeoParserConfig();
context.set(GeoParserConfig.class, config);

config.setNERModelPath("/usr/bin");
parser.parse(inputA, metadata, handler, context);

config.setNERModelPath("/usr/local/bin");
parser.parse(inputB, metadata, handler, context);
{code}

Same parser each time, but different paths on the config. At first glance, it 
looks like your code would cause the second parse to use the config from the 
first?

> URI is not hierarchical exception when location model resource is inside a 
> jar in classpath
> ---
>
> Key: TIKA-1791
> URL: https://issues.apache.org/jira/browse/TIKA-1791
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment: location model file is placed inside a fat Jar (with all the dependencies)
> Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when location NER model resource is packaged 
> inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>   at java.io.File.<init>(File.java:418)
>   at org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>   at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>   at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>   at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical
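
For readers unfamiliar with the error in the trace above: {{java.io.File}} rejects any opaque URI, and a {{jar:file:...!/...}} URI is opaque. The following is a minimal standalone sketch, not Tika code; the class name {{JarResourceDemo}} and the paths are made up for illustration. It shows the failure and one common alternative, reading the resource through its URL instead of converting it to a {{File}}.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;

public class JarResourceDemo {

    /** Returns the exception message from new File(uri), or null if it succeeded. */
    static String tryAsFile(URI uri) {
        try {
            new File(uri); // only hierarchical file: URIs are accepted
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    /** Reads a resource through its URL; works for file: and jar: URLs alike. */
    static int countBytes(URL resource) throws IOException {
        try (InputStream in = resource.openStream()) {
            int n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) {
        // Hypothetical location of a model packaged inside a fat jar.
        URI inJar = URI.create("jar:file:/tmp/app.jar!/model.bin");
        System.out.println(tryAsFile(inJar));   // the message from the trace above

        // The same model as a plain file on disk converts without complaint.
        URI onDisk = URI.create("file:/tmp/model.bin");
        System.out.println(tryAsFile(onDisk));  // null: no exception
    }
}
```

In a real parser the URL would typically come from {{Class.getResource(...)}}, which returns a {{jar:}} URL when the resource is packaged in a jar, so passing the URL (or its stream) around avoids the {{File}} conversion entirely.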



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2015-11-11 Thread Vjeran Marcinko (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001773#comment-15001773
 ] 

Vjeran Marcinko commented on TIKA-1788:
---

I don't know the James library at all, so I cannot say whether this would 
negatively affect some other part of the parser, but...

The thing is that Tika's current RFC822Parser indirectly ends up with James' 
BasicBodyDescriptor instead of MaximalBodyDescriptor, because of the way 
RFC822Parser instantiates James' MimeStreamParser internally. Suppose the 
instantiation specified a DefaultBodyDescriptorBuilder instead:
{code}
MimeStreamParser parser = new MimeStreamParser(config, null, new DefaultBodyDescriptorBuilder());
{code}
That way, during James' parsing, a MaximalBodyDescriptor would be created, 
which recognizes the Content-Disposition field. It could then be used in Tika's 
MailContentHandler, say in the body(...) method, if we add:
{code}
public void body(BodyDescriptor body, InputStream is) throws MimeException,
        IOException {
    // use a different metadata object
    // in order to specify the mime type of the
    // sub part without damaging the main metadata

    Metadata submd = new Metadata();
    submd.set(Metadata.CONTENT_TYPE, body.getMimeType());
    submd.set(Metadata.CONTENT_ENCODING, body.getCharset());

    if (body instanceof MaximalBodyDescriptor) {
        MaximalBodyDescriptor maximalBodyDescriptor = (MaximalBodyDescriptor) body;
        String contentDispositionFilename =
                maximalBodyDescriptor.getContentDispositionFilename();
        if (contentDispositionFilename != null) {
            submd.set(Metadata.RESOURCE_NAME_KEY, contentDispositionFilename);
        }
    }
    ...
{code}

> message/rfc822 parser doesn't identify attachment filenames from 
> Content-Disposition header
> ---
>
> Key: TIKA-1788
> URL: https://issues.apache.org/jira/browse/TIKA-1788
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11
> Reporter: Sergey Tsalkov
> Attachments: grep_content_disposition.zip
>
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
> filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!
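
The {{filename*=utf-8''image001.jpg}} form in the header quoted above is the extended parameter syntax from RFC 2231: a charset, an optional language, and a percent-encoded value, separated by single quotes. As a rough illustration of what a parser has to do with it, here is a small standalone sketch using only the JDK. The class name {{ExtendedParam}} is hypothetical, and this is not how mime4j or Tika implement the decoding; it also ignores RFC 2231 continuations ({{filename*0*=}}, {{filename*1*=}}, ...).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExtendedParam {

    /** Decodes an RFC 2231 extended value such as utf-8''image001.jpg. */
    static String decode(String ext) {
        int first = ext.indexOf('\'');
        int second = ext.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException(
                    "expected charset'language'value but got: " + ext);
        }
        String charsetName = ext.substring(0, first);
        // Assumption for this sketch: fall back to US-ASCII when no charset is given.
        Charset charset = charsetName.isEmpty()
                ? StandardCharsets.US_ASCII
                : Charset.forName(charsetName);

        String encoded = ext.substring(second + 1);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // %XX is one raw byte in the declared charset
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write((byte) c); // unencoded characters are plain ASCII
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        System.out.println(decode("utf-8''image001.jpg")); // prints image001.jpg
    }
}
```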





[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-11 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000975#comment-15000975
 ] 

Thamme Gowda N commented on TIKA-1791:
--

Thanks for pointing out the issue.
I hadn't anticipated the configuration changing after the parser has started 
running. 

It's now handled in `initialize()`:
{code}
if (this.modelUrl != null && this.modelUrl.equals(modelUrl)) {
    // previously initialized for the same URL
    return;
}
{code}

If Tika's environments are that dynamic (for example, the files the URLs point 
to are frequently updated or deleted), then state probably shouldn't be kept 
in the parser. However, as you can see, that is a trade-off against 
performance. If that is the case, I can revert to the older approach.
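
To make the trade-off concrete, here is a minimal standalone sketch of the "reload only when the URL changes" idea. It is not the actual GeoParser code: the class {{ModelCache}}, the helper {{url(...)}}, and the load counter are made up for illustration, and the real model loading is replaced by a placeholder. The sketch compares the URLs' string forms rather than calling {{URL.equals}}, since {{URL.equals}} may resolve host names and block.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class ModelCache {
    private String loadedFrom; // external form of the URL last loaded, or null
    private int loadCount;     // how many times the (expensive) load actually ran

    /** Loads the model only when the URL differs from the previous call. */
    public synchronized void initialize(URL modelUrl) {
        String key = modelUrl.toExternalForm();
        if (key.equals(loadedFrom)) {
            return; // previously initialized for the same URL
        }
        // Placeholder for the expensive step (opening the stream, building the model).
        loadCount++;
        loadedFrom = key;
    }

    public synchronized int getLoadCount() {
        return loadCount;
    }

    /** Convenience helper for the example: builds a URL without a checked exception. */
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ModelCache cache = new ModelCache();
        cache.initialize(url("file:/usr/bin/model.bin"));       // loads
        cache.initialize(url("file:/usr/bin/model.bin"));       // skipped, same URL
        cache.initialize(url("file:/usr/local/bin/model.bin")); // loads again
        System.out.println(cache.getLoadCount()); // prints 2
    }
}
```

Note that a check like this only notices a different URL. If the file behind an unchanged URL is replaced, the stale model stays in memory, which is the dynamic-environment case mentioned above.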





