[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-10 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999085#comment-14999085
 ] 

Thamme Gowda N commented on TIKA-1791:
--

Thanks for the feedback. 

* The non-hierarchical URI error is fixed by using a URL instead of a URI and 
path string. (I learned that we can have a URL to a file inside a ZIP archive, 
but not a hierarchical URI that java.io.File accepts.)
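
To illustrate the point above, here is a minimal sketch (not the actual patch) of why a resource inside a jar cannot be opened through `java.io.File` but can be read through its URL. The class name `JarResourceDemo`, the entry name `model.bin` and the temporary jar are assumptions made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical demo class; none of these names come from Tika.
class JarResourceDemo {

    /** Builds a throwaway jar with one entry and returns a jar: URL pointing at that entry. */
    static URL resourceInsideTempJar(String entryName, String content) {
        try {
            Path jar = Files.createTempFile("demo", ".jar");
            jar.toFile().deleteOnExit();
            try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
                out.putNextEntry(new JarEntry(entryName));
                out.write(content.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
            // Same shape as the URLs returned by Class.getResource() for jar contents.
            return URI.create("jar:" + jar.toUri() + "!/" + entryName).toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The failing approach: File(URI) rejects the opaque jar: URI. */
    static String openViaFile(URL url) {
        try {
            return new File(url.toURI()).getPath();
        } catch (IllegalArgumentException | URISyntaxException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** The working approach: stream the resource straight from the URL. */
    static String readViaUrl(URL url) {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL url = resourceInsideTempJar("model.bin", "ner-model");
        System.out.println(openViaFile(url));
        System.out.println(readViaUrl(url));
    }
}
```

Reading through `URL.openStream()` works the same way whether the resource sits on the file system or inside a jar, which is presumably why switching to a URL avoids the exception.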

While I modified the NER model loading code to make the above change possible, 
I also made these changes:

* The NER model was previously reloaded on every `parse()` call. It is now 
loaded once and reused via a state variable.
* The `isAvailable()` function previously launched an external process on 
every call to figure out whether the 'lucene-geo-gazetteer' command is 
available (it is invoked in `parse()`). It now uses a state variable instead.
* The model is loaded lazily, on the first call to `parse()` or 
`isAvailable()`. My tests showed that this is backward compatible. 
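
The lazy initialization described in the list above could look roughly like the following sketch. It is not the code from the patch: `LazyModel`, its `Supplier`-based loader and the counter in `LazyModelDemo` are hypothetical names used only to show the "load once, keep in a state variable" idea.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder: loads an expensive value on first use and then reuses it.
class LazyModel<T> {
    private final Supplier<T> loader;
    private T model;        // state variable, unset until the first call
    private boolean loaded; // remembers that loading was attempted, even if it produced null

    LazyModel(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the model, loading it only on the first call. */
    synchronized T get() {
        if (!loaded) {
            model = loader.get();
            loaded = true;
        }
        return model;
    }

    /** Availability check that reuses the cached result instead of re-probing. */
    synchronized boolean isAvailable() {
        return get() != null;
    }
}

class LazyModelDemo {
    /** Counts how often the loader actually runs across several calls. */
    static int loadsAfterThreeCalls() {
        AtomicInteger loads = new AtomicInteger();
        LazyModel<String> model = new LazyModel<>(() -> {
            loads.incrementAndGet(); // stands in for an expensive model load
            return "model";
        });
        model.isAvailable();
        model.get();
        model.get();
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loader invocations: " + loadsAfterThreeCalls());
    }
}
```

Because the first call to either method triggers the load, callers that never invoke an explicit initialization step still get a working object.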

UPDATE: 
The test case is now unaltered. I was just trying to see whether the test cases 
pass a different parse context. The lazy initialization of the name extractor 
is guaranteed to work and thus shouldn't break existing usages. 
{code}GeoParserConfig.setNERModelPath(String){code} is also preserved for 
users who already use it to supply the model path. However, 
{code}GeoParserConfig.getNERPath(){code} has been replaced with a URL getter.


> URI is not hierarchical exception when location model resource is inside a 
> jar in classpath
> ---
>
> Key: TIKA-1791
> URL: https://issues.apache.org/jira/browse/TIKA-1791
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: location model  file is placed inside a fat Jar (with 
> all the dependencies)
>Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when location NER model resource is packaged 
> inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>   at java.io.File.<init>(File.java:418)
>   at 
> org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>   at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>   at 
> org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>   at edu.usc.cs.ir.cwork.tika.Parser.(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998898#comment-14998898
 ] 

Nick Burch commented on TIKA-1792:
--

Requiring it to be first + uncompressed does indeed change things! I've updated 
the priority and comments in r1713697.

(Quite a few zip-based formats do have mime magic for the "best case", but they 
need the zip container detector to work in the general case)

> Add ASiC-E and ASiC-S mime types
> 
>
> Key: TIKA-1792
> URL: https://issues.apache.org/jira/browse/TIKA-1792
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.11
>Reporter: Roberto Benedetti
>Priority: Minor
> Fix For: 1.12
>
> Attachments: report-7.asics, report-8.asice
>
>
> These are the references:
> * 
> [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-e+zip]
> * 
> [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-s+zip]
> My {{custom-mimetypes.xml}} is:
> {code:xml}
> 
> 
>   
> ASiC-E
> <_comment>Extended Associated Signature Container
> 
>   
>  offset="30" />
>   
> 
> 
>   
>   
> ASiC-S
> <_comment>Simple Associated Signature Container
> 
>   
>  offset="30" />
>   
> 
> 
>   
> 
> {code}





[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998886#comment-14998886
 ] 

Hudson commented on TIKA-1792:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #882 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/882/])
TIKA-1792 ASiC E and S mimetypes, detection and tests. Files and mimetype from 
Roberto Benedetti (nick: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1713677])
* trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* 
trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* 
trunk/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* trunk/tika-parsers/src/test/resources/test-documents/testASiCE.asice
* trunk/tika-parsers/src/test/resources/test-documents/testASiCS.asics




[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Roberto Benedetti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998849#comment-14998849
 ] 

Roberto Benedetti commented on TIKA-1792:
-

Oops... You did it before I could reply ;-)



[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Roberto Benedetti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998845#comment-14998845
 ] 

Roberto Benedetti commented on TIKA-1792:
-

I see, but tika-mimetypes.xml already contains such entries for EPUB and iBooks.
Furthermore, [Annex 
A|http://www.etsi.org/deliver/etsi_ts/102900_102999/102918/01.03.01_60/ts_102918v010301p.pdf]
 of the specification requires the mimetype entry to be _the first_ and an 
_uncompressed_ entry, precisely so that mime magic can be used.
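
As a rough illustration of why "first and uncompressed" makes magic at offset 30 work: a ZIP local file header is 30 bytes long and is followed by the entry name and then, for a stored entry, the raw content. The sketch below builds such an archive in memory and checks the bytes at offset 30. `ZipMagicDemo` and the dummy `doc.txt` entry are made up for this example, and the check is a simplification, not Tika's magic matcher.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of Tika.
class ZipMagicDemo {
    static final String ASIC_E = "application/vnd.etsi.asic-e+zip";

    /** Adds an uncompressed (STORED) entry; size and CRC must be known up front. */
    static void putStored(ZipOutputStream zip, String name, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);
        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);
        entry.setCrc(crc.getValue());
        zip.putNextEntry(entry);
        zip.write(data);
        zip.closeEntry();
    }

    /** Builds a container whose first entry is an uncompressed "mimetype" entry. */
    static byte[] buildContainer() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            putStored(zip, "mimetype", ASIC_E.getBytes(StandardCharsets.US_ASCII));
            putStored(zip, "doc.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Checks for the entry name followed directly by the media type at offset 30. */
    static boolean magicMatches(byte[] zip) {
        byte[] expected = ("mimetype" + ASIC_E).getBytes(StandardCharsets.US_ASCII);
        if (zip.length < 30 + expected.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(zip, 30, 30 + expected.length), expected);
    }

    public static void main(String[] args) {
        System.out.println("magic at offset 30 matches: " + magicMatches(buildContainer()));
    }
}
```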



[jira] [Resolved] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1792.
--
   Resolution: Fixed
Fix Version/s: (was: 2.0)



[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998841#comment-14998841
 ] 

Nick Burch commented on TIKA-1792:
--

Luckily these files use the same mimetype storage convention as ODF does, so 
our existing Zip-aware container detector was able to handle them as-is

In r1713677, I've added your test files (thanks!), slightly tweaked mimetype 
entries, and unit tests



[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998818#comment-14998818
 ] 

Nick Burch commented on TIKA-1792:
--

I don't think we can use the mime magic you have supplied, as there is no 
guarantee that the mimetype entry will be the first one in the zip. If I 
re-order the files inside your samples, then the magic stops working. Sadly, 
the only way to correctly detect container-based files such as these is what we 
do for OOXML, iWorks, ODF and friends, which is with a Zip-specific 
container-aware detector
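
The reordering problem can be sketched as follows. This is only an illustration of the idea, not Tika's `ZipContainerDetector`: `ContainerDetectDemo` and its helper methods are hypothetical. It builds a ZIP in which `mimetype` is not the first entry, shows that a check at offset 30 no longer sees it, and then finds the entry by walking the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of Tika.
class ContainerDetectDemo {

    /** Builds an in-memory ZIP from name/content pairs, in the order given. */
    static byte[] zipOf(String... namesAndContents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.US_ASCII));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Magic-style check: is the entry name at offset 30 equal to "mimetype"? */
    static boolean startsWithMimetypeEntry(byte[] zip) {
        byte[] name = "mimetype".getBytes(StandardCharsets.US_ASCII);
        return zip.length >= 30 + name.length
                && Arrays.equals(Arrays.copyOfRange(zip, 30, 30 + name.length), name);
    }

    /** Container-aware check: walk all entries and read "mimetype" wherever it is. */
    static String mimetypeEntry(byte[] zip) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            for (ZipEntry entry; (entry = in.getNextEntry()) != null; ) {
                if ("mimetype".equals(entry.getName())) {
                    return new String(in.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
            }
            return null; // no mimetype entry present
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] reordered = zipOf("doc.txt", "hello",
                "mimetype", "application/vnd.etsi.asic-s+zip");
        System.out.println("offset-30 check: " + startsWithMimetypeEntry(reordered));
        System.out.println("entry scan: " + mimetypeEntry(reordered));
    }
}
```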



[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998814#comment-14998814
 ] 

Nick Burch commented on TIKA-1791:
--

There seem to be quite a few changes in the patch, not just a simple String to 
URL swap. Would you be able to explain a bit more about why you needed to make 
the additional changes, and why you took the approach you did to refactor 
things for the change?

I'm also a little worried about the {{geoparser.initialize(context);}} lines in 
the test - does that mean the parser stops working for people who don't add 
this additional step? If so, it's a no-go as most people will probably be using 
it via one of the facades like {{AutoDetectParser}} or {{DefaultParser}} so 
won't know to do things like that. 



[jira] [Updated] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Roberto Benedetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roberto Benedetti updated TIKA-1792:

Attachment: report-8.asice
report-7.asics

test files



[jira] [Created] (TIKA-1792) Add ASiC-E and ASiC-S mime types

2015-11-10 Thread Roberto Benedetti (JIRA)
Roberto Benedetti created TIKA-1792:
---

 Summary: Add ASiC-E and ASiC-S mime types
 Key: TIKA-1792
 URL: https://issues.apache.org/jira/browse/TIKA-1792
 Project: Tika
  Issue Type: Improvement
  Components: core
Affects Versions: 1.11
Reporter: Roberto Benedetti
Priority: Minor
 Fix For: 2.0, 1.12


These are the references:
* [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-e+zip]
* [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-s+zip]

My {{custom-mimetypes.xml}} is:
{code:xml}


  
ASiC-E
<_comment>Extended Associated Signature Container

  

  


  

  
ASiC-S
<_comment>Simple Associated Signature Container

  

  


  

{code}


