[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath
[ https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999085#comment-14999085 ]

Thamme Gowda N commented on TIKA-1791:
--------------------------------------

Thanks for the feedback.

* The non-hierarchical URI error is fixed by using a URL instead of a URI and path string. (I learned that a URL can point to a file inside a ZIP archive, but the corresponding URI is not hierarchical, so it cannot be turned into a `File`.)

While modifying the NER model loading code to make the above change possible, I also made these changes:

* The NER model was previously reloaded for every `parse()` call. It is now loaded once and reused via a state variable.
* The `isAvailable()` function previously launched an external process on every call to figure out whether the 'lucene-geo-gazeteer' command is available (it is invoked in `parse()`). It now uses a state variable.
* The model is loaded on the first call to `parse()` or `isAvailable()`, via lazy initialization.

My tests showed that it is backward compatible.

UPDATE: The test case is now unaltered. I was only checking whether the test cases pass a different parse context. The lazy initialization of the name extractor is guaranteed to work and thus shouldn't break existing usages.

The {code} GeoParserConfig.setNERModelPath(String) {code} is also preserved for users who already use it to supply the model path. However, {code}GeoParserConfig.getNERPath() {code} is replaced with a URL getter.

> URI is not hierarchical exception when location model resource is inside a jar in classpath
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1791
>                 URL: https://issues.apache.org/jira/browse/TIKA-1791
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: location model file is placed inside a fat Jar (with all the dependencies)
>            Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when location NER model resource is packaged inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>     at java.io.File.<init>(File.java:418)
>     at org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>     at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>     at java.lang.Class.newInstance(Class.java:442)
>     at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>     at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>     at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
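The failure in the stack trace, and the URL-based way around it, can be illustrated with a minimal Java sketch. This is not the Tika patch: the class `JarResourceSketch` and its helper methods are hypothetical names, and the jar is a throwaway temp file created only so that a `jar:` URL exists. The sketch shows `new File(URI)` rejecting the opaque `jar:` URI while `URL.openStream()` reads the same resource.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; none of these names come from Tika.
class JarResourceSketch {

    /** Writes a throwaway jar with one entry and returns a jar: URL pointing at that entry. */
    static URL resourceInsideJar(String entryName, byte[] content) {
        try {
            Path jar = Files.createTempFile("sketch", ".jar");
            jar.toFile().deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
                out.putNextEntry(new ZipEntry(entryName));
                out.write(content);
                out.closeEntry();
            }
            // Same shape as the URL a class loader returns for a resource packed in a jar.
            return URI.create("jar:" + jar.toUri() + "!/" + entryName).toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the message of the exception thrown by new File(URI), or null if none is thrown. */
    static String fileConstructorFailure(URL resource) {
        try {
            new File(resource.toURI());
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the resource through the URL, which works for file: and jar: URLs alike. */
    static byte[] readViaUrl(URL resource) {
        try (InputStream in = resource.openStream()) {
            return in.readAllBytes(); // requires Java 9 or later
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL model = resourceInsideJar("model.bin", new byte[] {1, 2, 3});
        System.out.println("new File(URI) fails with: " + fileConstructorFailure(model));
        System.out.println("bytes read via URL: " + readViaUrl(model).length);
    }
}
```

Reading through the URL avoids any assumption that the resource is a plain file on disk, which is what breaks once everything is packed into a fat jar.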
[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998898#comment-14998898 ]

Nick Burch commented on TIKA-1792:
----------------------------------

Requiring it to be first + uncompressed does indeed change things! I've updated the priority and comments in r1713697.

(Quite a few zip-based formats do have mime magic for the "best case", but they need the zip container detector to work in the general case.)

> Add ASiC-E and ASiC-S mime types
> --------------------------------
>
>                 Key: TIKA-1792
>                 URL: https://issues.apache.org/jira/browse/TIKA-1792
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.11
>            Reporter: Roberto Benedetti
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: report-7.asics, report-8.asice
>
> These are the references:
> * [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-e+zip]
> * [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-s+zip]
> My {{custom-mimetypes.xml}} is:
> {code:xml}
> ASiC-E
> <_comment>Extended Associated Signature Container
> offset="30" />
> ASiC-S
> <_comment>Simple Associated Signature Container
> offset="30" />
> {code}
[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998886#comment-14998886 ]

Hudson commented on TIKA-1792:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #882 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/882/])
TIKA-1792 ASiC E and S mimetypes, detection and tests. Files and mimetype from Roberto Benedetti (nick: [http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1713677])
* trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* trunk/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* trunk/tika-parsers/src/test/resources/test-documents/testASiCE.asice
* trunk/tika-parsers/src/test/resources/test-documents/testASiCS.asics
[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998849#comment-14998849 ]

Roberto Benedetti commented on TIKA-1792:
-----------------------------------------

Oops... You did it before I could reply ;-)
[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998845#comment-14998845 ]

Roberto Benedetti commented on TIKA-1792:
-----------------------------------------

I see, but tika-mimetypes.xml already contains such entries for EPUB and iBooks.

Furthermore, [Annex A|http://www.etsi.org/deliver/etsi_ts/102900_102999/102918/01.03.01_60/ts_102918v010301p.pdf] of the specification requires the mimetype entry to be _the first_ and _uncompressed_ entry, precisely so that mime magic can be used.
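Why a first, uncompressed `mimetype` entry makes byte-offset magic workable can be sketched in Java. The fixed part of a ZIP local file header is 30 bytes, so the first entry's name starts at offset 30 and, when the entry is stored and carries no extra field, its raw content follows immediately. The class `AsicMagicSketch` and its methods are hypothetical, the second entry `document.txt` is made up, and the check only approximates what a magic rule at `offset="30"` presumably does; it is not Tika's detection code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Tika.
class AsicMagicSketch {

    /** Builds a small zip with a stored "mimetype" entry either first or second. */
    static byte[] buildContainer(String mimeType, boolean mimetypeFirst) {
        byte[] mime = mimeType.getBytes(StandardCharsets.US_ASCII);
        byte[] other = "hello".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            if (mimetypeFirst) {
                addStored(zip, "mimetype", mime);
                addStored(zip, "document.txt", other);
            } else {
                addStored(zip, "document.txt", other);
                addStored(zip, "mimetype", mime);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Adds an uncompressed (STORED) entry; size and CRC must be set up front for this method. */
    private static void addStored(ZipOutputStream zip, String name, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);
        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);
        entry.setCrc(crc.getValue());
        zip.putNextEntry(entry);
        zip.write(data);
        zip.closeEntry();
    }

    /** True if the bytes at offset 30 are the entry name "mimetype" followed by the mime type. */
    static boolean magicMatches(byte[] zipBytes, String mimeType) {
        byte[] expected = ("mimetype" + mimeType).getBytes(StandardCharsets.US_ASCII);
        int start = 30; // length of the fixed part of a ZIP local file header
        if (zipBytes.length < start + expected.length) {
            return false;
        }
        // Assumes the writer emitted no extra field between the name and the data.
        return Arrays.equals(Arrays.copyOfRange(zipBytes, start, start + expected.length), expected);
    }

    public static void main(String[] args) {
        String type = "application/vnd.etsi.asic-e+zip";
        System.out.println("mimetype first:  " + magicMatches(buildContainer(type, true), type));
        System.out.println("mimetype second: " + magicMatches(buildContainer(type, false), type));
    }
}
```

The second call shows the limitation: as soon as another entry comes first, the bytes at offset 30 belong to that entry and the magic no longer matches.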
[jira] [Resolved] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1792.
------------------------------
       Resolution: Fixed
    Fix Version/s: (was: 2.0)
[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998841#comment-14998841 ]

Nick Burch commented on TIKA-1792:
----------------------------------

Luckily these files use the same mimetype storage convention as ODF does, so our existing Zip-aware container detector was able to handle them as-is.

In r1713677, I've added your test files (thanks!), slightly tweaked mimetype entries, and unit tests.
[jira] [Commented] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998818#comment-14998818 ]

Nick Burch commented on TIKA-1792:
----------------------------------

I don't think we can use the mime magic you have supplied, as there is no guarantee that the mimetype entry will be the first one in the zip. If I re-order the files inside your samples, then the magic stops working.

Sadly, the only way to correctly detect container-based files such as these is what we do for OOXML, iWorks, ODF and friends, which is with a Zip-specific container-aware detector.
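The container-aware approach mentioned here can be sketched as follows: instead of looking at fixed byte offsets, walk the zip entries and read the one named `mimetype`, wherever it sits. This is only an illustration of the idea under stated assumptions, not Tika's `ZipContainerDetector`; the class `ContainerLookupSketch`, its methods, and the sample entry names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not Tika's detector.
class ContainerLookupSketch {

    /** Builds a zip from {name, content} pairs in the given order, using default compression. */
    static byte[] zipOf(String[][] entries) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String[] entry : entries) {
                zip.putNextEntry(new ZipEntry(entry[0]));
                zip.write(entry[1].getBytes(StandardCharsets.US_ASCII));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns the content of the "mimetype" entry, or null if the zip has no such entry. */
    static String declaredMimeType(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("mimetype".equals(entry.getName())) {
                    // Reads only the current entry; works whether it is stored or deflated.
                    return new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] reordered = zipOf(new String[][] {
            {"META-INF/manifest.xml", "<manifest/>"},
            {"mimetype", "application/vnd.etsi.asic-s+zip"},
        });
        System.out.println(declaredMimeType(reordered));
    }
}
```

The trade-off is cost: the lookup has to parse the archive rather than compare a few leading bytes, which is why offset-based magic is still useful for the common case.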
[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath
[ https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998814#comment-14998814 ]

Nick Burch commented on TIKA-1791:
----------------------------------

There seem to be quite a few changes in the patch, not just a simple String to URL swap. Would you be able to explain a bit more about why you needed to make the additional changes you did, and why you took the approach you did to refactor things for the change?

I'm also a little worried about the {{geoparser.initialize(context);}} lines in the test - does that mean the parser stops working for people who don't add this additional step? If so, it's a no-go, as most people will probably be using it via one of the facades like {{AutoDetectParser}} or {{DefaultParser}}, so they won't know to do things like that.
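One way to address the concern about an extra initialization step is lazy initialization: the expensive work happens on the first call that needs it, so callers going through a facade never have to invoke anything explicitly. The Java sketch below illustrates that pattern only. `LazyInitSketch`, `Lazy`, and `SketchParser` are hypothetical names, the "model" is a placeholder string, and the availability check is a stub standing in for an external process call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration class; not the Tika GeoParser.
class LazyInitSketch {

    /** Runs the loader on first use and hands back the cached result afterwards. */
    static final class Lazy<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean initialized;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        synchronized T get() {
            if (!initialized) {
                value = loader.get();
                initialized = true;
            }
            return value;
        }
    }

    /** Stand-in parser: no explicit initialize step is exposed to callers. */
    static final class SketchParser {
        final AtomicInteger modelLoads = new AtomicInteger();
        final AtomicInteger availabilityChecks = new AtomicInteger();

        // Placeholder for loading a model from a URL.
        private final Lazy<String> model = new Lazy<>(() -> {
            modelLoads.incrementAndGet();
            return "model";
        });

        // Placeholder for probing an external command once instead of on every call.
        private final Lazy<Boolean> available = new Lazy<>(() -> {
            availabilityChecks.incrementAndGet();
            return Boolean.TRUE;
        });

        boolean isAvailable() {
            return available.get();
        }

        String parse(String text) {
            if (!isAvailable()) {
                return "";
            }
            return model.get() + ":" + text;
        }
    }

    public static void main(String[] args) {
        SketchParser parser = new SketchParser();
        parser.parse("first");
        parser.parse("second");
        System.out.println("model loads: " + parser.modelLoads.get());
        System.out.println("availability checks: " + parser.availabilityChecks.get());
    }
}
```

The `synchronized` accessor keeps the sketch safe if several threads call `parse()` at once; whether the real parser needs that guarantee is an assumption here.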
[jira] [Updated] (TIKA-1792) Add ASiC-E and ASiC-S mime types
[ https://issues.apache.org/jira/browse/TIKA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roberto Benedetti updated TIKA-1792:
------------------------------------
    Attachment: report-8.asice
                report-7.asics

test files
[jira] [Created] (TIKA-1792) Add ASiC-E and ASiC-S mime types
Roberto Benedetti created TIKA-1792:
---------------------------------------

             Summary: Add ASiC-E and ASiC-S mime types
                 Key: TIKA-1792
                 URL: https://issues.apache.org/jira/browse/TIKA-1792
             Project: Tika
          Issue Type: Improvement
          Components: core
    Affects Versions: 1.11
            Reporter: Roberto Benedetti
            Priority: Minor
             Fix For: 2.0, 1.12

These are the references:
* [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-e+zip]
* [http://www.iana.org/assignments/media-types/application/vnd.etsi.asic-s+zip]

My {{custom-mimetypes.xml}} is:
{code:xml}
ASiC-E
<_comment>Extended Associated Signature Container
ASiC-S
<_comment>Simple Associated Signature Container
{code}