[jira] [Commented] (TIKA-2944) TikaConfig should support the parameters without XML type attribute
[ https://issues.apache.org/jira/browse/TIKA-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936074#comment-16936074 ] Sergey Beryozkin commented on TIKA-2944: {{Param}} class supporting a {{boolean}} (XMLSchema type) in addition to {{bool}} should help in the meantime > TikaConfig should support the parameters without XML type attribute > --- > > Key: TIKA-2944 > URL: https://issues.apache.org/jira/browse/TIKA-2944 > Project: Tika > Issue Type: Improvement > Components: config >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > Fix For: 2.0.0, 1.23 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
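A minimal sketch of the idea in the comment above, assuming a lookup table keyed by the type name found in the config. The class {{ParamTypeSketch}}, its {{convert}} method and the set of type names are hypothetical and are not Tika's actual {{Param}} implementation; the only point illustrated is that {{boolean}} can be registered as an alias of {{bool}}.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not the real org.apache.tika.config.Param.
final class ParamTypeSketch {

    // Maps the type name used in the XML config to a Java class.
    private static final Map<String, Class<?>> TYPES = new HashMap<>();

    static {
        TYPES.put("bool", Boolean.class);
        TYPES.put("boolean", Boolean.class); // XML Schema spelling, treated as an alias
        TYPES.put("int", Integer.class);
        TYPES.put("string", String.class);
    }

    // Converts the raw text value of a parameter according to its declared type name.
    static Object convert(String typeName, String rawValue) {
        Class<?> target = TYPES.get(typeName);
        if (target == null) {
            throw new IllegalArgumentException("Unsupported type: " + typeName);
        }
        if (target == Boolean.class) {
            return Boolean.valueOf(rawValue);
        }
        if (target == Integer.class) {
            return Integer.valueOf(rawValue);
        }
        return rawValue;
    }
}
```

With such a table, {{convert("bool", "true")}} and {{convert("boolean", "true")}} give the same result, so existing configs keep working.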
[jira] [Created] (TIKA-2946) Review how TikaConfig can avoid parsing XML itself
Sergey Beryozkin created TIKA-2946: -- Summary: Review how TikaConfig can avoid parsing XML itself Key: TIKA-2946 URL: https://issues.apache.org/jira/browse/TIKA-2946 Project: Tika Issue Type: Improvement Components: config Reporter: Sergey Beryozkin Fix For: 2.0 I currently have some issues with initializing the {{TikaConfig}} at Quarkus build time. The reason I'd like to do it is to avoid having the XML classes loaded into memory when the application starts. Moving the XML parsing code out (perhaps into a static TikaConfig factory method) and a few other minor tweaks will help. I'll try to provide more input as 2.0 gets closer -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (TIKA-2945) AutoDetectParser should skip the content type detection if Metadata already has it
[ https://issues.apache.org/jira/browse/TIKA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-2945: --- Summary: AutoDetectParser should skip the content type detection if Metadata already has it (was: AutoDetectParser should skip the conetnt type detection if Metadata already has it) > AutoDetectParser should skip the content type detection if Metadata already > has it > -- > > Key: TIKA-2945 > URL: https://issues.apache.org/jira/browse/TIKA-2945 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Minor > Fix For: 2.0.0, 1.23 > > -- This message was sent by Atlassian Jira (v8.3.2#803003)
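A minimal sketch of the behaviour the summary above asks for, assuming the metadata can be modelled as a plain map and the detector as a supplier. {{DetectionSkipSketch}} and {{resolveType}} are hypothetical names and this is not the actual {{AutoDetectParser}} code; it only shows the detector being bypassed when a content type is already present.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only: a plain Map stands in for Tika's Metadata.
final class DetectionSkipSketch {

    // Assumed metadata key for the content type.
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the content type already present in the metadata,
    // or runs the (potentially expensive) detector and records its result.
    static String resolveType(Map<String, String> metadata, Supplier<String> detector) {
        String existing = metadata.get(CONTENT_TYPE);
        if (existing != null && !existing.isEmpty()) {
            return existing; // detection skipped
        }
        String detected = detector.get();
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }
}
```

Whether a caller-supplied type should always be trusted is a design question the issue leaves open; the sketch simply trusts any non-empty value.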
[jira] [Created] (TIKA-2943) Modularize tika-parsers
Sergey Beryozkin created TIKA-2943: -- Summary: Modularize tika-parsers Key: TIKA-2943 URL: https://issues.apache.org/jira/browse/TIKA-2943 Project: Tika Issue Type: Improvement Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Fix For: 2.0.0 This effort will be based on the work done by Bob at the [2.x branch|https://github.com/apache/tika/tree/2.x/tika-parser-modules] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (TIKA-2944) TikaConfig should support the parameters without XML type attribute
[ https://issues.apache.org/jira/browse/TIKA-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-2944: --- Summary: TikaConfig should support the parameters without XML type attribute (was: TikaConfig should support the parameters with the XML type attribute) > TikaConfig should support the parameters without XML type attribute > --- > > Key: TIKA-2944 > URL: https://issues.apache.org/jira/browse/TIKA-2944 > Project: Tika > Issue Type: Improvement > Components: config >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > Fix For: 2.0.0, 1.23 > > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (TIKA-2945) AutoDetectParser should skip the conetnt type detection if Metadata already has it
Sergey Beryozkin created TIKA-2945: -- Summary: AutoDetectParser should skip the conetnt type detection if Metadata already has it Key: TIKA-2945 URL: https://issues.apache.org/jira/browse/TIKA-2945 Project: Tika Issue Type: Improvement Components: parser Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Fix For: 2.0.0, 1.23 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (TIKA-2944) TikaConfig should support the parameters with the XML type attribute
Sergey Beryozkin created TIKA-2944: -- Summary: TikaConfig should support the parameters with the XML type attribute Key: TIKA-2944 URL: https://issues.apache.org/jira/browse/TIKA-2944 Project: Tika Issue Type: Improvement Components: config Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Fix For: 2.0.0, 1.23 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929130#comment-16929130 ] Sergey Beryozkin commented on TIKA-2882: I'll create a dedicated issue so that I can link to it from the other sources, etc > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Assignee: Sergey Beryozkin >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909117#comment-16909117 ] Sergey Beryozkin commented on TIKA-2882: OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, my priority is to prepare to the Apache Con EU Tika talk well, but I'll give a try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Assignee: Sergey Beryozkin >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909117#comment-16909117 ] Sergey Beryozkin edited comment on TIKA-2882 at 8/16/19 3:11 PM: - OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, my priority is to prepare to the Apache Con EU Tika talk well, but I'll give it a try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks was (Author: sergey_beryozkin): OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, my priority is to prepare to the Apache Con EU Tika talk well, but I'll give a try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Assignee: Sergey Beryozkin >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin reassigned TIKA-2882: -- Assignee: Sergey Beryozkin > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Assignee: Sergey Beryozkin >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909101#comment-16909101 ] Sergey Beryozkin commented on TIKA-2882: Hi Tim, can we consider giving it a go? Bob agrees to focus on the modules only, so all we have to do to get it started is to create a few modules grouping the specific parsers, and have the existing tika-parsers incorporate those new modules. There should not even be any coding involved, unless I'm missing something. If you can create a quick PR, I can then test it with Quarkus, etc. What do you think? > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs
[ https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905636#comment-16905636 ] Sergey Beryozkin commented on TIKA-2910: Hi [~talli...@apache.org], IMHO it should be fixed in the 1.x branch as well, maybe with a property letting users enable or disable this fix at runtime > Text extraction using Tika command line and Tika server differs > --- > > Key: TIKA-2910 > URL: https://issues.apache.org/jira/browse/TIKA-2910 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.21 >Reporter: Walter >Priority: Major > Labels: newbie > Attachments: CorpusP_25471990.xml > > > When extracting TXT from the very same XML file using either Tika command > line utility or the Tika in server mode, the results differ. > It looks as if PCDATA in deeper nested XML structures are just ignored and > only an empty line is returned. > I assume both use the same base code. Are there any default settings that may > differ or can be set? > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs
[ https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902845#comment-16902845 ] Sergey Beryozkin commented on TIKA-2910: Thanks, Tim may be away so lets wait till he is back > Text extraction using Tika command line and Tika server differs > --- > > Key: TIKA-2910 > URL: https://issues.apache.org/jira/browse/TIKA-2910 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.21 >Reporter: Walter >Priority: Major > Labels: newbie > Attachments: CorpusP_25471990.xml > > > When extracting TXT from the very same XML file using either Tika command > line utility or the Tika in server mode, the results differ. > It looks as if PCDATA in deeper nested XML structures are just ignored and > only an empty line is returned. > I assume both use the same base code. Are there any default settings that may > differ or can be set? > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs
[ https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902836#comment-16902836 ] Sergey Beryozkin commented on TIKA-2910: Hi [~akit] Can you please download the source and debug ? It can help > Text extraction using Tika command line and Tika server differs > --- > > Key: TIKA-2910 > URL: https://issues.apache.org/jira/browse/TIKA-2910 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.21 >Reporter: Walter >Priority: Major > Labels: newbie > Attachments: CorpusP_25471990.xml > > > When extracting TXT from the very same XML file using either Tika command > line utility or the Tika in server mode, the results differ. > It looks as if PCDATA in deeper nested XML structures are just ignored and > only an empty line is returned. > I assume both use the same base code. Are there any default settings that may > differ or can be set? > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (TIKA-2896) NullPointerException in MimeTypesReader.releaseParser()
[ https://issues.apache.org/jira/browse/TIKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Beryozkin resolved TIKA-2896.
    Resolution: Fixed
    Fix Version/s: 1.22

Thanks for the patch

> NullPointerException in MimeTypesReader.releaseParser()
> ---
>
> Key: TIKA-2896
> URL: https://issues.apache.org/jira/browse/TIKA-2896
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.21
> Reporter: Eamonn Saunders
> Priority: Major
> Fix For: 1.22
>
> We have encountered a situation where the call to parser.reset() in the
> following code snippet results in a NullPointerException.
> {code:java}
> private static void releaseParser(SAXParser parser) {
>     try {
>         parser.reset();
>     } catch (UnsupportedOperationException e) {
>         //ignore
>     }
> {code}
> releaseParser() is called in the finally block of MimeTypesReader.read()
> {code:java}
> public void read(InputStream stream) throws IOException, MimeTypeException {
>     SAXParser parser = null;
>     try {
>         parser = acquireSAXParser();
>         parser.parse(stream, this);
>     } catch (TikaException e) {
>         throw new MimeTypeException("Unable to create an XML parser", e);
>     } catch (SAXException e) {
>         throw new MimeTypeException("Invalid type configuration", e);
>     } finally {
>         releaseParser(parser);
>     }
> }{code}
> The parser variable will be null coming out of acquireSAXParser() if
> acquireSAXParser() is called on a thread that is interrupted (i.e. the
> InterruptedException is handled in the following code):
> {code:java}
> private static SAXParser acquireSAXParser()
>         throws TikaException {
>     while (true) {
>         SAXParser parser = null;
>         try {
>             READ_WRITE_LOCK.readLock().lock();
>             parser = SAX_PARSERS.poll(10, TimeUnit.MILLISECONDS);
>         } catch (InterruptedException e) {
>             throw new TikaException("interrupted while waiting for SAXParser", e);
>         } finally {
>             READ_WRITE_LOCK.readLock().unlock();
>         }
>         if (parser != null) {
>             return parser;
>         }
>     }
> }{code}
> A simple fix would be to check for null before calling releaseParser() in the
> finally block.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
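The null check suggested at the end of the report could look roughly like the sketch below. This is an illustration and not the patch that was actually applied: it places the guard inside the helper rather than in the finally block (which has the same effect for this call site), and the class name {{ReleaseParserSketch}} and the boolean return value are additions made only so the behaviour can be observed.

```java
import javax.xml.parsers.SAXParser;

// Hypothetical stand-alone illustration of the null guard; not the actual Tika fix.
final class ReleaseParserSketch {

    // Returns false when there is nothing to release, true once the parser has been handled.
    static boolean releaseParser(SAXParser parser) {
        if (parser == null) {
            // acquireSAXParser() threw before a parser was assigned, so skip the reset
            return false;
        }
        try {
            parser.reset();
        } catch (UnsupportedOperationException e) {
            // ignore, as in the original snippet
        }
        return true;
    }
}
```

With this guard, a failed acquisition in read() surfaces as the original MimeTypeException instead of being masked by a NullPointerException thrown from the finally block.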
[jira] [Commented] (TIKA-2889) Tika Server keeps crashing
[ https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859265#comment-16859265 ]

Sergey Beryozkin commented on TIKA-2889:

[~talli...@apache.org] sorry, I was away this week. This issue has been closed now, so maybe CXF/Jetty was not at fault in this case

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
> Reporter: Thomas van Hesteren
> Priority: Minor
> Attachments: log4j.xml, tika-2.log, tika-server-everything-2.log,
> tika-server-everything.log, tika-server-everything.log,
> tika-server-everything.log, tika-server-everything.log,
> tika-server-everything3.log, tika.log, tika.log, tika3.log
>
> I have a document processor which sends documents to the Tika Server over
> cUrl. However, the server crashes multiple times (not document specific). The
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>
> The Tika server is started when the script starts executing. For now, I fixed
> the issue by making a watcher which restarts the tika server when it crashes.
> It then processes a few other documents and crashes again (after a few
> minutes, let's say 5 minutes tops).
>
> Is there any possibility to catch the exception (if it throws any?)
>
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|Processing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851032#comment-16851032 ] Sergey Beryozkin edited comment on TIKA-2882 at 5/29/19 4:26 PM: - I will definitely support getting the tika-parser-modules idea in 2.0 prioritized (please start a dev thread if you'd like, so that the categories can be reviewed what goes where etc). May be it is a simplistic view but if we postpone the OSGI-ification aspects till later then it is about moving specific parsers out of tika-parsers to respective modules ? Sorry if it is not the case :-). I can definitely help with testing. Or moving some parsers to a new module once I see how you do it one of these modules :-) was (Author: sergey_beryozkin): I will definitely support getting the tika-parser-modules idea in 2.0 prioritized (please start a dev thread if you'd like, so that the categories can be reviewed what goes where etc). May be it is a simplistic view but if postpone the OSGI-ification aspects till later then it is about moving specific parsers out of tika-parsers to respective modules ? Sorry if it is not the case :-). I can definitely help with testing. Or moving some parsers to a new module once I see how you do it one of these modules :-) > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... 
and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851032#comment-16851032 ] Sergey Beryozkin commented on TIKA-2882: I will definitely support getting the tika-parser-modules idea in 2.0 prioritized (please start a dev thread if you'd like, so that the categories can be reviewed what goes where etc). May be it is a simplistic view but if postpone the OSGI-ification aspects till later then it is about moving specific parsers out of tika-parsers to respective modules ? Sorry if it is not the case :-). I can definitely help with testing. Or moving some parsers to a new module once I see how you do it one of these modules :-) > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849697#comment-16849697 ] Sergey Beryozkin edited comment on TIKA-2882 at 5/28/19 1:39 PM: - I see, I was thinking of the 2.x branch :-) Lets start with the https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master ? was (Author: sergey_beryozkin): I see, I was thinking of the 2.x branch :-) Lets starts with the https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master ? > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849697#comment-16849697 ] Sergey Beryozkin commented on TIKA-2882: I see, I was thinking of the 2.x branch :-) Lets starts with the https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master ? > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849593#comment-16849593 ] Sergey Beryozkin commented on TIKA-2882: [~talli...@apache.org], so as far as Tika 2.0 is concerned, would it make sense to start applying similar ideas in the 1.x line ? Or make the 2.0 branch a master ? > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848866#comment-16848866 ] Sergey Beryozkin commented on TIKA-2882: Oh, is it multipart ? In that case may be it has to be replaced with something neutral such as Apache HttpClient or even the manual multipart payload creation. > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
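As a rough sketch of the "manual multipart payload creation" option mentioned in the comment above, a multipart/form-data body with a single file part can be assembled with the JDK alone, so no JAX-RS implementation is pulled in. The class {{MultipartSketch}}, the method {{buildBody}} and all parameter names are hypothetical; how the bytes are then sent (HttpURLConnection, Apache HttpClient, ...) and which field names a given remote service expects are outside this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: builds a multipart/form-data body with one file part.
final class MultipartSketch {

    private static final String CRLF = "\r\n";

    static byte[] buildBody(String boundary, String fieldName, String fileName,
                            String contentType, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Part headers, terminated by an empty line
        write(out, "--" + boundary + CRLF);
        write(out, "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"" + CRLF);
        write(out, "Content-Type: " + contentType + CRLF);
        write(out, CRLF);
        // Raw part content, copied byte for byte
        out.write(payload, 0, payload.length);
        write(out, CRLF);
        // Closing boundary
        write(out, "--" + boundary + "--" + CRLF);
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

The request itself would also need a header of the form Content-Type: multipart/form-data; boundary=..., using the same boundary string, and the boundary must not occur inside the payload.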
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848489#comment-16848489 ] Sergey Beryozkin commented on TIKA-2882: Give a try please, I can help with the migration if needed > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code
[ https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848485#comment-16848485 ] Sergey Beryozkin commented on TIKA-2882: Can you consider a PR where CXF WebClient code is replaced by JAX-RS 2.0 client API ? > Parsers should not include HTTP client code > --- > > Key: TIKA-2882 > URL: https://issues.apache.org/jira/browse/TIKA-2882 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.21 >Reporter: Jonathan Essex >Priority: Major > > Folks, does it really make sense for a parser to have a REST client built in? > The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. > > Since I don't use CXF and my entire app is built on a different JAX-RS stack > this just dropped me straight into dependency hell. > Surely it would make more sense to keep the parsers... well, parsers... and > build support for delegating parsing to other services into some higher level > in the stack (such as the server, where the CXF dependency is more benign). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2862) Make PDF Parser GraalVM native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Beryozkin resolved TIKA-2862.
    Resolution: Not A Problem

The issue is at the PDFBox level so it will be addressed in PDFBox then Tika will get it as part of the regular PDFBox dependency update

> Make PDF Parser GraalVM native mode ready
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually
> delay class initialization to image run time by using the option
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>     at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
>     at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat}
> See also
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2862) Make PDF Parser GraalVM native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839640#comment-16839640 ] Sergey Beryozkin commented on TIKA-2862: See https://github.com/apache/pdfbox/pull/69 > Make PDF Parser GraalVM native mode ready > -- > > Key: TIKA-2862 > URL: https://issues.apache.org/jira/browse/TIKA-2862 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.20 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > > PDF Parser is not Graal native mode ready yet, the following is reported when > it is processed as part of Quarkus native mode build: > {noformat} > Error: Detected a FileDescriptor in the image heap. You can manually > delay class initialization to image run time by using the option > --delay-class-initialization-to-runtime=. ... > Detailed message: > Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile > object org.apache.fontbox.ttf.RAFDataStream > object org.apache.fontbox.ttf.TrueTypeFont > object org.apache.pdfbox.pdmodel.font.PDType1Font > method > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() > Call path from entry point to > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): > > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) > at > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) > at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) > {noformat} > See also > [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2862) Make PDF Parser GraalVM native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-2862: --- Summary: Make PDF Parser GraalVM native mode ready (was: Make PDF Parser Graal native mode ready ) > Make PDF Parser GraalVM native mode ready > -- > > Key: TIKA-2862 > URL: https://issues.apache.org/jira/browse/TIKA-2862 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.20 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > > PDF Parser is not Graal native mode ready yet, the following is reported when > it is processed as part of Quarkus native mode build: > {noformat} > Error: Detected a FileDescriptor in the image heap. You can manually > delay class initialization to image run time by using the option > --delay-class-initialization-to-runtime=. ... > Detailed message: > Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile > object org.apache.fontbox.ttf.RAFDataStream > object org.apache.fontbox.ttf.TrueTypeFont > object org.apache.pdfbox.pdmodel.font.PDType1Font > method > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() > Call path from entry point to > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): > > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) > at > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) > at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) > {noformat} > See also > [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2862) Make PDF Parser Graal native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-2862: --- Description: PDF Parser is not Graal native mode ready yet, the following is reported when it is processed as part of Quarkus native mode build: {noformat} Error: Detected a FileDescriptor in the image heap. You can manually delay class initialization to image run time by using the option --delay-class-initialization-to-runtime=. ... Detailed message: Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile object org.apache.fontbox.ttf.RAFDataStream object org.apache.fontbox.ttf.TrueTypeFont object org.apache.pdfbox.pdmodel.font.PDType1Font method org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() Call path from entry point to org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) {noformat} See also [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] was: PDF Parser is not Graal native mode ready yet, the following is reported when it is processed as part of Quarkus native mode build: Error: Detected a FileDescriptor in the image heap. You can manually delay class initialization to image run time by using the option --delay-class-initialization-to-runtime=. ... 
Detailed message: Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile object org.apache.fontbox.ttf.RAFDataStream object org.apache.fontbox.ttf.TrueTypeFont object org.apache.pdfbox.pdmodel.font.PDType1Font method org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() Call path from entry point to org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) See also [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] > Make PDF Parser Graal native mode ready > > > Key: TIKA-2862 > URL: https://issues.apache.org/jira/browse/TIKA-2862 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.20 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > > PDF Parser is not Graal native mode ready yet, the following is reported when > it is processed as part of Quarkus native mode build: > {noformat} > Error: Detected a FileDescriptor in the image heap. You can manually > delay class initialization to image run time by using the option > --delay-class-initialization-to-runtime=. ... 
> Detailed message: > Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile > object org.apache.fontbox.ttf.RAFDataStream > object org.apache.fontbox.ttf.TrueTypeFont > object org.apache.pdfbox.pdmodel.font.PDType1Font > method > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() > Call path from entry point to > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): > > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) > at > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) > at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) > {noformat} > See also > [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2862) Make PDF Parser Graal native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837432#comment-16837432 ] Sergey Beryozkin edited comment on TIKA-2862 at 5/10/19 4:37 PM: - The call path from PDType1Font to RAFDataStream: {noformat} at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359) at org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146) at org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91) {noformat} was (Author: sergey_beryozkin): The call path from PDType1Font to RAFDataStream: {noformat} 17:31:09,714 ERROR [org.apa.pdf.pdm.fon.FileSystemFontProvider] Could not load font file: /usr/share/fonts/liberation/LiberationSans-Regular.ttf: java.lang.NullPointerException at org.apache.fontbox.ttf.RAFDataStream.readSignedShort(RAFDataStream.java:77) at org.apache.fontbox.ttf.TTFDataStream.read32Fixed(TTFDataStream.java:50) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731) at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359) at org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146) at org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91) {noformat} > Make PDF Parser Graal native mode ready > > > Key: TIKA-2862 > URL: https://issues.apache.org/jira/browse/TIKA-2862 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.20 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > > PDF Parser is not Graal native mode ready yet, the following is reported when > it is processed as part of Quarkus native mode build: > Error: Detected a FileDescriptor in the image heap. You can manually > delay class initialization to image run time by using the option > --delay-class-initialization-to-runtime=. ... 
> Detailed message: > Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile > object org.apache.fontbox.ttf.RAFDataStream > object org.apache.fontbox.ttf.TrueTypeFont > object org.apache.pdfbox.pdmodel.font.PDType1Font > method > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() > Call path from entry point to > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): > > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) > at > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) > at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) > > See also > [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2862) Make PDF Parser Graal native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837432#comment-16837432 ] Sergey Beryozkin commented on TIKA-2862: The call path from PDType1Font to RAFDataStream: {noformat} 17:31:09,714 ERROR [org.apa.pdf.pdm.fon.FileSystemFontProvider] Could not load font file: /usr/share/fonts/liberation/LiberationSans-Regular.ttf: java.lang.NullPointerException at org.apache.fontbox.ttf.RAFDataStream.readSignedShort(RAFDataStream.java:77) at org.apache.fontbox.ttf.TTFDataStream.read32Fixed(TTFDataStream.java:50) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55) at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382) at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359) at org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146) at org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91) {noformat} > Make PDF Parser Graal native mode ready > > > Key: TIKA-2862 > URL: https://issues.apache.org/jira/browse/TIKA-2862 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.20 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > > PDF Parser is not Graal native mode ready yet, the following is reported when > it is processed as part of Quarkus native mode build: > Error: Detected a 
FileDescriptor in the image heap. You can manually > delay class initialization to image run time by using the option > --delay-class-initialization-to-runtime=. ... > Detailed message: > Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile > object org.apache.fontbox.ttf.RAFDataStream > object org.apache.fontbox.ttf.TrueTypeFont > object org.apache.pdfbox.pdmodel.font.PDType1Font > method > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() > Call path from entry point to > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): > > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) > at > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) > at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) > > See also > [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (TIKA-2862) Make PDF Parser Graal native mode ready
[ https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin reassigned TIKA-2862: -- Assignee: Sergey Beryozkin > Make PDF Parser Graal native mode ready > > > Key: TIKA-2862 > URL: https://issues.apache.org/jira/browse/TIKA-2862 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.20 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Major > > PDF Parser is not Graal native mode ready yet, the following is reported when > it is processed as part of Quarkus native mode build: > Error: Detected a FileDescriptor in the image heap. You can manually > delay class initialization to image run time by using the option > --delay-class-initialization-to-runtime=. ... > Detailed message: > Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile > object org.apache.fontbox.ttf.RAFDataStream > object org.apache.fontbox.ttf.TrueTypeFont > object org.apache.pdfbox.pdmodel.font.PDType1Font > method > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() > Call path from entry point to > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): > > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) > at > org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93) > at > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) > at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) > > See also > [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TIKA-2862) Make PDF Parser Graal native mode ready
Sergey Beryozkin created TIKA-2862: -- Summary: Make PDF Parser Graal native mode ready Key: TIKA-2862 URL: https://issues.apache.org/jira/browse/TIKA-2862 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.20 Reporter: Sergey Beryozkin PDF Parser is not Graal native mode ready yet; the following is reported when it is processed as part of a Quarkus native mode build: Error: Detected a FileDescriptor in the image heap. You can manually delay class initialization to image run time by using the option --delay-class-initialization-to-runtime=. ... Detailed message: Trace: object org.apache.fontbox.ttf.BufferedRandomAccessFile object org.apache.fontbox.ttf.RAFDataStream object org.apache.fontbox.ttf.TrueTypeFont object org.apache.pdfbox.pdmodel.font.PDType1Font method org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults() Call path from entry point to org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106) at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93) at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108) at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164) See also [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
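The trace in this issue appears to show a font object that keeps an open file (BufferedRandomAccessFile) reachable from objects that end up in the native image heap. As a general illustration only, and not the change that was proposed for PDFBox, the sketch below keeps just a path in the long-lived object and opens the file on first use, so no file descriptor exists until run time. LazyFile and LazyFileDemo are hypothetical names, not PDFBox or Tika classes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical holder: stores only the path until the data is actually needed.
final class LazyFile implements AutoCloseable {
    private final Path path;
    private RandomAccessFile file; // null until first use, so no descriptor is held

    LazyFile(Path path) {
        this.path = path;
    }

    synchronized boolean isOpen() {
        return file != null;
    }

    // Opens the file on demand and returns its first byte (0-255), or -1 if empty.
    synchronized int readFirstByte() {
        try {
            if (file == null) {
                file = new RandomAccessFile(path.toFile(), "r");
            }
            file.seek(0);
            return file.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public synchronized void close() {
        try {
            if (file != null) {
                file.close();
                file = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class LazyFileDemo {
    // Creates a one-byte temporary file used only for this demonstration.
    static Path tempFileWithByte(int value) {
        try {
            Path p = Files.createTempFile("lazy-file-demo", ".bin");
            p.toFile().deleteOnExit();
            Files.write(p, new byte[] {(byte) value});
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LazyFile f = new LazyFile(tempFileWithByte(42));
        System.out.println("open after construction: " + f.isOpen());
        System.out.println("first byte: " + f.readFirstByte());
        System.out.println("open after first read: " + f.isOpen());
        f.close();
    }
}
```

Whether this kind of deferral is enough in a given native image build depends on what else is reachable at build time; the option named in the error message is the other route.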
[jira] [Resolved] (TIKA-2476) Metadata.toString always returns a trailing space
[ https://issues.apache.org/jira/browse/TIKA-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved TIKA-2476. Resolution: Fixed Assignee: Sergey Beryozkin > Metadata.toString always returns a trailing space > - > > Key: TIKA-2476 > URL: https://issues.apache.org/jira/browse/TIKA-2476 > Project: Tika > Issue Type: Improvement > Components: core >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (TIKA-2476) Metadata.toString always returns a trailing space
Sergey Beryozkin created TIKA-2476: -- Summary: Metadata.toString always returns a trailing space Key: TIKA-2476 URL: https://issues.apache.org/jira/browse/TIKA-2476 Project: Tika Issue Type: Improvement Components: core Affects Versions: 1.16 Reporter: Sergey Beryozkin Priority: Trivial Fix For: 1.17 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved TIKA-2472. Resolution: Fixed > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198709#comment-16198709 ] Sergey Beryozkin commented on TIKA-2472: This is fixed now... > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195792#comment-16195792 ] Sergey Beryozkin commented on TIKA-2472: Ken, thanks for the tip, makes sense to follow this path at the Metadata level as well > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195303#comment-16195303 ] Sergey Beryozkin commented on TIKA-2472: I got a bit of a shock with this code:
{code:java}
@Test
public void testIt() {
    Map<String, String[]> map1 = new HashMap<>();
    map1.put("A", new String[] {"a"});
    Map<String, String[]> map2 = new HashMap<>();
    map2.put("A", new String[] {"a"});
    // Arrays inherit equals/hashCode from Object, so both comparisons use identity
    System.out.println(map1.equals(map2));
    System.out.println(map1.hashCode() == map2.hashCode());
}
{code}
'false' is printed in both cases, which is obvious really, given the 'identity' semantics of arrays. Eugene, you are right, thanks for being on top of these changes, you'll make me a Java champion soon :-) Guys, should we update Metadata to use a List of Strings? (Though that is a separate issue.) > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
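To make the point in this comment concrete, here is a minimal, self-contained sketch of the two obvious ways around array identity: store lists instead of arrays, or compare and hash the arrays by content with java.util.Arrays. The class ArrayValuedMaps and its methods are hypothetical and are not the code that went into Tika's Metadata.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class ArrayValuedMaps {

    // Builds a fresh map with a fresh array each time it is called.
    static Map<String, String[]> sample() {
        Map<String, String[]> m = new HashMap<>();
        m.put("A", new String[] {"a"});
        return m;
    }

    // Map.equals delegates to String[].equals, which is reference identity.
    static boolean arrayMapsEqual() {
        return sample().equals(sample());
    }

    // List.equals compares element by element, so equal contents compare equal.
    static boolean listMapsEqual() {
        Map<String, List<String>> m1 = new HashMap<>();
        m1.put("A", Arrays.asList("a"));
        Map<String, List<String>> m2 = new HashMap<>();
        m2.put("A", Arrays.asList("a"));
        return m1.equals(m2);
    }

    // Content-based equality for array-valued maps.
    static boolean contentEquals(Map<String, String[]> a, Map<String, String[]> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (Map.Entry<String, String[]> e : a.entrySet()) {
            if (!b.containsKey(e.getKey())
                    || !Arrays.equals(e.getValue(), b.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Content-based hash, consistent with contentEquals: sums per-entry hashes
    // the way AbstractMap does, but hashes each array by its elements.
    static int contentHash(Map<String, String[]> m) {
        int h = 0;
        for (Map.Entry<String, String[]> e : m.entrySet()) {
            h += Objects.hashCode(e.getKey()) ^ Arrays.hashCode(e.getValue());
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(arrayMapsEqual());                   // false
        System.out.println(listMapsEqual());                    // true
        System.out.println(contentEquals(sample(), sample()));  // true
        System.out.println(contentHash(sample()) == contentHash(sample())); // true
    }
}
```

Summing per-entry hashes keeps the result independent of iteration order, which matters because HashMap makes no ordering guarantee.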
[jira] [Reopened] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin reopened TIKA-2472: With thanks to Eugene... > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195275#comment-16195275 ] Sergey Beryozkin commented on TIKA-2472: I wouldn't call it incorrect, just sub-optimal. I know how the relevant Map hashCode is implemented; I copied that to ParseResult as a temporary substitution (to be honest, it does not really matter how hashCode or even equals is implemented if ParseResult keeps a file location, which is the real key). That said, I have no problem with improving this code. > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (TIKA-2472) Implement Metadata.hashCode
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved TIKA-2472. Resolution: Fixed > Implement Metadata.hashCode > --- > > Key: TIKA-2472 > URL: https://issues.apache.org/jira/browse/TIKA-2472 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.16 >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.17 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (TIKA-2472) Implement Metadata.hashCode
Sergey Beryozkin created TIKA-2472: -- Summary: Implement Metadata.hashCode Key: TIKA-2472 URL: https://issues.apache.org/jira/browse/TIKA-2472 Project: Tika Issue Type: Improvement Affects Versions: 1.16 Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Trivial Fix For: 1.17 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2384) Double close of InputStream in accept text/plain in tika-server
[ https://issues.apache.org/jira/browse/TIKA-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034923#comment-16034923 ] Sergey Beryozkin commented on TIKA-2384: Not sure - maybe they can be closed in TikaResource's StreamingOutput implementation? > Double close of InputStream in accept text/plain in tika-server > --- > > Key: TIKA-2384 > URL: https://issues.apache.org/jira/browse/TIKA-2384 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.15 >Reporter: Tim Allison >Priority: Blocker > > As reported by Haris Osmanagic on the user list, TikaResource closes the > InputStream twice on requests for text/plain. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
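The suggestion in this comment, closing the stream inside the streaming callback so that exactly one place owns it, can be sketched with the standard library alone. The Body interface below is a hypothetical stand-in that mirrors the shape of JAX-RS StreamingOutput (a single write(OutputStream) method); CountingInputStream exists only to observe how often close() is called. None of this is the actual TikaResource code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SingleCloseSketch {

    // Hypothetical stand-in for a streaming response callback.
    interface Body {
        void write(OutputStream out) throws IOException;
    }

    // Test helper that records how many times close() was invoked.
    static final class CountingInputStream extends FilterInputStream {
        int closeCalls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closeCalls++;
            super.close();
        }
    }

    // The callback owns the stream: it copies it and closes it exactly once.
    // Callers of copyAndClose must not close the stream themselves.
    static Body copyAndClose(InputStream in) {
        return out -> {
            try (InputStream owned = in) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = owned.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        };
    }

    // Runs the callback once and reports how often the input was closed.
    static int closeCallsAfterResponse(String text) {
        CountingInputStream in = new CountingInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        try {
            copyAndClose(in).write(new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.closeCalls;
    }

    public static void main(String[] args) {
        System.out.println("close() calls: " + closeCallsAfterResponse("hello"));
    }
}
```

The design choice is ownership: whichever component reads the stream to the end is the one that closes it, and the resource method that creates the callback leaves the stream alone.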
[jira] [Comment Edited] (TIKA-2384) Double close of InputStream in accept text/plain in tika-server
[ https://issues.apache.org/jira/browse/TIKA-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034859#comment-16034859 ] Sergey Beryozkin edited comment on TIKA-2384 at 6/2/17 3:24 PM: Hi Tim, CXF will only auto-close InputStream if its task is to copy InputStream (or it can avoid doing it if configured as such), which can only happen if InputStream is returned as a service method response (directly or as JAX-RS Response entity) was (Author: sergey_beryozkin): Hi Tim, CXF will only auto-close InputStream if its task is to copy InputStream (or it can avoid doing it if configured as such), which can only happen if InputStream returned as a service method response (directly or as JAX-RS Response entity) > Double close of InputStream in accept text/plain in tika-server > --- > > Key: TIKA-2384 > URL: https://issues.apache.org/jira/browse/TIKA-2384 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.15 >Reporter: Tim Allison >Priority: Blocker > > As reported by Haris Osmanagic on the user list, TikaResource closes the > InputStream twice on requests for text/plain. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2384) Double close of InputStream in accept text/plain in tika-server
[ https://issues.apache.org/jira/browse/TIKA-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034859#comment-16034859 ] Sergey Beryozkin commented on TIKA-2384: Hi Tim, CXF will only auto-close InputStream if its task is to copy InputStream (or it can avoid doing it if configured as such), which can only happen if InputStream returned as a service method response (directly or as JAX-RS Response entity) > Double close of InputStream in accept text/plain in tika-server > --- > > Key: TIKA-2384 > URL: https://issues.apache.org/jira/browse/TIKA-2384 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.15 >Reporter: Tim Allison >Priority: Blocker > > As reported by Haris Osmanagic on the user list, TikaResource closes the > InputStream twice on requests for text/plain. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2292) Update CXF version to 3.0.12
[ https://issues.apache.org/jira/browse/TIKA-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906697#comment-15906697 ] Sergey Beryozkin commented on TIKA-2292: Hi - thanks for this fix :-) > Update CXF version to 3.0.12 > > > Key: TIKA-2292 > URL: https://issues.apache.org/jira/browse/TIKA-2292 > Project: Tika > Issue Type: Task > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Minor > Fix For: 1.15 > > > This is the last version in the CXF 3.0.x line which supports Java 6 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (TIKA-2292) Update CXF version to 3.0.12
Sergey Beryozkin created TIKA-2292: -- Summary: Update CXF version to 3.0.12 Key: TIKA-2292 URL: https://issues.apache.org/jira/browse/TIKA-2292 Project: Tika Issue Type: Task Components: server Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Minor Fix For: 1.15 This is the last version in the CXF 3.0.x line which supports Java 6 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2017) Tika Server Cannot handle large files
[ https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348132#comment-15348132 ] Sergey Beryozkin commented on TIKA-2017: It might also be worth trying multiparts; I've updated the wiki to note that Metadata, RecursiveMetadata and TikaResource support multipart requests: https://wiki.apache.org/tika/TikaJAXRS#preview By the way, I recall updating the PDF parser a while back so that it parses the metadata only, without touching the content; the ContentHandler needs to be set to null for that. See testPdfParsingMetadataOnly in PDFParserTest. It might make sense to update other parsers too, though in this case using multiparts alone might help. > Tika Server Cannot handle large files > - > > Key: TIKA-2017 > URL: https://issues.apache.org/jira/browse/TIKA-2017 > Project: Tika > Issue Type: Bug >Reporter: Harshavardhan Manjunatha > Fix For: 1.14 > > > Tika-Python uses Tika REST Server to parse both content & metadata. In this > case, the CSV file was 600 MB in size. Tika REST Server runs out of Heap > Space since it tries to parse Content also. There should be an option to make a > REST API call to Tika Server just to parse & return metadata. 
> {code} > Jun 22, 2016 6:38:40 PM org.slf4j.impl.JCLLoggerAdapter warn > WARNING: /rmeta/text > java.lang.RuntimeException: org.apache.cxf.interceptor.Fault: Java heap space > at > org.apache.cxf.interceptor.AbstractFaultChainInitiatorObserver.onMessage(AbstractFaultChainInitiatorObserver.java:116) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:371) > at > org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) > at > org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251) > at > org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) > at > org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865) > at > org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) > at > org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696) > at > 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.cxf.interceptor.Fault: Java heap space > at > org.apache.cxf.service.invoker.AbstractInvoker.createFault(AbstractInvoker.java:163) > at > org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:129) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > ... 21 more > Caused by: java.lang.OutOfMemoryError: Java heap space > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data
[ https://issues.apache.org/jira/browse/TIKA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved TIKA-1871. Resolution: Fixed > Update Tika JAXRS wiki page with the info about multipart/form-data > --- > > Key: TIKA-1871 > URL: https://issues.apache.org/jira/browse/TIKA-1871 > Project: Tika > Issue Type: Task > Components: documentation >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 1.13 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data
[ https://issues.apache.org/jira/browse/TIKA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204947#comment-15204947 ] Sergey Beryozkin commented on TIKA-1871: Minor update done to the introduction at https://wiki.apache.org/tika/TikaJAXRS#Services and an extra sub-section added at https://wiki.apache.org/tika/TikaJAXRS#Tika_Resource: https://wiki.apache.org/tika/TikaJAXRS#Multipart_Support > Update Tika JAXRS wiki page with the info about multipart/form-data > --- > > Key: TIKA-1871 > URL: https://issues.apache.org/jira/browse/TIKA-1871 > Project: Tika > Issue Type: Task > Components: documentation >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 1.13 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data
[ https://issues.apache.org/jira/browse/TIKA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-1871: --- Priority: Major (was: Minor) Fix Version/s: 1.13 > Update Tika JAXRS wiki page with the info about multipart/form-data > --- > > Key: TIKA-1871 > URL: https://issues.apache.org/jira/browse/TIKA-1871 > Project: Tika > Issue Type: Task > Components: documentation >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin > Fix For: 1.13 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data
Sergey Beryozkin created TIKA-1871: -- Summary: Update Tika JAXRS wiki page with the info about multipart/form-data Key: TIKA-1871 URL: https://issues.apache.org/jira/browse/TIKA-1871 Project: Tika Issue Type: Task Components: documentation Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699760#comment-14699760 ] Sergey Beryozkin commented on TIKA-1712: Great, thanks > GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. > See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > java.lang.NullPointerException > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699687#comment-14699687 ] Sergey Beryozkin commented on TIKA-1712: Hi Chris, Dan Kulp reminded me that back in CXF 2.7.x a maven-shade-plugin was used: https://fisheye6.atlassian.com/browse/cxf/osgi/bundle/all/pom.xml?r=12acd46e3dbe98fa1321374b09174d5876271f08#to448 and it can help with collapsing files into a single one. I guess the tika-app assembly needs to be updated accordingly; I've never tried working with the shade plugin myself. Thanks > GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Sergey Beryozkin > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. > See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at 
org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > java.lang.NullPointerException > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699646#comment-14699646 ] Sergey Beryozkin edited comment on TIKA-1712 at 8/17/15 3:09 PM: - Add the following to bus-extensions.txt in your local tika-app.jar to validate it will fix it:
org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true
org.apache.cxf.bus.managers.WorkQueueManagerImpl:org.apache.cxf.workqueue.WorkQueueManager:true
org.apache.cxf.bus.managers.CXFBusLifeCycleManager:org.apache.cxf.buslifecycle.BusLifeCycleManager:true
org.apache.cxf.bus.managers.ServerRegistryImpl:org.apache.cxf.endpoint.ServerRegistry:true
org.apache.cxf.bus.managers.EndpointResolverRegistryImpl:org.apache.cxf.endpoint.EndpointResolverRegistry:true
org.apache.cxf.bus.managers.HeaderManagerImpl:org.apache.cxf.headers.HeaderManager:true
org.apache.cxf.service.factory.FactoryBeanListenerManager::true
org.apache.cxf.bus.managers.ServerLifeCycleManagerImpl:org.apache.cxf.endpoint.ServerLifeCycleManager:true
org.apache.cxf.bus.managers.ClientLifeCycleManagerImpl:org.apache.cxf.endpoint.ClientLifeCycleManager:true
org.apache.cxf.bus.resource.ResourceManagerImpl:org.apache.cxf.resource.ResourceManager:true
org.apache.cxf.catalog.OASISCatalogManager:org.apache.cxf.catalog.OASISCatalogManager:true
The 1st line is probably enough...
was (Author: sergey_beryozkin): Add the following to bus-extensions.txt in your local tika-app.jar to validate it will fix it:
org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true
org.apache.cxf.bus.managers.WorkQueueManagerImpl:org.apache.cxf.workqueue.WorkQueueManager:true
org.apache.cxf.bus.managers.CXFBusLifeCycleManager:org.apache.cxf.buslifecycle.BusLifeCycleManager:true
org.apache.cxf.bus.managers.ServerRegistryImpl:org.apache.cxf.endpoint.ServerRegistry:true
org.apache.cxf.bus.managers.EndpointResolverRegistryImpl:org.apache.cxf.endpoint.EndpointResolverRegistry:true
org.apache.cxf.bus.managers.HeaderManagerImpl:org.apache.cxf.headers.HeaderManager:true
org.apache.cxf.service.factory.FactoryBeanListenerManager::true
org.apache.cxf.bus.managers.ServerLifeCycleManagerImpl:org.apache.cxf.endpoint.ServerLifeCycleManager:true
org.apache.cxf.bus.managers.ClientLifeCycleManagerImpl:org.apache.cxf.endpoint.ClientLifeCycleManager:true
org.apache.cxf.bus.resource.ResourceManagerImpl:org.apache.cxf.resource.ResourceManager:true
org.apache.cxf.catalog.OASISCatalogManager:org.apache.cxf.catalog.OASISCatalogManager:true
> GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Sergey Beryozkin > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. 
> See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(Recurs
[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699646#comment-14699646 ] Sergey Beryozkin commented on TIKA-1712: Add the following to bus-extensions.txt in your local tika-app.jar to validate it will fix it:
org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true
org.apache.cxf.bus.managers.WorkQueueManagerImpl:org.apache.cxf.workqueue.WorkQueueManager:true
org.apache.cxf.bus.managers.CXFBusLifeCycleManager:org.apache.cxf.buslifecycle.BusLifeCycleManager:true
org.apache.cxf.bus.managers.ServerRegistryImpl:org.apache.cxf.endpoint.ServerRegistry:true
org.apache.cxf.bus.managers.EndpointResolverRegistryImpl:org.apache.cxf.endpoint.EndpointResolverRegistry:true
org.apache.cxf.bus.managers.HeaderManagerImpl:org.apache.cxf.headers.HeaderManager:true
org.apache.cxf.service.factory.FactoryBeanListenerManager::true
org.apache.cxf.bus.managers.ServerLifeCycleManagerImpl:org.apache.cxf.endpoint.ServerLifeCycleManager:true
org.apache.cxf.bus.managers.ClientLifeCycleManagerImpl:org.apache.cxf.endpoint.ClientLifeCycleManager:true
org.apache.cxf.bus.resource.ResourceManagerImpl:org.apache.cxf.resource.ResourceManager:true
org.apache.cxf.catalog.OASISCatalogManager:org.apache.cxf.catalog.OASISCatalogManager:true
> GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Sergey Beryozkin > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. 
> See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > java.lang.NullPointerException > at > 
org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699635#comment-14699635 ] Sergey Beryozkin commented on TIKA-1712: Hi Chris, the problem is that META-INF/cxf/bus-extensions.txt in tika-app.jar is incomplete: it only contains what is available inside cxf-rt-transports-http, but none of the content available in cxf-core/bus-extensions.txt. After updating the file in tika-app manually I got the GROBID example working. The solution is to have the content of all META-INF/cxf/bus-extensions.txt files available in tika-app, in a single file. I'm not sure how this can be realized, though. Sergey > GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Sergey Beryozkin > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. 
> See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > java.lang.NullPointerException > at > 
org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
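The fix described in this thread, having the content of every META-INF/cxf/bus-extensions.txt available in one file, can be sketched in plain Java. This is only an illustration under assumptions: the class BusExtensionsMerger and its merge method are hypothetical and not part of Tika or CXF, the second entry in main is a made-up placeholder, and a real build would more likely perform the merge at packaging time rather than in application code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or CXF: merges the text of several
// META-INF/cxf/bus-extensions.txt files into the entries of a single file.
class BusExtensionsMerger {

    // Walks the inputs in order, trims each line, and drops blank lines and
    // exact duplicates (the first occurrence of an entry wins).
    static List<String> merge(List<String> fileContents) {
        Set<String> merged = new LinkedHashSet<>();
        for (String content : fileContents) {
            for (String line : content.split("\\R")) {
                String entry = line.trim();
                if (!entry.isEmpty()) {
                    merged.add(entry);
                }
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Entry quoted in this thread as missing from tika-app.jar.
        String core =
            "org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true\n";
        // Made-up placeholder standing in for the entries that another jar ships.
        String other =
            "example.HypotheticalExtensionImpl:example.HypotheticalExtension:true\n";
        merge(List.of(other, core)).forEach(System.out::println);
    }
}
```

Keeping insertion order and dropping only exact duplicates means the merged file stays a plain concatenation of the originals, which is the behaviour asked for in the comment above.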
[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699377#comment-14699377 ] Sergey Beryozkin commented on TIKA-1712: Never mind, steps are described :-) > GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Sergey Beryozkin > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. > See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > java.lang.NullPointerException > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app
[ https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699362#comment-14699362 ] Sergey Beryozkin commented on TIKA-1712: Hi Chris, all, sorry, I missed this issue completely; I might have thought it was related to the actual parsing only. Could you please do me a favour and attach a sample GROBID resource, plus whatever is needed in a grobidparser-resources folder, so that the server starts and the issue can be reproduced? Thanks > GROBID parser fails in tika-app > --- > > Key: TIKA-1712 > URL: https://issues.apache.org/jira/browse/TIKA-1712 > Project: Tika > Issue Type: Bug > Components: cli, server >Reporter: Chris A. Mattmann >Assignee: Sergey Beryozkin > Fix For: 1.11 > > > Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in > tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. > See: > https://issues.apache.org/jira/browse/CXF-6545 > Try calling the GROBID parser from Tika app: > java -classpath > $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar > org.apache.tika.cli.TikaCLI > --config=$HOME/git/grobidparser-resources/tika-config.xml -J > $HOME/git/grobid/papers/ICSE06.pdf > After following this guide: > https://wiki.apache.org/tika/GrobidJournalParser > Works fine in Tika-Server - dies in Tika-app with: > {noformat} > java.lang.NullPointerException > at > org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849) > at > org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923) > at > org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125) > at > org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894) > at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865) > at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331) > at 
org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340) > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > java.lang.NullPointerException > at > org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89) > at > org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504999#comment-14504999 ] Sergey Beryozkin commented on TIKA-1607: Hi, IMHO it indeed makes sense to keep the existing Metadata methods that return String values but also offer optional support for representing Metadata as a multivalued map of arbitrary object key/values, where the original String to String[] pairs are converted into something more sophisticated if required... By the way, the JAX-RS API has this interface: http://docs.oracle.com/javaee/7/api/javax/ws/rs/core/MultivaluedMap.html I'm not suggesting using it natively in Tika, but it might be of interest... Cheers, Sergey > Introduce new HashMap data structure for persistence of Tika > Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.9 > > > I am currently working on implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
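A rough sketch of the idea discussed above (keep the String-returning accessors, but store arbitrary objects per key) might look like the following. The class and method names here are hypothetical and chosen only for illustration; this is not Tika's Metadata API, and the String view simply relies on String.valueOf.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration only; not part of Tika.
class ObjectMetadataSketch {
    // Each key maps to an ordered list of arbitrary values.
    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    // Append a value of any type under the given key.
    void add(String name, Object value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Richer view: all values stored for a key, as objects (read-only).
    List<Object> getObjects(String name) {
        return Collections.unmodifiableList(
                values.getOrDefault(name, Collections.emptyList()));
    }

    // Backwards-compatible view: every value rendered with String.valueOf.
    String[] getValues(String name) {
        List<Object> objects = values.getOrDefault(name, Collections.emptyList());
        String[] result = new String[objects.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = String.valueOf(objects.get(i));
        }
        return result;
    }

    // Backwards-compatible single-value accessor: first value, or null if absent.
    String get(String name) {
        String[] all = getValues(name);
        return all.length == 0 ? null : all[0];
    }
}
```

With something like this, a phone number entry could be added as a Map of attributes (for example LibPN-CountryCode) while callers that only know the String accessors keep working.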
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362467#comment-14362467 ] Sergey Beryozkin commented on TIKA-891: --- What I do feel rather strongly about is that existing users should not start seeing client code that expects PUT support break. The duplication is usually of some concern to developers initially but always proves to be negligible in terms of how much is actually duplicated. Some people can write a ContainerRequestFilter that adapts PUT requests to POST and saves a bit on typing duplicate method definitions, though at the minor extra cost of having to maintain yet another provider. > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.9 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362039#comment-14362039 ] Sergey Beryozkin commented on TIKA-891: --- It is not the case. In JAX-RS, if you need to have multiple HTTP methods supported on the same path, you just have multiple Java methods, annotated with PUT, POST, etc., delegating to a common implementation. > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.9 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
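The delegation pattern described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical (not the actual tika-server resource code), and the JAX-RS annotations are shown as comments so that the sketch compiles on a plain JDK without the javax.ws.rs API on the classpath.

```java
// Hypothetical resource class; names are illustrative, not Tika's actual code.
// In a real JAX-RS resource the two entry points would carry @PUT and @POST.
class MetaResourceSketch {

    // @PUT  (kept so that existing PUT-based clients continue to work)
    String putMeta(byte[] body) {
        return extract(body);
    }

    // @POST (the newly added verb)
    String postMeta(byte[] body) {
        return extract(body);
    }

    // Shared implementation that both entry points delegate to.
    // Stand-in logic only: it reports the payload size as one CSV row.
    private String extract(byte[] body) {
        return "\"Content-Length\",\"" + body.length + "\"";
    }
}
```

The duplicated part is limited to the thin entry-point methods; the actual work lives in one place.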
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344857#comment-14344857 ] Sergey Beryozkin commented on TIKA-891: --- Well, I guess we have to be careful with replacing all PUTs with POSTs. As far as I recall, one of curl commands documented at the Tika JAXRS page uses PUT implicitly. Supporting both methods is a more conservative approach in the short term at least. > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.9 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343816#comment-14343816 ] Sergey Beryozkin edited comment on TIKA-891 at 3/2/15 9:44 PM: --- If I said PUT was the same as POST then I'd probably be seriously criticized at the very least :-). My point was, that neither verb is probably that semantically close for the type of the action Tika Server supports. Typically I prefer just to get things done though risking sometimes writing a possibly not very pure REST code :-). However you are right forms do not support PUT hence I guess I'd be better to have POST across the board for the sake of consistency. And if you agree - then keeping PUT for 1.8 as deprecated and removing in 1.9 Sergey was (Author: sergey_beryozkin): If I said PUT was the same as POST I'd be probably seriously criticized at the very least :-). My point was, that neither verb is probably not that semantically close for the type of the action Tika Server supports. Typically I prefer just to get things done though risking sometimes writing a possibly not very pure REST code :-). However you are rights forms do not support PUT hence I guess I'd be better to have POST across the board for the sake of consistency. And if you agree - then keeping PUT for 1.8 as deprecated and removing in 1.9 Sergey > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.8 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). 
Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343816#comment-14343816 ] Sergey Beryozkin commented on TIKA-891: --- If I said PUT was the same as POST I'd probably be seriously criticized at the very least :-). My point was that neither verb is probably that semantically close for the type of action Tika Server supports. Typically I prefer just to get things done, though sometimes at the risk of writing possibly not very pure REST code :-). However you are right, forms do not support PUT, hence I guess it'd be better to have POST across the board for the sake of consistency. And if you agree - then keeping PUT for 1.8 as deprecated and removing in 1.9 Sergey > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.8 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343373#comment-14343373 ] Sergey Beryozkin commented on TIKA-891: --- Just to clarify: I've no objections to migrating to POST from PUT, I guess neither fits perfectly, perhaps POST is marginally 'cleaner', but ultimately it is not that important IMHO :-). The question is how to support the existing users. Keeping PUT for 1.8 only might make sense. I guess it depends on whether 1.8 is considered a major version or not. Thanks > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.8 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343330#comment-14343330 ] Sergey Beryozkin commented on TIKA-891: --- Why are both PUT and POST out ? GET does not fit, it offers no input content. > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.8 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342977#comment-14342977 ] Sergey Beryozkin commented on TIKA-891: --- IMHO it might make sense to keep PUT as deprecated for the next release so that the existing users can still get it working, and provide documentation stating that PUT will be removed in 1.9. Or at the very least offer such a migration guide for 1.8. As a side note, I don't know why PUT was used in the first place, but speaking of the semantics, I'm not 100% sure POST (effectively adding a resource to the collection) is the closest match to what Tika JAX-RS server does, by accepting a resource and echoing its metadata/data back. Cheers, Sergey > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.8 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253426#comment-14253426 ] Sergey Beryozkin commented on TIKA-1497: Hi Tim, we are all JAX-RS experts here, JAX-RS is the easiest part :-) +1 to merging MetadataEp into MetadataResource Thanks, Sergey > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch, TIKA-1497v2.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. > According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253240#comment-14253240 ] Sergey Beryozkin commented on TIKA-1497: Hi Tim, Thanks for a fix, IMHO the latest metadata resource code authored by you is absolutely perfect :-) > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch, TIKA-1497v2.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. > According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252239#comment-14252239 ] Sergey Beryozkin commented on TIKA-1497: Hi Tim, sorry for the delay, if you are also fine with it then yes, it would be a bit more flexible (we won't need to update method signatures in the future if custom headers like etags, etc. need to be added). MetadataEP, should it be merged into MetadataResource as Chris suggested or would you prefer to wait till 2.0 ? Cheers, Sergey > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch, TIKA-1497v2.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. > According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251952#comment-14251952 ] Sergey Beryozkin commented on TIKA-1497: Hi, yes, looks nice, though returning Response is marginally more flexible in cases where extra custom headers need to be set (as opposed to doing it from ContainerResponseFilter). It makes no difference to the runtime whether Metadata is returned as a Response entity or directly; the only minor issue, and this is CXF-specific only, is that it does not play well with the client proxy API, which is not used with Tika tests... Cheers, Sergey > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch, TIKA-1497v2.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. > According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251920#comment-14251920 ] Sergey Beryozkin edited comment on TIKA-1497 at 12/18/14 5:27 PM: -- Hi Tim We can probably have {code:java} @Produces({"text/csv", "application/json"}) {code} for both PUT methods and have JSONMessageBodyWriter also have the same Produces and renamed to MetadataMessageBodyWriter and in its writeTo check if the passed media type is text/csv or json and convert accordingly. Or a dedicated CSV provider can be added as you suggest. Indeed, returning Metadata directly (wrapped in Response is Ok too) works well with custom MBWs thanks, Sergey was (Author: sergey_beryozkin): Hi Tim We can probably have {code:java} @Produces({"text/csv", "application/json"}) {code} for both PUT methods and have JSONMessageBodyWriter also have the same Produces and renamed to MetadataMessageBodyWriter and in its writre to check if the passed media type is text/csv or json and convert accordingly. Or a dedicated CSV provider can be added as you suggest. Indeed, returning Metadata directly (wrapped in Response is Ok too) works well with custom MBWs thanks, Sergey > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. 
> According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
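The single-writer option mentioned above (one provider that inspects the requested media type and converts accordingly) could be sketched along these lines. This is a plain-JDK illustration, not the real MessageBodyWriter interface: the class name, the Map-based metadata stand-in, and the minimal escaping are all assumptions made for the example.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for the proposed MetadataMessageBodyWriter.
// A real JAX-RS provider would implement MessageBodyWriter and receive a MediaType.
class MetadataWriterSketch {

    // Dispatch on the negotiated media type, as the comment suggests writeTo would.
    static String write(Map<String, String> metadata, String mediaType) {
        if ("application/json".equals(mediaType)) {
            return toJson(metadata);
        }
        if ("text/csv".equals(mediaType)) {
            return toCsv(metadata);
        }
        throw new IllegalArgumentException("Unsupported media type: " + mediaType);
    }

    // One quoted "name","value" row per metadata entry.
    private static String toCsv(Map<String, String> metadata) {
        StringJoiner rows = new StringJoiner("\n");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            rows.add(csvQuote(e.getKey()) + "," + csvQuote(e.getValue()));
        }
        return rows.toString();
    }

    // A flat JSON object with one string field per metadata entry.
    private static String toJson(Map<String, String> metadata) {
        StringJoiner fields = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            fields.add(jsonQuote(e.getKey()) + ":" + jsonQuote(e.getValue()));
        }
        return fields.toString();
    }

    // CSV quoting: wrap in double quotes and double any embedded quote.
    private static String csvQuote(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    // Minimal JSON escaping (backslash and quote only; control characters are not handled).
    private static String jsonQuote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

The alternative raised in the same comment, a dedicated CSV provider next to the JSON one, would split the two branches into separate classes instead of dispatching inside one.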
[jira] [Commented] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251920#comment-14251920 ] Sergey Beryozkin commented on TIKA-1497: Hi Tim We can probably have {code:java} @Produces({"text/csv", "application/json"}) {code} for both PUT methods and have JSONMessageBodyWriter also have the same Produces and renamed to MetadataMessageBodyWriter and in its writeTo check if the passed media type is text/csv or json and convert accordingly. Or a dedicated CSV provider can be added as you suggest. Indeed, returning Metadata directly (wrapped in Response is Ok too) works well with custom MBWs thanks, Sergey > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. > According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1497) tika-server cannot output JSON
[ https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251920#comment-14251920 ] Sergey Beryozkin edited comment on TIKA-1497 at 12/18/14 5:27 PM: -- Hi Tim We can probably have {code:java} @Produces({"text/csv", "application/json"}) {code} for both PUT methods and have JSONMessageBodyWriter also have the same Produces and renamed to MetadataMessageBodyWriter and in its writre to check if the passed media type is text/csv or json and convert accordingly. Or a dedicated CSV provider can be added as you suggest. Indeed, returning Metadata directly (wrapped in Response is Ok too) works well with custom MBWs thanks, Sergey was (Author: sergey_beryozkin): Hi Tim We can probably have {code:java} @Produces({"text/csv", "application/json"}) {code:java} for both PUT methods and have JSONMessageBodyWriter also have the same Produces and renamed to MetadataMessageBodyWriter and in its writre to check if the passed media type is text/csv or json and convert accordingly. Or a dedicated CSV provider can be added as you suggest. Indeed, returning Metadata directly (wrapped in Response is Ok too) works well with custom MBWs thanks, Sergey > tika-server cannot output JSON > -- > > Key: TIKA-1497 > URL: https://issues.apache.org/jira/browse/TIKA-1497 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Peter Bowyer > Attachments: TIKA-1497.patch > > > I would like the response from > curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta > to be JSON and not CSV?. > I've discovered JSONMessageBodyWriter.java > (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java) > so I think the functionality is present, tried adding --header "Accept: > application/json" to the cURL call, in line with the documentation for > outputting CSV, but no luck so far. 
> According to [~sergey_beryozkin] > "I see MetadataResource returning StreamingOutput and it has > @Produces(text/csv) only. As such this MBW has no effect at the moment. > We can update MetadataResource to return Metadata directly if > application/json is requested or update MetadataResource to directly convert > Metadata to JSON in case of JSON being accepted." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1494) JAXRS server: allow passing PDF password in the request
[ https://issues.apache.org/jira/browse/TIKA-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246552#comment-14246552 ] Sergey Beryozkin commented on TIKA-1494: Hi, looks like it is not recommended any longer: https://tools.ietf.org/html/rfc7231#section-8.3.1 Cheers, Sergey > JAXRS server: allow passing PDF password in the request > --- > > Key: TIKA-1494 > URL: https://issues.apache.org/jira/browse/TIKA-1494 > Project: Tika > Issue Type: New Feature > Components: server >Reporter: Peter Bowyer > Labels: encryption, pdf, server > Fix For: 1.7 > > > I have to extract content from encrypted PDFs. Setting the PDF password using > the TIKA_PASSWORD environment variable works, however it only allows for one > PDF password. > It would be very useful to be able to pass the password in during the HTTP > request - as a header or by some other means. This way I can run a central > JAXRS server and extract content from all PDFs, rather than a separate server > for each department/group that has its own password. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1494) JAXRS server: allow passing PDF password in the request
[ https://issues.apache.org/jira/browse/TIKA-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244076#comment-14244076 ] Sergey Beryozkin commented on TIKA-1494: Hi Nick, All It should work indeed. By the way, should it depend on the resolution of https://issues.apache.org/jira/browse/TIKA-894 ? May be at least related to... We can definitely support passing passwords and reporting the decrypted content over plain HTTP as POC but some users would likely prefer having at least a 1-way TLS activated which is where running a Tika in a servlet container would help. Sergey > JAXRS server: allow passing PDF password in the request > --- > > Key: TIKA-1494 > URL: https://issues.apache.org/jira/browse/TIKA-1494 > Project: Tika > Issue Type: New Feature > Components: server >Reporter: Peter Bowyer > Labels: encryption, pdf, server > > I have to extract content from encrypted PDFs. Setting the PDF password using > the TIKA_PASSWORD environment variable works, however it only allows for one > PDF password. > It would be very useful to be able to pass the password in during the HTTP > request - as a header or by some other means. This way I can run a central > JAXRS server and extract content from all PDFs, rather than a separate server > for each department/group that has its own password. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244068#comment-14244068 ] Sergey Beryozkin commented on TIKA-894: --- Hi Lewis, are you still interested, may be you can find some time before Christmas :-) ? I tried to do a quick fix but got confused with what should be included as far as various Tika dependencies/parsers are concerned... Cheers, Sergey > Add webapp mode for Tika Server, simplifies deployment > -- > > Key: TIKA-894 > URL: https://issues.apache.org/jira/browse/TIKA-894 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.1, 1.2 >Reporter: Chris Wilson > Labels: maven, newbie, patch > Fix For: 1.7 > > Attachments: tika-server-webapp.patch > > > For use in production services, Tika Server should really be deployed as a > WAR file, under a reliable servlet container that knows how to run as a > system service, for example Tomcat or JBoss. > This is especially important on Windows, where I wasted an entire day trying > to make TikaServerCli run as some kind of a service. > Maven makes building a webapp pretty trivial. With the attached patch > applied, "mvn war:war" should work. It seems to run fine in Tomcat, which > makes Windows deployment much simpler. Just install Tomcat and drop the WAR > file into tomcat's webapps directory and you're away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-894: -- Fix Version/s: 1.7 > Add webapp mode for Tika Server, simplifies deployment > -- > > Key: TIKA-894 > URL: https://issues.apache.org/jira/browse/TIKA-894 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.1, 1.2 >Reporter: Chris Wilson > Labels: maven, newbie, patch > Fix For: 1.7 > > Attachments: tika-server-webapp.patch > > > For use in production services, Tika Server should really be deployed as a > WAR file, under a reliable servlet container that knows how to run as a > system service, for example Tomcat or JBoss. > This is especially important on Windows, where I wasted an entire day trying > to make TikaServerCli run as some kind of a service. > Maven makes building a webapp pretty trivial. With the attached patch > applied, "mvn war:war" should work. It seems to run fine in Tomcat, which > makes Windows deployment much simpler. Just install Tomcat and drop the WAR > file into tomcat's webapps directory and you're away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1481) TikaJAXRS get metadata calls give different results
[ https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505 ] Sergey Beryozkin edited comment on TIKA-1481 at 11/19/14 9:11 PM: -- Hi Darya It is something to do with the curl options. -T is effectively a form payload AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body payload. Please use a tcp trace and see what is different. By the way - it would be more beneficial for the community at large if you could ask the questions at the users list - the questions raised at JIRAs have a very low visibility, unless they do identify genuine issue Thanks, Sergey was (Author: sergey_beryozkin): Hi Darya It is something to do with the curl options. -T is effectively a form payload AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body payload. Please use a tcp trace and see whta is different. By the way - it would be more beneficial for the community at large if you could ask the questions at the users list - the questions raised at JIRAs have a very low visibility, unless they do identify genuine issue Thanks, Sergey > TikaJAXRS get metadata calls give different results > --- > > Key: TIKA-1481 > URL: https://issues.apache.org/jira/browse/TIKA-1481 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6 > Environment: Windows 8, JDK 1.8 >Reporter: Darya Arbuzova >Priority: Minor > Attachments: sample.csv > > > Hello! > I'm trying to use Tika in server mode. > I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/. 
> I have tried to get file metadata in 2 different ways (as explained here: > http://wiki.apache.org/tika/TikaJAXRS ): > {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: > text/csv"}} > {{"Content-Encoding","windows-1252"}} > {{"Content-Type","text/plain; charset=windows-1252"}} > and > {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header > "Content-Type: text/csv"}} > {{"Content-Encoding","ISO-8859-1"}} > {{"Content-Type","text/plain; charset=ISO-8859-1"}} > How come they give different results in encoding if I call the same > {{http://localhost:9998/meta}}? > What could the other differences appear and which is the preferable way to > get metadata? > Many thanks! > Best regards, > Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1481) TikaJAXRS get metadata calls give different results
[ https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505 ] Sergey Beryozkin edited comment on TIKA-1481 at 11/19/14 9:12 PM: -- Hi Darya It is something to do with the curl options. -T is effectively a form payload AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body payload. Please use a tcp trace and see what is different. By the way - it would be more beneficial for the community at large if you could ask the questions at the users list - the questions raised at JIRAs have a very low visibility, unless they do identify genuine issues Thanks, Sergey was (Author: sergey_beryozkin): Hi Darya It is something to do with the curl options. -T is effectively a form payload AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body payload. Please use a tcp trace and see what is different. By the way - it would be more beneficial for the community at large if you could ask the questions at the users list - the questions raised at JIRAs have a very low visibility, unless they do identify genuine issue Thanks, Sergey > TikaJAXRS get metadata calls give different results > --- > > Key: TIKA-1481 > URL: https://issues.apache.org/jira/browse/TIKA-1481 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6 > Environment: Windows 8, JDK 1.8 >Reporter: Darya Arbuzova >Priority: Minor > Attachments: sample.csv > > > Hello! > I'm trying to use Tika in server mode. > I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/. 
> I have tried to get file metadata in 2 different ways (as explained here: > http://wiki.apache.org/tika/TikaJAXRS ): > {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: > text/csv"}} > {{"Content-Encoding","windows-1252"}} > {{"Content-Type","text/plain; charset=windows-1252"}} > and > {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header > "Content-Type: text/csv"}} > {{"Content-Encoding","ISO-8859-1"}} > {{"Content-Type","text/plain; charset=ISO-8859-1"}} > How come they give different results in encoding if I call the same > {{http://localhost:9998/meta}}? > What could the other differences appear and which is the preferable way to > get metadata? > Many thanks! > Best regards, > Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1481) TikaJAXRS get metadata calls give different results
[ https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505 ] Sergey Beryozkin commented on TIKA-1481: Hi Darya It is something to do with the curl options. -T is effectively a form payload AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body payload. Please use a tcp trace and see whta is different. By the way - it would be more beneficial for the community at large if you could ask the questions at the users list - the questions raised at JIRAs have a very low visibility, unless they do identify genuine issue Thanks, Sergey > TikaJAXRS get metadata calls give different results > --- > > Key: TIKA-1481 > URL: https://issues.apache.org/jira/browse/TIKA-1481 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6 > Environment: Windows 8, JDK 1.8 >Reporter: Darya Arbuzova >Priority: Minor > Attachments: sample.csv > > > Hello! > I'm trying to use Tika in server mode. > I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/. > I have tried to get file metadata in 2 different ways (as explained here: > http://wiki.apache.org/tika/TikaJAXRS ): > {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: > text/csv"}} > {{"Content-Encoding","windows-1252"}} > {{"Content-Type","text/plain; charset=windows-1252"}} > and > {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header > "Content-Type: text/csv"}} > {{"Content-Encoding","ISO-8859-1"}} > {{"Content-Type","text/plain; charset=ISO-8859-1"}} > How come they give different results in encoding if I call the same > {{http://localhost:9998/meta}}? > What could the other differences appear and which is the preferable way to > get metadata? > Many thanks! > Best regards, > Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
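A note on the two curl invocations quoted above: curl's manual documents that -d @file strips carriage returns and newlines from the file before sending it, whereas -T (and --data-binary) send the file bytes unchanged. That alone means the server may see different bytes for the same sample.csv, which could plausibly affect charset detection; whether it is the whole explanation here is not confirmed in this thread. The sketch below only imitates that stripping step in Java. The class and method names are made up for illustration and are not part of Tika or curl.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class CurlPayloadSketch {

    /** Imitates the documented effect of curl's "-d @file": CR and LF bytes are dropped. */
    static byte[] stripLineBreaks(byte[] input) {
        byte[] out = new byte[input.length];
        int n = 0;
        for (byte b : input) {
            if (b != '\r' && b != '\n') {
                out[n++] = b; // keep every byte that is not a line break
            }
        }
        return Arrays.copyOf(out, n); // trim to the number of bytes kept
    }
}
```

For a CSV with CRLF line endings, the stripped payload is shorter than the file on disk and has no row separators left. Comparing the two byte arrays is a quick way to see what each invocation would put on the wire; using --data-binary @sample.csv instead of -d should keep the bytes intact.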
[jira] [Resolved] (TIKA-1242) Update CXF version to 3.0.2
[ https://issues.apache.org/jira/browse/TIKA-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved TIKA-1242. Resolution: Fixed > Update CXF version to 3.0.2 > --- > > Key: TIKA-1242 > URL: https://issues.apache.org/jira/browse/TIKA-1242 > Project: Tika > Issue Type: Task > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Minor > Fix For: 1.7 > > > CXF 3.0.2 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer > dependencies and is smaller compared to CXF 2.7.x one. It is also > backward-compatible with the applications written against JAX-RS 1.1. > Lets do this upgrade after Tika 1.6 is out. > CXF 3.1.0 is Java 7 based and is still under development. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1242) Update CXF version to 3.0.2
[ https://issues.apache.org/jira/browse/TIKA-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-1242: --- Description: CXF 3.0.2 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer dependencies and is smaller compared to CXF 2.7.x one. It is also backward-compatible with the applications written against JAX-RS 1.1. Lets do this upgrade after Tika 1.6 is out. CXF 3.1.0 is Java 7 based and is still under development. was: CXF 3.1.0 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer dependencies and is smaller compared to CXF 2.7.x one. It is also backward-compatible with the applications written against JAX-RS 1.1. Lets do this upgrade after Tika 1.6 is out Summary: Update CXF version to 3.0.2 (was: Update CXF version to 3.1.0) > Update CXF version to 3.0.2 > --- > > Key: TIKA-1242 > URL: https://issues.apache.org/jira/browse/TIKA-1242 > Project: Tika > Issue Type: Task > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Minor > Fix For: 1.7 > > > CXF 3.0.2 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer > dependencies and is smaller compared to CXF 2.7.x one. It is also > backward-compatible with the applications written against JAX-RS 1.1. > Lets do this upgrade after Tika 1.6 is out. > CXF 3.1.0 is Java 7 based and is still under development. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088100#comment-14088100 ] Sergey Beryozkin commented on TIKA-1371: I've no idea to be honest, it is there: http://svn.apache.org/viewvc/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java?r1=1616118&r2=1616117&pathrev=1616118 Cheers, Sergey > passing parameters via URL no longer works (regression) > --- > > Key: TIKA-1371 > URL: https://issues.apache.org/jira/browse/TIKA-1371 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.5 >Reporter: Rob Tulloh > > In Tika 1.1 and 1.2, it was possible to add some values to the URL that get > logged like this: > http://localhost:9998/tika/GUID/FILENAME > This was very useful for correlating between client and server in a > distributed compute environment. In 1.5 and in the nightly builds (for 1.6), > this feature no longer works. Not having this makes it very difficult to > troubleshoot problems with document processing in a distributed environment. > Please add back this feature so that operations and development teams can > more easily figure out which tika instance is processing which document and > what the result of the processing was. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087404#comment-14087404 ] Sergey Beryozkin commented on TIKA-1371: I've introduced a TikaLoggingFilter; it logs the request URI at info or debug level. The command line option is "-l" (or "log"), and it takes either 'debug' or 'info' as the level. I hope this resolves the issue. > passing parameters via URL no longer works (regression) > --- > > Key: TIKA-1371 > URL: https://issues.apache.org/jira/browse/TIKA-1371 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.5 >Reporter: Rob Tulloh > > In Tika 1.1 and 1.2, it was possible to add some values to the URL that get > logged like this: > http://localhost:9998/tika/GUID/FILENAME > This was very useful for correlating between client and server in a > distributed compute environment. In 1.5 and in the nightly builds (for 1.6), > this feature no longer works. Not having this makes it very difficult to > troubleshoot problems with document processing in a distributed environment. > Please add back this feature so that operations and development teams can > more easily figure out which tika instance is processing which document and > what the result of the processing was. -- This message was sent by Atlassian JIRA (v6.2#6252)
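The option described above selects the level at which the request URI is logged. As a rough illustration only, the idea can be sketched with the JDK alone. This is not the TikaLoggingFilter source: the class name, the method names and the mapping of 'debug' to java.util.logging's FINE level are all assumptions made for this sketch.

```java
import java.net.URI;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a request-logging filter; not Tika code.
final class RequestUriLogSketch {
    private static final Logger LOG = Logger.getLogger("RequestUriLogSketch");

    /** Maps the command line value ('debug' or 'info') to a logging level. */
    static Level parseLevel(String option) {
        if ("debug".equalsIgnoreCase(option)) {
            return Level.FINE; // assumption: 'debug' corresponds to FINE
        }
        if ("info".equalsIgnoreCase(option)) {
            return Level.INFO;
        }
        throw new IllegalArgumentException("expected 'debug' or 'info': " + option);
    }

    /** Logs the full request URI at the chosen level and returns the logged message. */
    static String logRequest(URI requestUri, Level level) {
        String message = "Request URI: " + requestUri;
        LOG.log(level, message);
        return message;
    }
}
```

With something along these lines, a client that appends its own correlation values to the path (for example http://localhost:9998/tika/GUID/FILENAME, as in the report) would see them in the server log because the whole URI is written out.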
[jira] [Commented] (TIKA-1383) Simplify TikeServerCli endpoint setup code
[ https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086671#comment-14086671 ] Sergey Beryozkin commented on TIKA-1383: Hi Nick, yes, I messed it up a bit, removed the test by accident, realized it soon after I signed off :-). Thanks for fixing it > Simplify TikeServerCli endpoint setup code > -- > > Key: TIKA-1383 > URL: https://issues.apache.org/jira/browse/TIKA-1383 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.6 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1383) Simplify TikeServerCli endpoint setup code
[ https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086334#comment-14086334 ] Sergey Beryozkin commented on TIKA-1383: Sorry, I updated TikaWelcome to accept a list of resource providers as opposed to a factory bean. Its class needs a bit of cleanup to avoid some synchronization issues, but IMHO bypassing the factory is probably better going forward. I've updated the welcome test where HTML is tested; this was apparently coming from TikaWelcome itself (another reason to bypass the factory), and I guess TikaWelcome should not report its own details. I'm off now but I will look into fixing the issues, if any, later this evening or tomorrow morning. Cheers, Sergey > Simplify TikeServerCli endpoint setup code > -- > > Key: TIKA-1383 > URL: https://issues.apache.org/jira/browse/TIKA-1383 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.6 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1383) Simplify TikeServerCli endpoint setup code
[ https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086249#comment-14086249 ] Sergey Beryozkin commented on TIKA-1383: Hi Nick, All, I've removed some redundant code in the server registration code. If possible, please sanity-check that it has not caused any side effects; all appears to be OK to me. Sergey > Simplify TikeServerCli endpoint setup code > -- > > Key: TIKA-1383 > URL: https://issues.apache.org/jira/browse/TIKA-1383 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.6 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-1383) Simplify TikeServerCli endpoint setup code
[ https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin resolved TIKA-1383. Resolution: Fixed > Simplify TikeServerCli endpoint setup code > -- > > Key: TIKA-1383 > URL: https://issues.apache.org/jira/browse/TIKA-1383 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Sergey Beryozkin >Assignee: Sergey Beryozkin >Priority: Trivial > Fix For: 1.6 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1383) Simplify TikeServerCli endpoint setup code
Sergey Beryozkin created TIKA-1383: -- Summary: Simplify TikeServerCli endpoint setup code Key: TIKA-1383 URL: https://issues.apache.org/jira/browse/TIKA-1383 Project: Tika Issue Type: Improvement Components: server Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Trivial Fix For: 1.6 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071516#comment-14071516 ] Sergey Beryozkin commented on TIKA-1371: Can you please clarify what exactly does not work? Is the absolute request URI not logged? Please post a sample request URI issued against Tika 1.5 and explain what you expect the server to do. Thanks, Sergey > passing parameters via URL no longer works (regression) > --- > > Key: TIKA-1371 > URL: https://issues.apache.org/jira/browse/TIKA-1371 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.5 >Reporter: Rob Tulloh > > In Tika 1.1 and 1.2, it was possible to add some values to the URL that get > logged like this: > http://localhost:9998/tika/GUID/FILENAME > This was very useful for correlating between client and server in a > distributed compute environment. In 1.5 and in the nightly builds (for 1.6), > this feature no longer works. Not having this makes it very difficult to > troubleshoot problems with document processing in a distributed environment. > Please add back this feature so that operations and development teams can > more easily figure out which tika instance is processing which document and > what the result of the processing was. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062606#comment-14062606 ]

Sergey Beryozkin commented on TIKA-1367:

Thanks for the proposal, I'm not sure though it would help. Consider a user who does not necessarily know what 'grep' is, for example someone working on Windows. Ideally, as a user, I'd like to have an easy way to solve this typical dependency issue: "My application will work with PDFs and OpenDocument docs only, how can I get everything except the relevant dependencies excluded?". I know that searching the source and the Maven output can yield some info, but that is not something every user can be expected to be able to do.

For the record, here's what I see after grepping dependency:tree

{noformat}
[INFO] +- org.apache.tika:tika-core:jar:1.6-SNAPSHOT:compile
[INFO] +- org.gagravarr:vorbis-java-tika:jar:0.6:compile
[INFO] +- edu.ucar:netcdf:jar:4.2.20:compile
[INFO] |  +- edu.ucar:unidataCommon:jar:4.2.20:compile
[INFO] |  |  \- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.6.1:compile
[INFO] +- net.sourceforge.jmatio:jmatio:jar:1.0:compile
[INFO] +- org.apache.james:apache-mime4j-core:jar:0.7.2:compile
[INFO] +- org.apache.james:apache-mime4j-dom:jar:0.7.2:compile
[INFO] +- org.apache.commons:commons-compress:jar:1.8:compile
[INFO] |  \- org.tukaani:xz:jar:1.5:compile
[INFO] +- commons-codec:commons-codec:jar:1.5:compile
[INFO] +- org.apache.pdfbox:pdfbox:jar:1.8.6:compile
[INFO] |  +- org.apache.pdfbox:fontbox:jar:1.8.6:compile
[INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.6:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
[INFO] +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
[INFO] +- org.apache.poi:poi:jar:3.10-FINAL:compile
[INFO] +- org.apache.poi:poi-scratchpad:jar:3.10-FINAL:compile
[INFO] +- org.apache.poi:poi-ooxml:jar:3.10-FINAL:compile
[INFO] |  +- org.apache.poi:poi-ooxml-schemas:jar:3.10-FINAL:compile
[INFO] |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
[INFO] |  \- dom4j:dom4j:jar:1.6.1:compile
[INFO] +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
[INFO] +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] +- org.ow2.asm:asm-debug-all:jar:4.1:compile
[INFO] +- com.googlecode.mp4parser:isoparser:jar:1.0-RC-1:compile
[INFO] |  \- org.aspectj:aspectjrt:jar:1.6.11:compile
[INFO] +- com.drewnoakes:metadata-extractor:jar:2.6.2:compile
[INFO] |  +- com.adobe.xmp:xmpcore:jar:5.1.2:compile
[INFO] |  \- xerces:xercesImpl:jar:2.8.1:compile
[INFO] |     \- xml-apis:xml-apis:jar:1.3.03:compile
[INFO] +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] +- rome:rome:jar:1.0:compile
[INFO] |  \- jdom:jdom:jar:1.0:compile
[INFO] +- org.gagravarr:vorbis-java-core:jar:0.6:compile
[INFO] +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] +- com.uwyn:jhighlight:jar:1.0:compile
[INFO] +- com.pff:java-libpst:jar:0.8.1:compile
{noformat}

It's a difficult task to start excluding. As a user, I've no idea what many of those dependencies are for, or whether some of them are needed by all Parser implementations. It's easy enough to spot what the PDF Parser will need (pdfbox), but trickier to see what else might be needed for PDF as well as for other types.

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
> Issue Type: Improvement
> Components: documentation
> Reporter: Sergey Beryozkin
> Fix For: 1.6
>
>
> tika-parsers module has many strong transitive parser dependencies. Maven
> users of tika-parsers have to exclude all the transitive dependencies
> manually. Documenting the list of the existing transitive dependencies and
> keeping the list up to date will help developers exclude the libraries not
> needed for a given project.

-- This message was sent by Atlassian JIRA (v6.2#6252)
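For readers without grep at hand, the dependency:tree listing in the comment above can also be reduced to its direct (top-level) dependencies with a few lines of Java. This is only a sketch under stated assumptions: it assumes the tree output has already been captured as text lines in the "[INFO] +- ..." layout shown in the comment, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it only post-processes captured text.
final class DirectDependenciesSketch {
    private static final String BRANCH = "[INFO] +- ";   // a direct dependency
    private static final String LAST = "[INFO] \\- ";    // the last direct dependency

    /** Returns the coordinates of direct dependencies, skipping nested (transitive) lines. */
    static List<String> directDependencies(List<String> treeLines) {
        List<String> result = new ArrayList<>();
        for (String line : treeLines) {
            if (line.startsWith(BRANCH) || line.startsWith(LAST)) {
                // both prefixes have the same length, so one substring call is enough
                result.add(line.substring(BRANCH.length()).trim());
            }
        }
        return result;
    }
}
```

Nested lines start with "[INFO] |" or extra indentation and are skipped, so the result is the list of libraries that tika-parsers pulls in directly. Which of them a given parser actually needs still has to come from documentation, which is the point of this issue.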
[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098 ] Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:14 PM: - #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. I could've said: what don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of times that users may be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution which will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey was (Author: sergey_beryozkin): #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. 
I could've said: what don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of time that users may be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution which will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey > Improve the modularity of tika-parsers > -- > > Key: TIKA-1368 > URL: https://issues.apache.org/jira/browse/TIKA-1368 > Project: Tika > Issue Type: Improvement > Components: packaging, parser >Affects Versions: 1.7 >Reporter: Sergey Beryozkin > > tika-parsers module has many strong transitive dependencies. This presents a > challenge to Maven tika-parsers users wishing to use only one or very few > Parser(s). > The fact the new Parsers are regularly added makes the exclusion process very > brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 > and having an exclusion list in place may 'leak' a new parser lib into its > runtime. > https://issues.apache.org/jira/browse/TIKA-1367 > can help on its own but a more complete solution would ideally be in place. > Proposal: > 1. Make tika-parsers transitive dependencies optional > 2. 
Introduce tika-parsers-optional pom that will depend on tika-parsers but > exclude 3rd-party dependencies > Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, > users will be recommended to check the documentation and add the required > dependencies. 2 also works. -- This message was sent by Atlassian JIRA (v6.2#6252)
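If the third-party libraries become optional, as in proposal 1, application code would still compile while a parser's library may be absent at runtime, which is the runtime failure discussed in the comment. One way an application could guard against that is a startup check that the classes it relies on are present. The sketch below uses only the JDK; the class and method names are hypothetical and this is not something Tika itself is stated to provide.

```java
// Hypothetical startup check; not part of Tika.
final class OptionalDependencySketch {

    /** Returns true if the named class can be found on the classpath, without initializing it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, OptionalDependencySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // missing class, or a class whose own dependencies are missing
            return false;
        }
    }
}
```

An application that only handles PDFs might, for example, call isAvailable("org.apache.pdfbox.pdmodel.PDDocument") at startup and fail fast with a clear message, rather than hitting an error when the first document arrives.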
[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098 ] Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:14 PM: - #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. I could've said: what don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of time that users may be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution which will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey was (Author: sergey_beryozkin): #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. 
I could've said: what don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of time that users may be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution who will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey > Improve the modularity of tika-parsers > -- > > Key: TIKA-1368 > URL: https://issues.apache.org/jira/browse/TIKA-1368 > Project: Tika > Issue Type: Improvement > Components: packaging, parser >Affects Versions: 1.7 >Reporter: Sergey Beryozkin > > tika-parsers module has many strong transitive dependencies. This presents a > challenge to Maven tika-parsers users wishing to use only one or very few > Parser(s). > The fact the new Parsers are regularly added makes the exclusion process very > brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 > and having an exclusion list in place may 'leak' a new parser lib into its > runtime. > https://issues.apache.org/jira/browse/TIKA-1367 > can help on its own but a more complete solution would ideally be in place. > Proposal: > 1. Make tika-parsers transitive dependencies optional > 2. 
Introduce tika-parsers-optional pom that will depend on tika-parsers but > exclude 3rd-party dependencies > Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, > users will be recommended to check the documentation and add the required > dependencies. 2 also works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098 ] Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:15 PM: - #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. I could've said: why don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of times that users may be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution which will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey was (Author: sergey_beryozkin): #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. 
I could've said: what don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of times that users may be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution which will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey > Improve the modularity of tika-parsers > -- > > Key: TIKA-1368 > URL: https://issues.apache.org/jira/browse/TIKA-1368 > Project: Tika > Issue Type: Improvement > Components: packaging, parser >Affects Versions: 1.7 >Reporter: Sergey Beryozkin > > tika-parsers module has many strong transitive dependencies. This presents a > challenge to Maven tika-parsers users wishing to use only one or very few > Parser(s). > The fact the new Parsers are regularly added makes the exclusion process very > brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 > and having an exclusion list in place may 'leak' a new parser lib into its > runtime. > https://issues.apache.org/jira/browse/TIKA-1367 > can help on its own but a more complete solution would ideally be in place. > Proposal: > 1. Make tika-parsers transitive dependencies optional > 2. 
Introduce tika-parsers-optional pom that will depend on tika-parsers but > exclude 3rd-party dependencies > Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, > users will be recommended to check the documentation and add the required > dependencies. 2 also works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1368) Improve the modularity of tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098 ] Sergey Beryozkin commented on TIKA-1368: #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable; with #2, tika-parsers would stay as it is now. Speaking of the runtime exception you are referring to: I'm sorry, but it appears to be somewhat academic. I could've said: why don't we have a single Tika module only, so that users do not accidentally forget to include "tika-parsers", given that the source code would compile even without including tika-parsers? What kind of application is it? Is it expected to have some tests :-)? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example, please? It's not a huge issue, but I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of times that users may be affected. IMHO users who would just like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand, we will ship a simple Tika-based solution which will be exposed to our users; who would help those users who'd have to manually exclude many dependencies from tika-parsers? Thanks, Sergey > Improve the modularity of tika-parsers > -- > > Key: TIKA-1368 > URL: https://issues.apache.org/jira/browse/TIKA-1368 > Project: Tika > Issue Type: Improvement > Components: packaging, parser >Affects Versions: 1.7 >Reporter: Sergey Beryozkin > > tika-parsers module has many strong transitive dependencies. This presents a > challenge to Maven tika-parsers users wishing to use only one or very few > Parser(s). 
> The fact the new Parsers are regularly added makes the exclusion process very > brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 > and having an exclusion list in place may 'leak' a new parser lib into its > runtime. > https://issues.apache.org/jira/browse/TIKA-1367 > can help on its own but a more complete solution would ideally be in place. > Proposal: > 1. Make tika-parsers transitive dependencies optional > 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but > exclude 3rd-party dependencies > Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, > users will be recommended to check the documentation and add the required > dependencies. 2 also works.
[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098 ] Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:14 PM: - #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable; with #2, tika-parsers would stay as it is now. Speaking of the runtime exception you are referring to: I'm sorry, but it appears to be somewhat academic. I could've said: why don't we have a single Tika module only, so that users do not accidentally forget to include "tika-parsers", given that the source code would compile even without including tika-parsers? What kind of application is it? Is it expected to have some tests :-)? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example, please? It's not a huge issue, but I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of times that users may be affected. IMHO users who would just like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand, we will ship a simple Tika-based solution which will be exposed to our users; who would help those users who'd have to manually exclude many dependencies from tika-parsers? Thanks, Sergey was (Author: sergey_beryozkin): #2 is for those users who know what they want, so I disagree with you qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now. Speaking of this runtime exception you are referring to. I'm sorry but it appears to be somewhat academic. 
I could've said: what don't we have a single Tika module only so that users accidentally do not forget include "tika-parsers" given that the source code would compile even without including tika-parsers. What kind of application is it ? Is it expected to have some tests :-) ? I can only think of the completely generic Tika container, which is TikaServer. But TikaServer would be prepackaged. Can you offer a more realistic example please ? It's not a huge issue. But I hope we will come up with a basic solution without getting locked into arguments :-). I've heard a number of time that users mat be affected. IMHO users who just would like to do a quick experiment can download the whole distro or use TikaServer. IMHO this is a rather narrow space where we have a Tika application which can accept anything without users paying any attention to the actual dependencies. On the other hand we will ship a simple Tika-based solution who will be exposed to our users, who would help those users who'd have to manually exclude many dependencies from tika-parsers ? Thanks, Sergey > Improve the modularity of tika-parsers > -- > > Key: TIKA-1368 > URL: https://issues.apache.org/jira/browse/TIKA-1368 > Project: Tika > Issue Type: Improvement > Components: packaging, parser >Affects Versions: 1.7 >Reporter: Sergey Beryozkin > > tika-parsers module has many strong transitive dependencies. This presents a > challenge to Maven tika-parsers users wishing to use only one or very few > Parser(s). > The fact the new Parsers are regularly added makes the exclusion process very > brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 > and having an exclusion list in place may 'leak' a new parser lib into its > runtime. > https://issues.apache.org/jira/browse/TIKA-1367 > can help on its own but a more complete solution would ideally be in place. > Proposal: > 1. Make tika-parsers transitive dependencies optional > 2. 
Introduce tika-parsers-optional pom that will depend on tika-parsers but > exclude 3rd-party dependencies > Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, > users will be recommended to check the documentation and add the required > dependencies. 2 also works.
[jira] [Updated] (TIKA-1368) Improve the modularity of tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Beryozkin updated TIKA-1368: --- Affects Version/s: 1.7 > Improve the modularity of tika-parsers > -- > > Key: TIKA-1368 > URL: https://issues.apache.org/jira/browse/TIKA-1368 > Project: Tika > Issue Type: Improvement > Components: packaging, parser >Affects Versions: 1.7 >Reporter: Sergey Beryozkin > > tika-parsers module has many strong transitive dependencies. This presents a > challenge to Maven tika-parsers users wishing to use only one or very few > Parser(s). > The fact the new Parsers are regularly added makes the exclusion process very > brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 > and having an exclusion list in place may 'leak' a new parser lib into its > runtime. > https://issues.apache.org/jira/browse/TIKA-1367 > can help on its own but a more complete solution would ideally be in place. > Proposal: > 1. Make tika-parsers transitive dependencies optional > 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but > exclude 3rd-party dependencies > Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, > users will be recommended to check the documentation and add the required > dependencies. 2 also works.
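The runtime-exception concern discussed in this thread arises when third-party parser libraries become optional (proposal 1) and an application forgets to add one it needs. As a rough illustration only, and not something proposed in the issue, an application could check at startup that the classes it relies on are actually on the classpath. The class name OptionalDependencyCheck and the checked class names below are hypothetical and are not part of Tika; only the JDK's Class.forName is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): reports which required classes
// are missing from the classpath so the application can fail fast.
final class OptionalDependencyCheck {

    // Returns true if the named class can be located without initializing it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, OptionalDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of the given class names that cannot be loaded.
    static List<String> missing(List<String> requiredClasses) {
        List<String> result = new ArrayList<>();
        for (String name : requiredClasses) {
            if (!isPresent(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Both names are placeholders: one JDK class that always exists and
        // one invented name standing in for an optional parser library.
        List<String> required = Arrays.asList(
                "java.lang.String",
                "org.example.hypothetical.MissingParserLibrary");
        List<String> absent = missing(required);
        if (!absent.isEmpty()) {
            System.err.println("Missing optional dependencies: " + absent);
        }
    }
}
```

A check like this, or a simple integration test that parses one sample document per supported format, would surface a forgotten dependency before production rather than at the first parse call.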