[jira] [Commented] (TIKA-2944) TikaConfig should support the parameters without XML type attribute

2019-09-23 Thread Sergey Beryozkin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936074#comment-16936074
 ] 

Sergey Beryozkin commented on TIKA-2944:


Having the {{Param}} class support {{boolean}} (the XML Schema type name) in 
addition to {{bool}} should help in the meantime.
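
As a rough illustration of the alias idea, here is a minimal sketch that maps both {{bool}} and {{boolean}} to the same Java type. The class name {{ParamTypeAliases}} and its methods are hypothetical and are not taken from Tika's {{Param}} class; the set of type names is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Tika's Param class: resolves a type name
// from a config file to a Java class, accepting more than one alias.
class ParamTypeAliases {
    private static final Map<String, Class<?>> TYPES = new HashMap<>();

    static {
        TYPES.put("bool", Boolean.class);    // existing short name
        TYPES.put("boolean", Boolean.class); // XML Schema style name, treated as an alias
        TYPES.put("int", Integer.class);
        TYPES.put("string", String.class);
    }

    // Looks up the Java class for a declared type name.
    static Class<?> resolve(String typeName) {
        Class<?> type = TYPES.get(typeName);
        if (type == null) {
            throw new IllegalArgumentException("Unknown parameter type: " + typeName);
        }
        return type;
    }

    // Converts the raw text value according to the declared type name.
    static Object convert(String typeName, String raw) {
        Class<?> type = resolve(typeName);
        if (type == Boolean.class) {
            return Boolean.valueOf(raw);
        }
        if (type == Integer.class) {
            return Integer.valueOf(raw);
        }
        return raw;
    }
}
```

With this shape, adding another alias is a one-line change in the static initializer.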

> TikaConfig should support the parameters without XML type attribute
> ---
>
> Key: TIKA-2944
> URL: https://issues.apache.org/jira/browse/TIKA-2944
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
> Fix For: 2.0.0, 1.23
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-2946) Review how TikaConfig can avoid parsing XML itself

2019-09-17 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2946:
--

 Summary: Review how TikaConfig can avoid parsing XML itself
 Key: TIKA-2946
 URL: https://issues.apache.org/jira/browse/TIKA-2946
 Project: Tika
  Issue Type: Improvement
  Components: config
Reporter: Sergey Beryozkin
 Fix For: 2.0


I currently have some issues with initializing {{TikaConfig}} at Quarkus build 
time. The reason I'd like to do it at build time is to avoid having the XML 
classes loaded into memory when the application starts. Moving the XML parsing 
code out (perhaps into a static TikaConfig factory method) and a few other 
minor tweaks will help.
I'll try to provide more input when 2.0 gets closer.
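
A minimal sketch of the separation, under stated assumptions: {{SketchConfig}} and {{SketchConfigXmlLoader}} are hypothetical names, the {{parser}} element with a {{class}} attribute is only an example layout, and the factory is placed in a separate loader class here (rather than on the config class itself) so that the config class references no XML types.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical, simplified stand-in for a config object; not Tika's TikaConfig.
class SketchConfig {
    private final List<String> parserClassNames;

    // No XML types appear here, so the object can be created from values
    // that were computed earlier, for example at build time.
    SketchConfig(List<String> parserClassNames) {
        this.parserClassNames =
                Collections.unmodifiableList(new ArrayList<>(parserClassNames));
    }

    List<String> getParserClassNames() {
        return parserClassNames;
    }
}

// All XML handling lives in this separate class.
class SketchConfigXmlLoader {
    // Static factory: parses the XML and hands plain values to SketchConfig.
    static SketchConfig fromXml(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            NodeList nodes = doc.getElementsByTagName("parser");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return new SketchConfig(names);
        } catch (Exception e) {
            throw new IllegalStateException("Unable to read the configuration", e);
        }
    }
}
```

Code that already has the parser class names can call the {{SketchConfig}} constructor directly and never touch the loader.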





[jira] [Updated] (TIKA-2945) AutoDetectParser should skip the content type detection if Metadata already has it

2019-09-13 Thread Sergey Beryozkin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2945:
---
Summary: AutoDetectParser should skip the content type detection if 
Metadata already has it  (was: AutoDetectParser should skip the conetnt type 
detection if Metadata already has it)
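
The behaviour named in the summary could look roughly like the sketch below. It is an assumption-laden illustration: metadata is modelled as a plain {{Map}}, the detector as a {{Function}}, and {{DetectionSkipSketch}} is a hypothetical name; none of this is the AutoDetectParser code.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; metadata and detector are simplified stand-ins.
class DetectionSkipSketch {
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the type already recorded in the metadata, and runs the
    // (potentially expensive) detector only when no type is present.
    static String resolveType(Map<String, String> metadata,
                              Function<byte[], String> detector,
                              byte[] content) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.isEmpty()) {
            return declared; // detection skipped
        }
        String detected = detector.apply(content);
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }
}
```

Whether a caller-supplied type should always be trusted is a separate design question that this sketch does not address.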

> AutoDetectParser should skip the content type detection if Metadata already 
> has it
> --
>
> Key: TIKA-2945
> URL: https://issues.apache.org/jira/browse/TIKA-2945
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 2.0.0, 1.23
>
>






[jira] [Created] (TIKA-2943) Modularize tika-parsers

2019-09-13 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2943:
--

 Summary: Modularize tika-parsers
 Key: TIKA-2943
 URL: https://issues.apache.org/jira/browse/TIKA-2943
 Project: Tika
  Issue Type: Improvement
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
 Fix For: 2.0.0


This effort will be based on the work done by Bob at the [2.x 
branch|https://github.com/apache/tika/tree/2.x/tika-parser-modules] 





[jira] [Updated] (TIKA-2944) TikaConfig should support the parameters without XML type attribute

2019-09-13 Thread Sergey Beryozkin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2944:
---
Summary: TikaConfig should support the parameters without XML type 
attribute  (was: TikaConfig should support the parameters with the XML type 
attribute)



[jira] [Created] (TIKA-2945) AutoDetectParser should skip the conetnt type detection if Metadata already has it

2019-09-13 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2945:
--

 Summary: AutoDetectParser should skip the conetnt type detection 
if Metadata already has it
 Key: TIKA-2945
 URL: https://issues.apache.org/jira/browse/TIKA-2945
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
 Fix For: 2.0.0, 1.23








[jira] [Created] (TIKA-2944) TikaConfig should support the parameters with the XML type attribute

2019-09-13 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2944:
--

 Summary: TikaConfig should support the parameters with the XML 
type attribute
 Key: TIKA-2944
 URL: https://issues.apache.org/jira/browse/TIKA-2944
 Project: Tika
  Issue Type: Improvement
  Components: config
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
 Fix For: 2.0.0, 1.23








[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-09-13 Thread Sergey Beryozkin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929130#comment-16929130
 ] 

Sergey Beryozkin commented on TIKA-2882:


I'll create a dedicated issue so that I can link to it from other sources, 
etc.

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Assignee: Sergey Beryozkin
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909117#comment-16909117
 ] 

Sergey Beryozkin commented on TIKA-2882:


OK, I've assigned it to myself. Now that I'll have to do it, I have to say 
that my priority is to prepare well for the ApacheCon EU Tika talk, but I'll 
give it a try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks



[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909117#comment-16909117
 ] 

Sergey Beryozkin edited comment on TIKA-2882 at 8/16/19 3:11 PM:
-

OK, I've assigned it to myself. Now that I'll have to do it, I have to say 
that my priority is to prepare well for the ApacheCon EU Tika talk, but I'll 
give it a try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks


was (Author: sergey_beryozkin):
OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, 
my priority is to prepare to the Apache Con EU Tika talk well, but I'll give a 
try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks  



[jira] [Assigned] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin reassigned TIKA-2882:
--

Assignee: Sergey Beryozkin



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909101#comment-16909101
 ] 

Sergey Beryozkin commented on TIKA-2882:


Hi Tim

Can we consider giving it a go? Bob agrees to focus on the modules only, so 
all we have to do to get started is create a few modules grouping the 
specific parsers, and have the existing tika-parsers incorporate those new 
modules. There should not even be any coding involved, unless I'm missing 
something. If you can create a quick PR, I can then test it with Quarkus, etc. 
What do you think?



[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

2019-08-12 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905636#comment-16905636
 ] 

Sergey Beryozkin commented on TIKA-2910:


Hi [~talli...@apache.org], IMHO it should be fixed in the 1.x branch as well, 
maybe with a property letting users enable or disable the fix at runtime.
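
One way such a switch might look, as a sketch only: the property name {{example.tika.nestedXmlTextFix}} and the class {{FixToggleSketch}} are invented for this example, and defaulting to "enabled" is an assumption rather than something decided in this issue.

```java
// Hypothetical toggle; the property name is invented for this example.
class FixToggleSketch {
    static final String PROPERTY = "example.tika.nestedXmlTextFix";

    // Assumption: the fix is on unless the property is explicitly "false".
    static boolean isFixEnabled() {
        return !"false".equalsIgnoreCase(System.getProperty(PROPERTY));
    }

    // Chooses between the new and the legacy result based on the toggle.
    static String extract(String fixedResult, String legacyResult) {
        return isFixEnabled() ? fixedResult : legacyResult;
    }
}
```

A user who depends on the old output could then start the JVM with {{-Dexample.tika.nestedXmlTextFix=false}}.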

> Text extraction using Tika command line and Tika server differs
> ---
>
> Key: TIKA-2910
> URL: https://issues.apache.org/jira/browse/TIKA-2910
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.21
>Reporter: Walter
>Priority: Major
>  Labels: newbie
> Attachments: CorpusP_25471990.xml
>
>
> When extracting TXT from the very same XML file using either Tika command 
> line utility or the Tika in server mode, the results differ.
> It looks as if PCDATA in deeper nested XML structures are just ignored and 
> only an empty line is returned.
> I assume both use the same base code. Are there any default settings that may 
> differ or can be set?
>  





[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

2019-08-08 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902845#comment-16902845
 ] 

Sergey Beryozkin commented on TIKA-2910:


Thanks. Tim may be away, so let's wait till he is back.



[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

2019-08-08 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902836#comment-16902836
 ] 

Sergey Beryozkin commented on TIKA-2910:


Hi [~akit], can you please download the source and debug it? That would help.



[jira] [Resolved] (TIKA-2896) NullPointerException in MimeTypesReader.releaseParser()

2019-06-18 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2896.

   Resolution: Fixed
Fix Version/s: 1.22

Thanks for the patch

> NullPointerException in MimeTypesReader.releaseParser()
> ---
>
> Key: TIKA-2896
> URL: https://issues.apache.org/jira/browse/TIKA-2896
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.21
>Reporter: Eamonn Saunders
>Priority: Major
> Fix For: 1.22
>
>
> We have encountered a situation where the call to parser.reset() in the 
> following code snippet results in a NullPointerException.
> {code:java}
>     private static void releaseParser(SAXParser parser) {
>         try {
>             parser.reset();
>         } catch (UnsupportedOperationException e) {
>             //ignore
>         }
> {code}
> releaseParser() is called in the finally block of MimeTypesReader.read()
> {code:java}
>     public void read(InputStream stream) throws IOException, MimeTypeException {
>         SAXParser parser = null;
>         try {
>             parser = acquireSAXParser();
>             parser.parse(stream, this);
>         } catch (TikaException e) {
>             throw new MimeTypeException("Unable to create an XML parser", e);
>         } catch (SAXException e) {
>             throw new MimeTypeException("Invalid type configuration", e);
>         } finally {
>             releaseParser(parser);
>         }
>     }
> {code}
> The parser variable will be null coming out of acquireSAXParser() if 
> acquireSAXParser() is called on a thread that is interrupted (i.e. the 
> InterruptedException is handled in the following code):
> {code:java}
>     private static SAXParser acquireSAXParser()
>             throws TikaException {
>         while (true) {
>             SAXParser parser = null;
>             try {
>                 READ_WRITE_LOCK.readLock().lock();
>                 parser = SAX_PARSERS.poll(10, TimeUnit.MILLISECONDS);
>             } catch (InterruptedException e) {
>                 throw new TikaException("interrupted while waiting for SAXParser", e);
>             } finally {
>                 READ_WRITE_LOCK.readLock().unlock();
>             }
>             if (parser != null) {
>                 return parser;
>             }
>         }
>     }
> {code}
> A simple fix would be to check for null before calling releaseParser() in the 
> finally block.
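
A minimal sketch of the null guard described above. It uses stand-in names ({{NullSafeRelease}}, {{acquire}}, {{read}}, and a {{released}} counter) rather than the actual MimeTypesReader code, and it places the check inside releaseParser() itself, which is only one of the places the check could go.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical stand-in for the acquire/release pattern in the report.
class NullSafeRelease {
    static int released = 0; // counts real releases, for demonstration only

    static void releaseParser(SAXParser parser) {
        if (parser == null) {
            return; // nothing was acquired, so there is nothing to release
        }
        try {
            parser.reset();
        } catch (UnsupportedOperationException e) {
            // ignore: this parser implementation does not support reset()
        }
        released++;
    }

    // Simulates an acquire step that may fail before a parser is obtained.
    static SAXParser acquire(boolean fail) throws Exception {
        if (fail) {
            throw new IllegalStateException("interrupted while waiting for SAXParser");
        }
        return SAXParserFactory.newInstance().newSAXParser();
    }

    // Mirrors the try/finally shape of read(): parser stays null if acquire fails.
    static boolean read(boolean failAcquire) {
        SAXParser parser = null;
        try {
            parser = acquire(failAcquire);
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            releaseParser(parser); // safe even when parser is still null
        }
    }
}
```

Without the guard, the failing path would reach {{parser.reset()}} with a null reference and the NullPointerException from the finally block would mask the original error.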





[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-08 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859265#comment-16859265
 ] 

Sergey Beryozkin commented on TIKA-2889:


[~talli...@apache.org] sorry, I was away this week. This issue has been closed 
now, so maybe CXF/Jetty was not at fault in this case.

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
>Reporter: Thomas van Hesteren
>Priority: Minor
> Attachments: log4j.xml, tika-2.log, tika-server-everything-2.log, 
> tika-server-everything.log, tika-server-everything.log, 
> tika-server-everything.log, tika-server-everything.log, 
> tika-server-everything3.log, tika.log, tika.log, tika3.log
>
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|Processing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB





[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code

2019-05-29 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851032#comment-16851032
 ] 

Sergey Beryozkin edited comment on TIKA-2882 at 5/29/19 4:26 PM:
-

I will definitely support getting the tika-parser-modules idea prioritized in 
2.0 (please start a dev thread if you'd like, so that the categories, what 
goes where, etc. can be reviewed). Maybe it is a simplistic view, but if we 
postpone the OSGi-ification aspects till later, then it is about moving 
specific parsers out of tika-parsers into their respective modules? Sorry if 
that is not the case :-). I can definitely help with testing, or with moving 
some parsers to a new module once I see how you do it for one of these 
modules :-)


was (Author: sergey_beryozkin):
I will definitely support getting the tika-parser-modules idea in 2.0 
prioritized (please start a dev thread if you'd like, so that the categories 
can be reviewed what goes where etc). May be it is a simplistic view but if 
postpone the OSGI-ification aspects till later then it is about moving specific 
parsers out of tika-parsers to respective modules ? Sorry if it is not the case 
:-). I can definitely help with testing. Or moving some parsers to a new module 
once I see how you do it one of these modules :-) 



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-29 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851032#comment-16851032
 ] 

Sergey Beryozkin commented on TIKA-2882:


I will definitely support getting the tika-parser-modules idea prioritized in 
2.0 (please start a dev thread if you'd like, so that the categories, what 
goes where, etc. can be reviewed). Maybe it is a simplistic view, but if we 
postpone the OSGi-ification aspects till later, then it is about moving 
specific parsers out of tika-parsers into their respective modules? Sorry if 
that is not the case :-). I can definitely help with testing, or with moving 
some parsers to a new module once I see how you do it for one of these 
modules :-)



[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code

2019-05-28 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849697#comment-16849697
 ] 

Sergey Beryozkin edited comment on TIKA-2882 at 5/28/19 1:39 PM:
-

I see, I was thinking of the 2.x branch :-)
Let's start with the https://github.com/apache/tika/tree/2.x/tika-parser-modules 
idea in 2.0 master?


was (Author: sergey_beryozkin):
I see, I was thinking of the 2.x branch :-)
Lets starts with the 
https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master ?



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-28 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849697#comment-16849697
 ] 

Sergey Beryozkin commented on TIKA-2882:


I see, I was thinking of the 2.x branch :-)
Let's start with the 
https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master?



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-28 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849593#comment-16849593
 ] 

Sergey Beryozkin commented on TIKA-2882:


[~talli...@apache.org], as far as Tika 2.0 is concerned, would it make sense 
to start applying similar ideas in the 1.x line? Or to make the 2.0 branch 
the master?



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-27 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848866#comment-16848866
 ] 

Sergey Beryozkin commented on TIKA-2882:


Oh, is it multipart? In that case maybe it has to be replaced with something 
neutral such as Apache HttpClient, or even with manual creation of the 
multipart payload.
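
To give an idea of what the manual option involves, here is a minimal sketch that builds a single-part {{multipart/form-data}} body with the JDK only. {{MultipartSketch}}, the field name, and the boundary value are illustrative assumptions, not what the GROBID or NLTK parsers actually send, and no escaping of quotes in names is attempted.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a one-part multipart/form-data payload by hand.
class MultipartSketch {
    static final String CRLF = "\r\n";

    // Value for the request's Content-Type header.
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    // Part headers, a blank line, the raw content, then the closing boundary.
    static byte[] singleFilePart(String boundary, String fieldName, String fileName,
                                 String partContentType, byte[] content) {
        String head = "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: " + partContentType + CRLF
                + CRLF;
        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);
        byte[] tailBytes = (CRLF + "--" + boundary + "--" + CRLF)
                .getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(content, 0, content.length);
        out.write(tailBytes, 0, tailBytes.length);
        return out.toByteArray();
    }
}
```

The resulting bytes can be posted with any HTTP client, which keeps the parser free of a specific JAX-RS implementation; the boundary must not occur inside the content.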



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-26 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848489#comment-16848489
 ] 

Sergey Beryozkin commented on TIKA-2882:


Please give it a try; I can help with the migration if needed.



[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-26 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848485#comment-16848485
 ] 

Sergey Beryozkin commented on TIKA-2882:


Could you consider a PR where the CXF WebClient code is replaced by the JAX-RS 
2.0 client API?

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Resolved] (TIKA-2862) Make PDF Parser GraalVM native mode ready

2019-05-22 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2862.

Resolution: Not A Problem

The issue is at the PDFBox level, so it will be addressed in PDFBox; Tika will 
then pick up the fix as part of a regular PDFBox dependency update.
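
For illustration only, here is a minimal sketch of the general idea behind 
this kind of error: an object created during class initialization should not 
hold an open file, so the file is opened on first use instead. LazyFontFile 
and all of its members are hypothetical names; this is not the PDFBox change 
and not Tika code, and whether such a pattern is sufficient depends on how 
class initialization is configured for the native image.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, not part of PDFBox or Tika.
// Constructing an instance does not open the file, so an instance created
// during class initialization carries no FileDescriptor.
class LazyFontFile implements AutoCloseable {

    private final Path path;
    private RandomAccessFile file; // stays null until first use

    LazyFontFile(Path path) {
        this.path = path;
    }

    synchronized boolean isOpen() {
        return file != null;
    }

    // Opens the file on the first call; later calls reuse the same handle.
    synchronized RandomAccessFile get() throws IOException {
        if (file == null) {
            file = new RandomAccessFile(path.toFile(), "r");
        }
        return file;
    }

    @Override
    public synchronized void close() throws IOException {
        if (file != null) {
            file.close();
            file = null;
        }
    }

    // Demo helper: returns {openBeforeUse, openAfterUse} for a temporary file.
    static boolean[] demo() {
        try {
            Path tmp = Files.createTempFile("lazy-font", ".bin");
            try (LazyFontFile font = new LazyFontFile(tmp)) {
                boolean before = font.isOpen();
                font.get();
                boolean after = font.isOpen();
                return new boolean[] {before, after};
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("open before use: " + r[0] + ", open after use: " + r[1]);
    }
}
```

The trade-off of this sketch is that the first caller pays the cost of opening 
the file, and the accessor has to be synchronized (or otherwise made 
thread-safe).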


> Make PDF Parser GraalVM native mode ready 
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Commented] (TIKA-2862) Make PDF Parser GraalVM native mode ready

2019-05-14 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839640#comment-16839640
 ] 

Sergey Beryozkin commented on TIKA-2862:


See https://github.com/apache/pdfbox/pull/69

> Make PDF Parser GraalVM native mode ready 
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Updated] (TIKA-2862) Make PDF Parser GraalVM native mode ready

2019-05-14 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2862:
---
Summary: Make PDF Parser GraalVM native mode ready   (was: Make PDF Parser 
Graal native mode ready )

> Make PDF Parser GraalVM native mode ready 
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Updated] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-10 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2862:
---
Description: 
PDF Parser is not Graal native mode ready yet, the following is reported when 
it is processed as part of Quarkus native mode build:
{noformat}
Error: Detected a FileDescriptor in the image heap. You can manually delay 
class initialization to image run time by using the option 
--delay-class-initialization-to-runtime=. ...
Detailed message:
 Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
     object org.apache.fontbox.ttf.RAFDataStream
     object org.apache.fontbox.ttf.TrueTypeFont
     object org.apache.pdfbox.pdmodel.font.PDType1Font
     method 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
 Call path from entry point to 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): 
     at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
     at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
     at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
{noformat} 

See also 
[https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]

  was:
PDF Parser is not Graal native mode ready yet, the following is reported when 
it is processed as part of Quarkus native mode build:

Error: Detected a FileDescriptor in the image heap. You can manually delay 
class initialization to image run time by using the option 
--delay-class-initialization-to-runtime=. ...
Detailed message:
Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
    object org.apache.fontbox.ttf.RAFDataStream
    object org.apache.fontbox.ttf.TrueTypeFont
    object org.apache.pdfbox.pdmodel.font.PDType1Font
    method 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
Call path from entry point to 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): 
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
    at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
    at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)

 

See also 
[https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]


> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Comment Edited] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-10 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837432#comment-16837432
 ] 

Sergey Beryozkin edited comment on TIKA-2862 at 5/10/19 4:37 PM:
-

The call path from PDType1Font to RAFDataStream:

{noformat}
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91)
{noformat}


was (Author: sergey_beryozkin):
The call path from PDType1Font to RAFDataStream:

{noformat}
17:31:09,714 ERROR [org.apa.pdf.pdm.fon.FileSystemFontProvider] Could not load 
font file: /usr/share/fonts/liberation/LiberationSans-Regular.ttf: 
java.lang.NullPointerException
at 
org.apache.fontbox.ttf.RAFDataStream.readSignedShort(RAFDataStream.java:77)
at 
org.apache.fontbox.ttf.TTFDataStream.read32Fixed(TTFDataStream.java:50)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91)
{noformat}

> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>     at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
>  
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Commented] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-10 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837432#comment-16837432
 ] 

Sergey Beryozkin commented on TIKA-2862:


The call path from PDType1Font to RAFDataStream:

{noformat}
17:31:09,714 ERROR [org.apa.pdf.pdm.fon.FileSystemFontProvider] Could not load 
font file: /usr/share/fonts/liberation/LiberationSans-Regular.ttf: 
java.lang.NullPointerException
at 
org.apache.fontbox.ttf.RAFDataStream.readSignedShort(RAFDataStream.java:77)
at 
org.apache.fontbox.ttf.TTFDataStream.read32Fixed(TTFDataStream.java:50)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91)
{noformat}

> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>     at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
>  
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Assigned] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-03 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin reassigned TIKA-2862:
--

Assignee: Sergey Beryozkin

> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>     at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
>  
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Created] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-02 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-2862:
--

 Summary: Make PDF Parser Graal native mode ready 
 Key: TIKA-2862
 URL: https://issues.apache.org/jira/browse/TIKA-2862
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.20
Reporter: Sergey Beryozkin


PDF Parser is not Graal native mode ready yet, the following is reported when 
it is processed as part of Quarkus native mode build:

Error: Detected a FileDescriptor in the image heap. You can manually delay 
class initialization to image run time by using the option 
--delay-class-initialization-to-runtime=. ...
Detailed message:
Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
    object org.apache.fontbox.ttf.RAFDataStream
    object org.apache.fontbox.ttf.TrueTypeFont
    object org.apache.pdfbox.pdmodel.font.PDType1Font
    method 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
Call path from entry point to 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): 
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
    at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
    at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)

 

See also 
[https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Resolved] (TIKA-2476) Metadata.toString always returns a trailing space

2017-10-11 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2476.

Resolution: Fixed
  Assignee: Sergey Beryozkin

> Metadata.toString always returns a trailing space
> -
>
> Key: TIKA-2476
> URL: https://issues.apache.org/jira/browse/TIKA-2476
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TIKA-2476) Metadata.toString always returns a trailing space

2017-10-11 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-2476:
--

 Summary: Metadata.toString always returns a trailing space
 Key: TIKA-2476
 URL: https://issues.apache.org/jira/browse/TIKA-2476
 Project: Tika
  Issue Type: Improvement
  Components: core
Affects Versions: 1.16
Reporter: Sergey Beryozkin
Priority: Trivial
 Fix For: 1.17








[jira] [Resolved] (TIKA-2472) Implement Metadata.hashCode

2017-10-11 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2472.

Resolution: Fixed

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-10 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198709#comment-16198709
 ] 

Sergey Beryozkin commented on TIKA-2472:


This is fixed now...

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-07 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195792#comment-16195792
 ] 

Sergey Beryozkin commented on TIKA-2472:


Ken, thanks for the tip; it makes sense to follow this path at the Metadata 
level as well.

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195303#comment-16195303
 ] 

Sergey Beryozkin commented on TIKA-2472:


I got a bit of a shock with this code:
{code:java}
import java.util.HashMap;
import java.util.Map;
import org.junit.Test;

@Test
public void testIt() {
    // Two maps with the same key and equal-content, but distinct, arrays
    Map<String, String[]> map1 = new HashMap<>();
    map1.put("A", new String[] {"a"});
    Map<String, String[]> map2 = new HashMap<>();
    map2.put("A", new String[] {"a"});

    System.out.println(map1.equals(map2));
    System.out.println(map1.hashCode() == map2.hashCode());
}
{code}
'false' is printed in both cases, which is obvious really, given that arrays 
use identity-based equals and hashCode.
Eugene, you are right, thanks for being on top of these changes, you'll make me 
a Java champion soon :-)

Guys, should we update Metadata to use a List of Strings? (Though it is a 
separate issue.)
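
The identity comparison of array values described above can be worked around 
by comparing the arrays element-wise. The following is a minimal sketch under 
that assumption; ArrayValuedMaps, contentEquals and contentHashCode are 
hypothetical names and this is not the code that went into Metadata.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, not Tika's Metadata implementation.
class ArrayValuedMaps {

    // Content-based equality for maps whose values are String arrays.
    static boolean contentEquals(Map<String, String[]> a, Map<String, String[]> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (Map.Entry<String, String[]> e : a.entrySet()) {
            // containsKey distinguishes a missing key from a key mapped to null.
            if (!b.containsKey(e.getKey())
                    || !Arrays.equals(e.getValue(), b.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Content-based hash: consistent with contentEquals and independent of
    // the iteration order of the map.
    static int contentHashCode(Map<String, String[]> m) {
        int h = 0;
        for (Map.Entry<String, String[]> e : m.entrySet()) {
            h += Objects.hashCode(e.getKey()) ^ Arrays.hashCode(e.getValue());
        }
        return h;
    }

    public static void main(String[] args) {
        Map<String, String[]> map1 = new HashMap<>();
        map1.put("A", new String[] {"a"});
        Map<String, String[]> map2 = new HashMap<>();
        map2.put("A", new String[] {"a"});

        // Map.equals compares the arrays by identity.
        System.out.println(map1.equals(map2));
        // The helpers compare the arrays element by element.
        System.out.println(contentEquals(map1, map2));
        System.out.println(contentHashCode(map1) == contentHashCode(map2));
    }
}
```

Storing the values as a List of Strings would give the same content-based 
behaviour through the standard Map.equals and Map.hashCode, without a helper.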

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Reopened] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin reopened TIKA-2472:


With thanks to Eugene...

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195275#comment-16195275
 ] 

Sergey Beryozkin commented on TIKA-2472:


I would not qualify it as incorrect, but as sub-optimal. And I know how the 
relevant Map hashCode is implemented: I copied it to ParseResult as a temporary 
substitute (to be honest, it does not really matter how hashCode or even 
equals is implemented if ParseResult keeps a file location, which is the 
real key). That said, I have no problem with improving this code.

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (TIKA-2472) Implement Metadata.hashCode

2017-10-02 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2472.

Resolution: Fixed

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Created] (TIKA-2472) Implement Metadata.hashCode

2017-10-02 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-2472:
--

 Summary: Implement Metadata.hashCode
 Key: TIKA-2472
 URL: https://issues.apache.org/jira/browse/TIKA-2472
 Project: Tika
  Issue Type: Improvement
Affects Versions: 1.16
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Trivial
 Fix For: 1.17








[jira] [Commented] (TIKA-2384) Double close of InputStream in accept text/plain in tika-server

2017-06-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034923#comment-16034923
 ] 

Sergey Beryozkin commented on TIKA-2384:


Not sure; maybe they can be closed in TikaResource's StreamingOutput 
implementation?
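
A minimal sketch of that idea, where the streaming callback owns the input 
stream and closes it exactly once. StreamingBody is a hypothetical stand-in 
for the JAX-RS StreamingOutput interface so that the example runs without 
dependencies, and CountingInputStream exists only to observe the number of 
close() calls; none of this is the tika-server code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class SingleCloseSketch {

    // Hypothetical stand-in for JAX-RS StreamingOutput.
    interface StreamingBody {
        void write(OutputStream out) throws IOException;
    }

    // Helper that counts how often close() is called on the wrapped stream.
    static final class CountingInputStream extends FilterInputStream {
        int closeCount;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closeCount++;
            super.close();
        }
    }

    // The callback owns the stream: try-with-resources closes it once,
    // whether the copy succeeds or throws. No other code closes it.
    static StreamingBody copyAndClose(InputStream source) {
        return out -> {
            try (InputStream in = source) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        };
    }

    // Demo helper: copies the data through the callback and reports how
    // many times the input stream was closed.
    static int closeCountAfterCopy(byte[] data) {
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyAndClose(in).write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (out.size() != data.length) {
            throw new IllegalStateException("copy was incomplete");
        }
        return in.closeCount;
    }

    public static void main(String[] args) {
        System.out.println("close() calls: " + closeCountAfterCopy(new byte[] {1, 2, 3}));
    }
}
```

The design choice here is single ownership: whichever component closes the 
stream should be the only one that does, so the resource method itself would 
not also close it.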

> Double close of InputStream in accept text/plain in tika-server
> ---
>
> Key: TIKA-2384
> URL: https://issues.apache.org/jira/browse/TIKA-2384
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.15
>Reporter: Tim Allison
>Priority: Blocker
>
> As reported by Haris Osmanagic on the user list, TikaResource closes the 
> InputStream twice on requests for text/plain.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (TIKA-2384) Double close of InputStream in accept text/plain in tika-server

2017-06-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034859#comment-16034859
 ] 

Sergey Beryozkin edited comment on TIKA-2384 at 6/2/17 3:24 PM:


Hi Tim, CXF will only auto-close an InputStream when its task is to copy that
InputStream (and it can be configured not to do so). This can only happen if
the InputStream is returned as a service method response (directly or as a
JAX-RS Response entity).


was (Author: sergey_beryozkin):
Hi Tim, CXF will only auto-close InputStream if its task is to copy InputStream 
(or it can avoid doing it if configured as such), which can only happen if 
InputStream returned as a service method response (directly or as JAX-RS 
Response entity) 

> Double close of InputStream in accept text/plain in tika-server
> ---
>
> Key: TIKA-2384
> URL: https://issues.apache.org/jira/browse/TIKA-2384
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.15
>Reporter: Tim Allison
>Priority: Blocker
>
> As reported by Haris Osmanagic on the user list, TikaResource closes the 
> InputStream twice on requests for text/plain.  





[jira] [Commented] (TIKA-2384) Double close of InputStream in accept text/plain in tika-server

2017-06-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034859#comment-16034859
 ] 

Sergey Beryozkin commented on TIKA-2384:


Hi Tim, CXF will only auto-close an InputStream when its task is to copy that
InputStream (and it can be configured not to do so). This can only happen if
the InputStream is returned as a service method response (directly or as a
JAX-RS Response entity).

> Double close of InputStream in accept text/plain in tika-server
> ---
>
> Key: TIKA-2384
> URL: https://issues.apache.org/jira/browse/TIKA-2384
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.15
>Reporter: Tim Allison
>Priority: Blocker
>
> As reported by Haris Osmanagic on the user list, TikaResource closes the 
> InputStream twice on requests for text/plain.  





[jira] [Commented] (TIKA-2292) Update CXF version to 3.0.12

2017-03-12 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906697#comment-15906697
 ] 

Sergey Beryozkin commented on TIKA-2292:


Hi - thanks for this fix :-)

> Update CXF version to 3.0.12
> 
>
> Key: TIKA-2292
> URL: https://issues.apache.org/jira/browse/TIKA-2292
> Project: Tika
>  Issue Type: Task
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.15
>
>
> This is the last version in the CXF 3.0.x line which supports Java 6





[jira] [Created] (TIKA-2292) Update CXF version to 3.0.12

2017-03-07 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-2292:
--

 Summary: Update CXF version to 3.0.12
 Key: TIKA-2292
 URL: https://issues.apache.org/jira/browse/TIKA-2292
 Project: Tika
  Issue Type: Task
  Components: server
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Minor
 Fix For: 1.15


This is the last version in the CXF 3.0.x line which supports Java 6





[jira] [Commented] (TIKA-2017) Tika Server Cannot handle large files

2016-06-24 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348132#comment-15348132
 ] 

Sergey Beryozkin commented on TIKA-2017:


It might also be worth trying multipart requests. I've updated the wiki to note
that Metadata, RecursiveMetadata and TikaResource support multipart requests:
https://wiki.apache.org/tika/TikaJAXRS#preview

By the way, I recall updating the PDF parser a while back so that it can parse
the metadata only, without touching the content; for that, the ContentHandler
needs to be set to null. See testPdfParsingMetadataOnly in PDFParserTest. It
might make sense to update other parsers too, though in this case using
multipart requests alone might help.
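
For readers unfamiliar with multipart requests, the following is a minimal
sketch of what a multipart/form-data body with a single file part looks like
on the wire. It is an illustration only: the class name, the field name
"upload", the file name and the boundary are all assumptions, and the endpoint
path to post to is not shown here and should be taken from the wiki page
above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class MultipartBodySketch {
    private static final String CRLF = "\r\n";

    // Builds a multipart/form-data body that carries exactly one file part.
    static byte[] singleFilePart(String boundary, String fieldName,
                                 String fileName, byte[] content) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        String partHeaders = "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: application/octet-stream" + CRLF
                + CRLF;
        body.write(partHeaders.getBytes(StandardCharsets.UTF_8));
        body.write(content);
        // The closing delimiter is the boundary followed by two extra dashes.
        body.write((CRLF + "--" + boundary + "--" + CRLF).getBytes(StandardCharsets.UTF_8));
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);
        byte[] body = singleFilePart("sketch-boundary", "upload", "data.csv", csv);
        // The request would carry the header:
        //   Content-Type: multipart/form-data; boundary=sketch-boundary
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

In practice an HTTP client library would normally build this body; the sketch
only shows the structure that the server-side multipart support has to read.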



 

> Tika Server Cannot handle large files
> -
>
> Key: TIKA-2017
> URL: https://issues.apache.org/jira/browse/TIKA-2017
> Project: Tika
>  Issue Type: Bug
>Reporter: Harshavardhan Manjunatha
> Fix For: 1.14
>
>
> Tika-Python uses Tika REST Server to parse both content & metadata. In this 
> case, the CSV file was 600 MB in size. Tika REST Server runs out of Heap 
> Space since it tries to parse Content also. There should be an option to make a 
> REST API call to Tika Server just to parse & return metadata.
> {code}
> Jun 22, 2016 6:38:40 PM org.slf4j.impl.JCLLoggerAdapter warn
> WARNING: /rmeta/text
> java.lang.RuntimeException: org.apache.cxf.interceptor.Fault: Java heap space
> at org.apache.cxf.interceptor.AbstractFaultChainInitiatorObserver.onMessage(AbstractFaultChainInitiatorObserver.java:116)
> at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:371)
> at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:370)
> at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
> at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
> at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
> at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
> at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.cxf.interceptor.Fault: Java heap space
> at org.apache.cxf.service.invoker.AbstractInvoker.createFault(AbstractInvoker.java:163)
> at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:129)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
> at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> ... 21 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data

2016-03-21 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-1871.

Resolution: Fixed

> Update Tika JAXRS wiki page with the info about multipart/form-data
> ---
>
> Key: TIKA-1871
> URL: https://issues.apache.org/jira/browse/TIKA-1871
> Project: Tika
>  Issue Type: Task
>  Components: documentation
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 1.13
>
>






[jira] [Commented] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data

2016-03-21 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204947#comment-15204947
 ] 

Sergey Beryozkin commented on TIKA-1871:


I've made a minor update to the introduction at
https://wiki.apache.org/tika/TikaJAXRS#Services

and added an extra sub-section under

https://wiki.apache.org/tika/TikaJAXRS#Tika_Resource:

https://wiki.apache.org/tika/TikaJAXRS#Multipart_Support

> Update Tika JAXRS wiki page with the info about multipart/form-data
> ---
>
> Key: TIKA-1871
> URL: https://issues.apache.org/jira/browse/TIKA-1871
> Project: Tika
>  Issue Type: Task
>  Components: documentation
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 1.13
>
>






[jira] [Updated] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data

2016-03-19 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-1871:
---
 Priority: Major  (was: Minor)
Fix Version/s: 1.13

> Update Tika JAXRS wiki page with the info about multipart/form-data
> ---
>
> Key: TIKA-1871
> URL: https://issues.apache.org/jira/browse/TIKA-1871
> Project: Tika
>  Issue Type: Task
>  Components: documentation
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 1.13
>
>






[jira] [Created] (TIKA-1871) Update Tika JAXRS wiki page with the info about multipart/form-data

2016-02-24 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-1871:
--

 Summary: Update Tika JAXRS wiki page with the info about 
multipart/form-data
 Key: TIKA-1871
 URL: https://issues.apache.org/jira/browse/TIKA-1871
 Project: Tika
  Issue Type: Task
  Components: documentation
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Minor








[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699760#comment-14699760
 ] 

Sergey Beryozkin commented on TIKA-1712:


Great, thanks

> GROBID parser fails in tika-app
> ---
>
> Key: TIKA-1712
> URL: https://issues.apache.org/jira/browse/TIKA-1712
> Project: Tika
>  Issue Type: Bug
>  Components: cli, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in 
> tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. 
> See:
> https://issues.apache.org/jira/browse/CXF-6545
> Try calling the GROBID parser from Tika app:
> java -classpath $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=$HOME/git/grobidparser-resources/tika-config.xml -J $HOME/git/grobid/papers/ICSE06.pdf
> After following this guide:
> https://wiki.apache.org/tika/GrobidJournalParser
> Works fine in Tika-Server - dies in Tika-app with:
> {noformat}
> java.lang.NullPointerException
>   at org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849)
>   at org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923)
>   at org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125)
>   at org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098)
>   at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894)
>   at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865)
>   at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331)
>   at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340)
>   at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82)
>   at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>   at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> java.lang.NullPointerException
>   at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89)
>   at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>   at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> {noformat}





[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699687#comment-14699687
 ] 

Sergey Beryozkin commented on TIKA-1712:


Hi Chris,
Dan Kulp reminded me that back in CXF 2.7.x the maven-shade-plugin was used:

https://fisheye6.atlassian.com/browse/cxf/osgi/bundle/all/pom.xml?r=12acd46e3dbe98fa1321374b09174d5876271f08#to448

It can help with collapsing the files into a single one.
I guess the tika-app assembly needs to be updated accordingly, though I've
never tried working with the shade plugin myself.

Thanks

> GROBID parser fails in tika-app
> ---
>
> Key: TIKA-1712
> URL: https://issues.apache.org/jira/browse/TIKA-1712
> Project: Tika
>  Issue Type: Bug
>  Components: cli, server
>Reporter: Chris A. Mattmann
>Assignee: Sergey Beryozkin
> Fix For: 1.11
>
>





[jira] [Comment Edited] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699646#comment-14699646
 ] 

Sergey Beryozkin edited comment on TIKA-1712 at 8/17/15 3:09 PM:
-

Add the following to bus-extensions.txt in your local tika-app.jar to check
whether it fixes the problem:

org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true
org.apache.cxf.bus.managers.WorkQueueManagerImpl:org.apache.cxf.workqueue.WorkQueueManager:true
org.apache.cxf.bus.managers.CXFBusLifeCycleManager:org.apache.cxf.buslifecycle.BusLifeCycleManager:true

org.apache.cxf.bus.managers.ServerRegistryImpl:org.apache.cxf.endpoint.ServerRegistry:true
org.apache.cxf.bus.managers.EndpointResolverRegistryImpl:org.apache.cxf.endpoint.EndpointResolverRegistry:true
org.apache.cxf.bus.managers.HeaderManagerImpl:org.apache.cxf.headers.HeaderManager:true
org.apache.cxf.service.factory.FactoryBeanListenerManager::true
org.apache.cxf.bus.managers.ServerLifeCycleManagerImpl:org.apache.cxf.endpoint.ServerLifeCycleManager:true
org.apache.cxf.bus.managers.ClientLifeCycleManagerImpl:org.apache.cxf.endpoint.ClientLifeCycleManager:true
org.apache.cxf.bus.resource.ResourceManagerImpl:org.apache.cxf.resource.ResourceManager:true
org.apache.cxf.catalog.OASISCatalogManager:org.apache.cxf.catalog.OASISCatalogManager:true

The first line is probably enough...
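
To make the shape of these entries easier to read, here is a minimal parsing
sketch. It assumes each line has the form implementation:interface:flag, and
that the third field is a "deferred loading" boolean; that reading of the flag
is an assumption rather than something stated in this thread.
BusExtensionEntry is a hypothetical class for illustration, not a CXF type.

```java
final class BusExtensionEntry {
    final String implementationClass;
    final String interfaceName;  // empty when the line contains "::"
    final boolean deferred;      // assumed meaning of the third field

    BusExtensionEntry(String implementationClass, String interfaceName, boolean deferred) {
        this.implementationClass = implementationClass;
        this.interfaceName = interfaceName;
        this.deferred = deferred;
    }

    // Splits one line on ':'; limit -1 keeps empty fields such as the one in "Foo::true".
    static BusExtensionEntry parse(String line) {
        String[] fields = line.trim().split(":", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                    "expected at least 3 ':'-separated fields: " + line);
        }
        return new BusExtensionEntry(fields[0], fields[1], Boolean.parseBoolean(fields[2]));
    }

    public static void main(String[] args) {
        BusExtensionEntry entry = parse(
                "org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true");
        System.out.println(entry.interfaceName + " is provided by " + entry.implementationClass);
    }
}
```

Read this way, the first entry is the one that registers a PhaseManager, which
fits the NullPointerException in setupOutInterceptorChain quoted below.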



was (Author: sergey_beryozkin):
Add the following to bus-extensions.txt in your local tika-app.jar to validate 
it will fix it:

org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true
org.apache.cxf.bus.managers.WorkQueueManagerImpl:org.apache.cxf.workqueue.WorkQueueManager:true
org.apache.cxf.bus.managers.CXFBusLifeCycleManager:org.apache.cxf.buslifecycle.BusLifeCycleManager:true

org.apache.cxf.bus.managers.ServerRegistryImpl:org.apache.cxf.endpoint.ServerRegistry:true
org.apache.cxf.bus.managers.EndpointResolverRegistryImpl:org.apache.cxf.endpoint.EndpointResolverRegistry:true
org.apache.cxf.bus.managers.HeaderManagerImpl:org.apache.cxf.headers.HeaderManager:true
org.apache.cxf.service.factory.FactoryBeanListenerManager::true
org.apache.cxf.bus.managers.ServerLifeCycleManagerImpl:org.apache.cxf.endpoint.ServerLifeCycleManager:true
org.apache.cxf.bus.managers.ClientLifeCycleManagerImpl:org.apache.cxf.endpoint.ClientLifeCycleManager:true
org.apache.cxf.bus.resource.ResourceManagerImpl:org.apache.cxf.resource.ResourceManager:true
org.apache.cxf.catalog.OASISCatalogManager:org.apache.cxf.catalog.OASISCatalogManager:true



> GROBID parser fails in tika-app
> ---
>
> Key: TIKA-1712
> URL: https://issues.apache.org/jira/browse/TIKA-1712
> Project: Tika
>  Issue Type: Bug
>  Components: cli, server
>Reporter: Chris A. Mattmann
>Assignee: Sergey Beryozkin
> Fix For: 1.11
>
>

[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699646#comment-14699646
 ] 

Sergey Beryozkin commented on TIKA-1712:


Add the following to bus-extensions.txt in your local tika-app.jar to check
whether it fixes the problem:

org.apache.cxf.bus.managers.PhaseManagerImpl:org.apache.cxf.phase.PhaseManager:true
org.apache.cxf.bus.managers.WorkQueueManagerImpl:org.apache.cxf.workqueue.WorkQueueManager:true
org.apache.cxf.bus.managers.CXFBusLifeCycleManager:org.apache.cxf.buslifecycle.BusLifeCycleManager:true

org.apache.cxf.bus.managers.ServerRegistryImpl:org.apache.cxf.endpoint.ServerRegistry:true
org.apache.cxf.bus.managers.EndpointResolverRegistryImpl:org.apache.cxf.endpoint.EndpointResolverRegistry:true
org.apache.cxf.bus.managers.HeaderManagerImpl:org.apache.cxf.headers.HeaderManager:true
org.apache.cxf.service.factory.FactoryBeanListenerManager::true
org.apache.cxf.bus.managers.ServerLifeCycleManagerImpl:org.apache.cxf.endpoint.ServerLifeCycleManager:true
org.apache.cxf.bus.managers.ClientLifeCycleManagerImpl:org.apache.cxf.endpoint.ClientLifeCycleManager:true
org.apache.cxf.bus.resource.ResourceManagerImpl:org.apache.cxf.resource.ResourceManager:true
org.apache.cxf.catalog.OASISCatalogManager:org.apache.cxf.catalog.OASISCatalogManager:true



> GROBID parser fails in tika-app
> ---
>
> Key: TIKA-1712
> URL: https://issues.apache.org/jira/browse/TIKA-1712
> Project: Tika
>  Issue Type: Bug
>  Components: cli, server
>Reporter: Chris A. Mattmann
>Assignee: Sergey Beryozkin
> Fix For: 1.11
>
>





[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699635#comment-14699635
 ] 

Sergey Beryozkin commented on TIKA-1712:


Hi Chris,

The problem is that META-INF/cxf/bus-extensions.txt in tika-app.jar is
incomplete: it only contains what is available inside cxf-rt-transports-http
and has none of the content available in cxf-core/bus-extensions.txt. After
updating the file in tika-app manually, I got the GROBID example working.

The solution is to have the content of all the META-INF/cxf/bus-extensions.txt
files available in tika-app in a single file.
I'm not sure how this can be achieved, though.
Sergey
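
One possible way to picture the "single file" idea is the sketch below. It
reads every copy of META-INF/cxf/bus-extensions.txt visible on the classpath
and merges the lines, dropping blanks and exact duplicates. This is an
illustration under assumptions, not the fix that was applied to tika-app: the
class and method names are hypothetical, and it only works while the
individual jars are still separate on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class BusExtensionsMergeSketch {
    static final String RESOURCE = "META-INF/cxf/bus-extensions.txt";

    // Merges the lines of several files, keeping first-seen order and
    // dropping blank lines and exact duplicates.
    static List<String> merge(List<List<String>> files) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> lines : files) {
            for (String line : lines) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    merged.add(trimmed);
                }
            }
        }
        return new ArrayList<>(merged);
    }

    // Reads every copy of the resource that the given class loader can see.
    static List<List<String>> readAll(ClassLoader loader) throws IOException {
        List<List<String>> files = new ArrayList<>();
        for (URL url : Collections.list(loader.getResources(RESOURCE))) {
            List<String> lines = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            files.add(lines);
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = BusExtensionsMergeSketch.class.getClassLoader();
        // Prints the merged content that a combined bus-extensions.txt would hold.
        for (String line : merge(readAll(loader))) {
            System.out.println(line);
        }
    }
}
```

At build time, a resource-appending step in the packaging tool (for example
the maven-shade-plugin's AppendingTransformer, as far as I know) is the more
usual way to get the same concatenation into the final jar.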
 

> GROBID parser fails in tika-app
> ---
>
> Key: TIKA-1712
> URL: https://issues.apache.org/jira/browse/TIKA-1712
> Project: Tika
>  Issue Type: Bug
>  Components: cli, server
>Reporter: Chris A. Mattmann
>Assignee: Sergey Beryozkin
> Fix For: 1.11
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699377#comment-14699377
 ] 

Sergey Beryozkin commented on TIKA-1712:


Never mind, steps are described :-)

> GROBID parser fails in tika-app
> ---
>
> Key: TIKA-1712
> URL: https://issues.apache.org/jira/browse/TIKA-1712
> Project: Tika
>  Issue Type: Bug
>  Components: cli, server
>Reporter: Chris A. Mattmann
>Assignee: Sergey Beryozkin
> Fix For: 1.11
>
>
> Hey Sergey do you have any idea why CXF's 3.0.3 rt-client would work fine in 
> tika-server, but fail in tika-app? I'm seeing that with the GROBID parser. 
> See:
> https://issues.apache.org/jira/browse/CXF-6545
> Try calling the GROBID parser from Tika app:
> java -classpath 
> $HOME/git/grobidparser-resources/:target/tika-app-1.11-SNAPSHOT.jar 
> org.apache.tika.cli.TikaCLI 
> --config=$HOME/git/grobidparser-resources/tika-config.xml -J 
> $HOME/git/grobid/papers/ICSE06.pdf
> After following this guide:
> https://wiki.apache.org/tika/GrobidJournalParser
> Works fine in Tika-Server - dies in Tika-app with:
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.cxf.jaxrs.client.AbstractClient.setupOutInterceptorChain(AbstractClient.java:849)
>   at 
> org.apache.cxf.jaxrs.client.AbstractClient.createMessage(AbstractClient.java:923)
>   at 
> org.apache.cxf.jaxrs.client.WebClient.finalizeMessage(WebClient.java:1125)
>   at 
> org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1098)
>   at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:894)
>   at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:865)
>   at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:331)
>   at org.apache.cxf.jaxrs.client.WebClient.post(WebClient.java:340)
>   at 
> org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:82)
>   at 
> org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>   at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:89)
>   at 
> org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:67)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>   at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:504)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:484)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> {noformat}





[jira] [Commented] (TIKA-1712) GROBID parser fails in tika-app

2015-08-17 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699362#comment-14699362
 ] 

Sergey Beryozkin commented on TIKA-1712:


Hi Chris, all, sorry, I missed this issue completely; I probably thought it 
was related to the actual parsing only.

Could you please do me a favour and attach a sample GROBID resource, plus 
whatever is needed in a grobidparser-resources folder, so that I can start the 
server and reproduce the issue?

Thanks



[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persitsence of Tika Metadata

2015-04-21 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504999#comment-14504999
 ] 

Sergey Beryozkin commented on TIKA-1607:


Hi, 
IMHO it indeed makes sense to keep the existing Metadata methods that return 
String values, but also to offer optional support for representing Metadata as 
a multivalued map of arbitrary object keys/values, where the original String to 
String[] pairs are converted into something more sophisticated if required...

By the way, the JAX-RS API has this interface:
http://docs.oracle.com/javaee/7/api/javax/ws/rs/core/MultivaluedMap.html

I'm not suggesting using it natively in Tika, but it might be of interest...
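
To make the conversion idea a bit more concrete, here is a minimal sketch using 
only the JDK. It assumes the existing metadata can be viewed as String to 
String[] pairs; the class and method names (MultivaluedMetadataSketch, 
toMultivalued, getFirst) are made up for illustration and are not Tika or 
JAX-RS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: turns String -> String[] pairs
// into a multivalued map of String -> List<Object>.
class MultivaluedMetadataSketch {

    static Map<String, List<Object>> toMultivalued(Map<String, String[]> flat) {
        Map<String, List<Object>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : flat.entrySet()) {
            // Copy the values so later changes to the array do not leak through.
            result.put(e.getKey(), new ArrayList<Object>(Arrays.asList(e.getValue())));
        }
        return result;
    }

    // Returns the first value as a String (or null), so that String-based
    // callers can keep working against the richer map.
    static String getFirst(Map<String, List<Object>> map, String key) {
        List<Object> values = map.get(key);
        return (values == null || values.isEmpty()) ? null : String.valueOf(values.get(0));
    }
}
```

The values are typed as Object so that richer structures could later be stored 
next to plain strings, while getFirst keeps a String view for existing callers.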

Cheers, Sergey



> Introduce new HashMap data structure for persitsence of Tika 
> Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.9
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis





[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362467#comment-14362467
 ] 

Sergey Beryozkin commented on TIKA-891:
---

What I do feel rather strongly about is that existing users should not 
start seeing client code that expects PUT to be supported break. 
The duplication is usually of some concern to developers initially, but it 
always proves negligible in terms of how much is actually duplicated. 
Alternatively, a ContainerRequestFilter can adapt PUT requests to POST and 
save a bit of typing on duplicate method definitions, though at the minor 
extra cost of having to maintain yet another provider. 

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: newbie
> Fix For: 1.9
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-14 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362039#comment-14362039
 ] 

Sergey Beryozkin commented on TIKA-891:
---

That is not the case. In JAX-RS, if you need to have multiple HTTP methods 
supported on the same path, you just have multiple Java methods, annotated 
with PUT, POST, etc., delegating to a common implementation.
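
A rough sketch of that pattern follows. To keep it compilable without a JAX-RS 
jar, the PUT, POST and Path annotations are declared locally as stand-ins; in 
real code they would be imported from javax.ws.rs. MetaResourceSketch and its 
placeholder extract logic are hypothetical and are not the actual tika-server 
resource.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-ins so the sketch compiles without a JAX-RS jar on the classpath;
// real code would import javax.ws.rs.PUT, javax.ws.rs.POST and javax.ws.rs.Path.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface PUT {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface POST {}
@Retention(RetentionPolicy.RUNTIME) @Target({ElementType.TYPE, ElementType.METHOD})
@interface Path { String value(); }

// Hypothetical resource, not the actual tika-server class.
@Path("/meta")
class MetaResourceSketch {

    @PUT
    public String putMeta(String body) {
        return extract(body);
    }

    @POST
    public String postMeta(String body) {
        return extract(body);
    }

    // Common implementation shared by both verbs (placeholder logic only).
    private String extract(String body) {
        return "length=" + body.length();
    }
}
```

Each verb gets a one-line method, so the duplication stays limited to the 
method signatures while the actual work lives in one place.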



[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-03 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344857#comment-14344857
 ] 

Sergey Beryozkin commented on TIKA-891:
---

Well, I guess we have to be careful with replacing all PUTs with POSTs. As far 
as I recall, one of the curl commands documented on the Tika JAX-RS page uses 
PUT implicitly. Supporting both methods is a more conservative approach, in the 
short term at least.



[jira] [Comment Edited] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343816#comment-14343816
 ] 

Sergey Beryozkin edited comment on TIKA-891 at 3/2/15 9:44 PM:
---

If I said PUT was the same as POST then I'd probably be seriously criticized at 
the very least :-). My point was that neither verb is probably that 
semantically close to the type of action Tika Server supports. Typically I 
prefer just to get things done, though sometimes at the risk of writing 
possibly not very pure REST code :-). However, you are right, forms do not 
support PUT, hence I guess it'd be better to have POST across the board for the 
sake of consistency. And if you agree, then keep PUT for 1.8 as deprecated and 
remove it in 1.9.

Sergey


was (Author: sergey_beryozkin):
If I said PUT was the same as POST I'd be probably seriously criticized at the 
very least :-). My point was, that neither verb is probably not that 
semantically close for the type of the action Tika Server supports. Typically I 
prefer just to get things done though risking sometimes writing a possibly not 
very pure REST code :-). However you are rights forms do not support PUT hence 
I guess I'd be better to have POST across the board for the sake of 
consistency. And if you agree - then keeping PUT for 1.8 as deprecated and 
removing in 1.9

Sergey

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: newbie
> Fix For: 1.8
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343816#comment-14343816
 ] 

Sergey Beryozkin commented on TIKA-891:
---

If I said PUT was the same as POST I'd probably be seriously criticized at the 
very least :-). My point was that neither verb is probably that 
semantically close to the type of action Tika Server supports. Typically I 
prefer just to get things done, though sometimes at the risk of writing 
possibly not very pure REST code :-). However, you are right, forms do not 
support PUT, hence I guess it'd be better to have POST across the board for the 
sake of consistency. And if you agree, then keep PUT for 1.8 as deprecated and 
remove it in 1.9.

Sergey



[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343373#comment-14343373
 ] 

Sergey Beryozkin commented on TIKA-891:
---

Just to clarify: I've no objections to migrating to POST from PUT. I guess 
neither fits perfectly, perhaps POST is marginally 'cleaner', but ultimately it 
is not that important IMHO :-).

The question is how to support the existing users. Keeping PUT for 1.8 only 
might make sense. I guess it depends on whether 1.8 is considered a major 
version or not.

Thanks  




[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343330#comment-14343330
 ] 

Sergey Beryozkin commented on TIKA-891:
---

Why are both PUT and POST out? GET does not fit, as it carries no input content.





[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-02 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342977#comment-14342977
 ] 

Sergey Beryozkin commented on TIKA-891:
---

IMHO it might make sense to keep PUT as deprecated for the next release so that 
existing users can still get it working, and to provide documentation 
stating that PUT will be removed in 1.9. Or at the very least offer such a 
migration guide for 1.8.

As a side note, I don't know why PUT was chosen in the first place, but 
speaking of the semantics, I'm not 100% sure POST (effectively adding a 
resource to a collection) is the closest match to what the Tika JAX-RS server 
does, which is accepting a resource and echoing its metadata/data back. 

Cheers, Sergey



[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253426#comment-14253426
 ] 

Sergey Beryozkin commented on TIKA-1497:


Hi Tim, we are all JAX-RS experts here, JAX-RS is the easiest part :-)
+1  to merging MetadataEp into MetadataResource
Thanks, Sergey

> tika-server cannot output JSON
> --
>
> Key: TIKA-1497
> URL: https://issues.apache.org/jira/browse/TIKA-1497
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Peter Bowyer
> Attachments: TIKA-1497.patch, TIKA-1497v2.patch
>
>
> I would like the response from 
> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
> to be JSON and not CSV?.
> I've discovered JSONMessageBodyWriter.java 
> (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
>  so I think the functionality is present, tried adding --header "Accept: 
> application/json" to the cURL call, in line with the documentation for 
> outputting CSV, but no luck so far.
> According to [~sergey_beryozkin]
> "I see MetadataResource returning StreamingOutput and it has 
> @Produces(text/csv) only. As such this MBW has no effect at the moment.
> We can update MetadataResource to return Metadata directly if 
> application/json is requested or update MetadataResource to directly convert 
> Metadata to JSON in case of JSON being accepted."





[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253240#comment-14253240
 ] 

Sergey Beryozkin commented on TIKA-1497:


Hi Tim, 
Thanks for the fix, IMHO the latest metadata resource code authored by you is 
absolutely perfect :-)






[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252239#comment-14252239
 ] 

Sergey Beryozkin commented on TIKA-1497:


Hi Tim, sorry for the delay. If you are also fine with it then yes, it would be 
a bit more flexible (we won't need to update method signatures in the future if 
custom headers like ETags, etc. need to be added).

Should MetadataEP be merged into MetadataResource as Chris suggested, or 
would you prefer to wait till 2.0?

Cheers, Sergey




[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251952#comment-14251952
 ] 

Sergey Beryozkin commented on TIKA-1497:


Hi, yes, looks nice, though returning Response is marginally more flexible in 
cases where extra custom headers need to be set (as opposed to doing it from a 
ContainerResponseFilter). It makes no difference to the runtime whether 
Metadata is returned as a Response entity or directly; the only minor issue, 
and this is CXF-specific, is that it does not play well with the client proxy 
API, which is not used in the Tika tests...

Cheers, Sergey 



[jira] [Comment Edited] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251920#comment-14251920
 ] 

Sergey Beryozkin edited comment on TIKA-1497 at 12/18/14 5:27 PM:
--

Hi Tim

We can probably have
{code:java}
@Produces({"text/csv", "application/json"})
{code}

for both PUT methods, and have JSONMessageBodyWriter carry the same Produces 
and be renamed to MetadataMessageBodyWriter, with its writeTo checking whether 
the passed media type is text/csv or JSON and converting accordingly. Or a 
dedicated CSV provider can be added as you suggest. Indeed, returning Metadata 
directly (wrapping it in Response is OK too) works well with custom MBWs.
thanks, Sergey  


was (Author: sergey_beryozkin):
Hi Tim

We can probably have
{code:java}
@Produces({"text/csv", "application/json"})
{code}

for both PUT methods and have JSONMessageBodyWriter also have the same Produces 
and renamed to MetadataMessageBodyWriter and in its writre to  check if the 
passed media type is text/csv or json and convert accordingly. Or a dedicated 
CSV provider can be added as you suggest. Indeed, returning Metadata directly 
(wrapped in Response is Ok too) works well with custom MBWs
thanks, Sergey  

> tika-server cannot output JSON
> --
>
> Key: TIKA-1497
> URL: https://issues.apache.org/jira/browse/TIKA-1497
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Peter Bowyer
> Attachments: TIKA-1497.patch
>
>
> I would like the response from 
> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
> to be JSON and not CSV?.
> I've discovered JSONMessageBodyWriter.java 
> (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
>  so I think the functionality is present, tried adding --header "Accept: 
> application/json" to the cURL call, in line with the documentation for 
> outputting CSV, but no luck so far.
> According to [~sergey_beryozkin]
> "I see MetadataResource returning StreamingOutput and it has 
> @Produces(text/csv) only. As such this MBW has no effect at the moment.
> We can update MetadataResource to return Metadata directly if 
> application/json is requested or update MetadataResource to directly convert 
> Metadata to JSON in case of JSON being accepted."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251920#comment-14251920
 ] 

Sergey Beryozkin commented on TIKA-1497:


Hi Tim

We can probably have
{code:java}
@Produces({"text/csv", "application/json"})
{code}

for both PUT methods, give JSONMessageBodyWriter the same Produces, rename it 
to MetadataMessageBodyWriter, and have its writeTo check whether the passed 
media type is text/csv or JSON and convert accordingly. Or a dedicated CSV 
provider can be added, as you suggest. Indeed, returning Metadata directly 
(wrapping it in a Response is OK too) works well with custom MBWs.
thanks, Sergey  
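
A minimal sketch of the media-type check described above, written in plain Java so that it runs without JAX-RS on the classpath. In a real provider this logic would sit in the writeTo method of a MessageBodyWriter; the class name MetadataFormats, the Map-based stand-in for Metadata, and the CSV/JSON formatting details are assumptions for illustration, not Tika's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: stands in for the body of a MessageBodyWriter.writeTo.
class MetadataFormats {

    // Pick the output format from the negotiated media type.
    static String write(Map<String, String> metadata, String mediaType) {
        if ("application/json".equals(mediaType)) {
            return toJson(metadata);
        }
        if ("text/csv".equals(mediaType)) {
            return toCsv(metadata);
        }
        throw new IllegalArgumentException("Unsupported media type: " + mediaType);
    }

    // One "name","value" row per metadata entry.
    static String toCsv(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append(csvQuote(e.getKey())).append(',')
              .append(csvQuote(e.getValue())).append('\n');
        }
        return sb.toString();
    }

    // A flat JSON object with one member per metadata entry.
    static String toJson(Map<String, String> metadata) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            joiner.add(jsonString(e.getKey()) + ":" + jsonString(e.getValue()));
        }
        return joiner.toString();
    }

    // CSV quoting: wrap in quotes and double any embedded quote.
    static String csvQuote(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    // Escapes only backslash and double quote; a real writer would use a JSON library.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        System.out.print(write(metadata, "text/csv"));
        System.out.println(write(metadata, "application/json"));
    }
}
```

With a single writer registered for both media types, the resource methods only need to return the metadata object and let content negotiation select the format.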

> tika-server cannot output JSON
> --
>
> Key: TIKA-1497
> URL: https://issues.apache.org/jira/browse/TIKA-1497
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Peter Bowyer
> Attachments: TIKA-1497.patch
>
>
> I would like the response from 
> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
> to be JSON and not CSV?.
> I've discovered JSONMessageBodyWriter.java 
> (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
>  so I think the functionality is present, tried adding --header "Accept: 
> application/json" to the cURL call, in line with the documentation for 
> outputting CSV, but no luck so far.
> According to [~sergey_beryozkin]
> "I see MetadataResource returning StreamingOutput and it has 
> @Produces(text/csv) only. As such this MBW has no effect at the moment.
> We can update MetadataResource to return Metadata directly if 
> application/json is requested or update MetadataResource to directly convert 
> Metadata to JSON in case of JSON being accepted."





[jira] [Comment Edited] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251920#comment-14251920
 ] 

Sergey Beryozkin edited comment on TIKA-1497 at 12/18/14 5:27 PM:
--

Hi Tim

We can probably have
{code:java}
@Produces({"text/csv", "application/json"})
{code}

for both PUT methods, give JSONMessageBodyWriter the same Produces, rename it 
to MetadataMessageBodyWriter, and have its writeTo check whether the passed 
media type is text/csv or JSON and convert accordingly. Or a dedicated CSV 
provider can be added, as you suggest. Indeed, returning Metadata directly 
(wrapping it in a Response is OK too) works well with custom MBWs.
thanks, Sergey  


was (Author: sergey_beryozkin):
Hi Tim

We can probably have
{code:java}
@Produces({"text/csv", "application/json"})
{code:java}

for both PUT methods and have JSONMessageBodyWriter also have the same Produces 
and renamed to MetadataMessageBodyWriter and in its writre to  check if the 
passed media type is text/csv or json and convert accordingly. Or a dedicated 
CSV provider can be added as you suggest. Indeed, returning Metadata directly 
(wrapped in Response is Ok too) works well with custom MBWs
thanks, Sergey  

> tika-server cannot output JSON
> --
>
> Key: TIKA-1497
> URL: https://issues.apache.org/jira/browse/TIKA-1497
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Peter Bowyer
> Attachments: TIKA-1497.patch
>
>
> I would like the response from 
> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
> to be JSON and not CSV?.
> I've discovered JSONMessageBodyWriter.java 
> (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
>  so I think the functionality is present, tried adding --header "Accept: 
> application/json" to the cURL call, in line with the documentation for 
> outputting CSV, but no luck so far.
> According to [~sergey_beryozkin]
> "I see MetadataResource returning StreamingOutput and it has 
> @Produces(text/csv) only. As such this MBW has no effect at the moment.
> We can update MetadataResource to return Metadata directly if 
> application/json is requested or update MetadataResource to directly convert 
> Metadata to JSON in case of JSON being accepted."





[jira] [Commented] (TIKA-1494) JAXRS server: allow passing PDF password in the request

2014-12-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246552#comment-14246552
 ] 

Sergey Beryozkin commented on TIKA-1494:


Hi, looks like it is not recommended any longer:

https://tools.ietf.org/html/rfc7231#section-8.3.1

Cheers, Sergey

> JAXRS server: allow passing PDF password in the request
> ---
>
> Key: TIKA-1494
> URL: https://issues.apache.org/jira/browse/TIKA-1494
> Project: Tika
>  Issue Type: New Feature
>  Components: server
>Reporter: Peter Bowyer
>  Labels: encryption, pdf, server
> Fix For: 1.7
>
>
> I have to extract content from encrypted PDFs. Setting the PDF password using 
> the TIKA_PASSWORD environment variable works, however it only allows for one 
> PDF password.
> It would be very useful to be able to pass the password in during the HTTP 
> request - as a header or by some other means. This way I can run a central 
> JAXRS server and extract content from all PDFs, rather than a separate server 
> for each department/group that has its own password.





[jira] [Commented] (TIKA-1494) JAXRS server: allow passing PDF password in the request

2014-12-12 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244076#comment-14244076
 ] 

Sergey Beryozkin commented on TIKA-1494:


Hi Nick, All
It should work indeed.
By the way, should it depend on the resolution of 
https://issues.apache.org/jira/browse/TIKA-894 ? Maybe it is at least related.
We can definitely support passing passwords and reporting the decrypted content 
over plain HTTP as a POC, but 
some users would likely prefer having at least 1-way TLS activated, which is 
where running Tika in a servlet container would help.
Sergey


> JAXRS server: allow passing PDF password in the request
> ---
>
> Key: TIKA-1494
> URL: https://issues.apache.org/jira/browse/TIKA-1494
> Project: Tika
>  Issue Type: New Feature
>  Components: server
>Reporter: Peter Bowyer
>  Labels: encryption, pdf, server
>
> I have to extract content from encrypted PDFs. Setting the PDF password using 
> the TIKA_PASSWORD environment variable works, however it only allows for one 
> PDF password.
> It would be very useful to be able to pass the password in during the HTTP 
> request - as a header or by some other means. This way I can run a central 
> JAXRS server and extract content from all PDFs, rather than a separate server 
> for each department/group that has its own password.





[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2014-12-12 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244068#comment-14244068
 ] 

Sergey Beryozkin commented on TIKA-894:
---

Hi Lewis, 
are you still interested? Maybe you can find some time before Christmas :-)
I tried to do a quick fix but got confused with what should be included as far 
as various Tika dependencies/parsers are concerned...
Cheers, Sergey


> Add webapp mode for Tika Server, simplifies deployment
> --
>
> Key: TIKA-894
> URL: https://issues.apache.org/jira/browse/TIKA-894
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.1, 1.2
>Reporter: Chris Wilson
>  Labels: maven, newbie, patch
> Fix For: 1.7
>
> Attachments: tika-server-webapp.patch
>
>
> For use in production services, Tika Server should really be deployed as a 
> WAR file, under a reliable servlet container that knows how to run as a 
> system service, for example Tomcat or JBoss.
> This is especially important on Windows, where I wasted an entire day trying 
> to make TikaServerCli run as some kind of a service. 
> Maven makes building a webapp pretty trivial. With the attached patch 
> applied, "mvn war:war" should work. It seems to run fine in Tomcat, which 
> makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
> file into tomcat's webapps directory and you're away.





[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2014-12-12 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-894:
--
Fix Version/s: 1.7

> Add webapp mode for Tika Server, simplifies deployment
> --
>
> Key: TIKA-894
> URL: https://issues.apache.org/jira/browse/TIKA-894
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.1, 1.2
>Reporter: Chris Wilson
>  Labels: maven, newbie, patch
> Fix For: 1.7
>
> Attachments: tika-server-webapp.patch
>
>
> For use in production services, Tika Server should really be deployed as a 
> WAR file, under a reliable servlet container that knows how to run as a 
> system service, for example Tomcat or JBoss.
> This is especially important on Windows, where I wasted an entire day trying 
> to make TikaServerCli run as some kind of a service. 
> Maven makes building a webapp pretty trivial. With the attached patch 
> applied, "mvn war:war" should work. It seems to run fine in Tomcat, which 
> makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
> file into tomcat's webapps directory and you're away.





[jira] [Comment Edited] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505
 ] 

Sergey Beryozkin edited comment on TIKA-1481 at 11/19/14 9:11 PM:
--

Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issues
Thanks, Sergey


was (Author: sergey_beryozkin):
Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see whta is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issue
Thanks, Sergey

> TikaJAXRS get metadata calls give different results
> ---
>
> Key: TIKA-1481
> URL: https://issues.apache.org/jira/browse/TIKA-1481
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6
> Environment: Windows 8, JDK 1.8
>Reporter: Darya Arbuzova
>Priority: Minor
> Attachments: sample.csv
>
>
> Hello!
> I'm trying to use Tika in server mode.
> I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
> I have tried to get file metadata in 2 different ways (as explained here: 
> http://wiki.apache.org/tika/TikaJAXRS ):
> {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: 
> text/csv"}}
> {{"Content-Encoding","windows-1252"}}
> {{"Content-Type","text/plain; charset=windows-1252"}}
> and
> {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header 
> "Content-Type: text/csv"}}
> {{"Content-Encoding","ISO-8859-1"}}
> {{"Content-Type","text/plain; charset=ISO-8859-1"}}
> How come they give different results in encoding if I call the same 
> {{http://localhost:9998/meta}}?
> What could the other differences appear and which is the preferable way to 
> get metadata?
> Many thanks!
> Best regards,
> Darya Arbuzova





[jira] [Comment Edited] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505
 ] 

Sergey Beryozkin edited comment on TIKA-1481 at 11/19/14 9:12 PM:
--

Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issues
Thanks, Sergey


was (Author: sergey_beryozkin):
Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issue
Thanks, Sergey

> TikaJAXRS get metadata calls give different results
> ---
>
> Key: TIKA-1481
> URL: https://issues.apache.org/jira/browse/TIKA-1481
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6
> Environment: Windows 8, JDK 1.8
>Reporter: Darya Arbuzova
>Priority: Minor
> Attachments: sample.csv
>
>
> Hello!
> I'm trying to use Tika in server mode.
> I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
> I have tried to get file metadata in 2 different ways (as explained here: 
> http://wiki.apache.org/tika/TikaJAXRS ):
> {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: 
> text/csv"}}
> {{"Content-Encoding","windows-1252"}}
> {{"Content-Type","text/plain; charset=windows-1252"}}
> and
> {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header 
> "Content-Type: text/csv"}}
> {{"Content-Encoding","ISO-8859-1"}}
> {{"Content-Type","text/plain; charset=ISO-8859-1"}}
> How come they give different results in encoding if I call the same 
> {{http://localhost:9998/meta}}?
> What could the other differences appear and which is the preferable way to 
> get metadata?
> Many thanks!
> Best regards,
> Darya Arbuzova





[jira] [Commented] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505
 ] 

Sergey Beryozkin commented on TIKA-1481:


Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issues
Thanks, Sergey
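
One difference a trace is likely to show, independent of Tika: curl's documentation states that -d @file strips carriage returns and newlines from the file before sending it, whereas -T file and --data-binary @file send the bytes unchanged. The sketch below only simulates that stripping in plain Java to show that the server would receive a different byte sequence for the same file. The class and method names are hypothetical, and whether this explains the differing charset guess is an assumption to confirm with the trace.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of the payload difference; not part of curl or Tika.
class CurlDataSimulation {

    // Mimic what curl documents for "-d @file": CR and LF bytes are dropped.
    static byte[] stripNewlines(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        for (byte b : input) {
            if (b != '\r' && b != '\n') {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A tiny CSV with Windows line endings, as it would sit on disk.
        byte[] onDisk = "a,b\r\n1,2\r\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] sentWithData = stripNewlines(onDisk);

        System.out.println("bytes on disk (sent as-is by -T): " + onDisk.length);
        System.out.println("bytes sent by -d @file: " + sentWithData.length);
        System.out.println(new String(sentWithData, StandardCharsets.ISO_8859_1));
    }
}
```

If the two requests need to be compared like for like, sending the file with --data-binary instead of -d keeps the body identical to what -T uploads.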

> TikaJAXRS get metadata calls give different results
> ---
>
> Key: TIKA-1481
> URL: https://issues.apache.org/jira/browse/TIKA-1481
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6
> Environment: Windows 8, JDK 1.8
>Reporter: Darya Arbuzova
>Priority: Minor
> Attachments: sample.csv
>
>
> Hello!
> I'm trying to use Tika in server mode.
> I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
> I have tried to get file metadata in 2 different ways (as explained here: 
> http://wiki.apache.org/tika/TikaJAXRS ):
> {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: 
> text/csv"}}
> {{"Content-Encoding","windows-1252"}}
> {{"Content-Type","text/plain; charset=windows-1252"}}
> and
> {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header 
> "Content-Type: text/csv"}}
> {{"Content-Encoding","ISO-8859-1"}}
> {{"Content-Type","text/plain; charset=ISO-8859-1"}}
> How come they give different results in encoding if I call the same 
> {{http://localhost:9998/meta}}?
> What could the other differences appear and which is the preferable way to 
> get metadata?
> Many thanks!
> Best regards,
> Darya Arbuzova





[jira] [Resolved] (TIKA-1242) Update CXF version to 3.0.2

2014-10-14 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-1242.

Resolution: Fixed

> Update CXF version to 3.0.2
> ---
>
> Key: TIKA-1242
> URL: https://issues.apache.org/jira/browse/TIKA-1242
> Project: Tika
>  Issue Type: Task
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.7
>
>
> CXF 3.0.2 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer 
> dependencies and is smaller compared to CXF 2.7.x one. It is also 
> backward-compatible with the applications written against JAX-RS 1.1.
> Lets do this upgrade after Tika 1.6 is out.
> CXF 3.1.0 is Java 7 based and is still under development.





[jira] [Updated] (TIKA-1242) Update CXF version to 3.0.2

2014-10-14 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-1242:
---
Description: 
CXF 3.0.2 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer 
dependencies and is smaller compared to CXF 2.7.x one. It is also 
backward-compatible with the applications written against JAX-RS 1.1.
Lets do this upgrade after Tika 1.6 is out.

CXF 3.1.0 is Java 7 based and is still under development.

  was:
CXF 3.1.0 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer 
dependencies and is smaller compared to CXF 2.7.x one. It is also 
backward-compatible with the applications written against JAX-RS 1.1.
Lets do this upgrade after Tika 1.6 is out

Summary: Update CXF version to 3.0.2  (was: Update CXF version to 3.1.0)

> Update CXF version to 3.0.2
> ---
>
> Key: TIKA-1242
> URL: https://issues.apache.org/jira/browse/TIKA-1242
> Project: Tika
>  Issue Type: Task
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.7
>
>
> CXF 3.0.2 JAX-RS front-end offers a complete JAX-RS 2.0 support, has fewer 
> dependencies and is smaller compared to CXF 2.7.x one. It is also 
> backward-compatible with the applications written against JAX-RS 1.1.
> Lets do this upgrade after Tika 1.6 is out.
> CXF 3.1.0 is Java 7 based and is still under development.





[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-08-06 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088100#comment-14088100
 ] 

Sergey Beryozkin commented on TIKA-1371:


I've no idea to be honest, it is there:
http://svn.apache.org/viewvc/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java?r1=1616118&r2=1616117&pathrev=1616118

Cheers, Sergey

> passing parameters via URL no longer works (regression)
> ---
>
> Key: TIKA-1371
> URL: https://issues.apache.org/jira/browse/TIKA-1371
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.5
>Reporter: Rob Tulloh
>
> In Tika 1.1 and 1.2, it was possible to add some values to the URL that get 
> logged like this:
> http://localhost:9998/tika/GUID/FILENAME
> This was very useful for correlating between client and server in a 
> distributed compute environment. In 1.5 and in the nightly builds (for 1.6), 
> this feature no longer works. Not having this makes it very difficult to 
> troubleshoot problems with document processing in a distributed environment. 
> Please add back this feature so that operations and development teams can 
> more easily figure out which tika instance is processing which document and 
> what the result of the processing resulted in.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-08-06 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087404#comment-14087404
 ] 

Sergey Beryozkin commented on TIKA-1371:


I've introduced a TikaLoggingFilter; it logs the request URI at info or debug 
level. The command line option is "-l or log", taking either 'debug' or 'info' 
as the level. 
I hope it will resolve the issue.
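
A rough, dependency-free sketch of the idea: map the command-line value to a log level and log the request URI at that level. It uses java.util.logging rather than whatever logging API the real TikaLoggingFilter uses, and the names RequestUriLogger, parseLevel, format and logRequest are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a request-logging filter; not the actual TikaLoggingFilter.
class RequestUriLogger {

    private static final Logger LOG = Logger.getLogger(RequestUriLogger.class.getName());

    // Accept only the two values mentioned in the comment: 'info' and 'debug'.
    static Level parseLevel(String option) {
        if ("info".equalsIgnoreCase(option)) {
            return Level.INFO;
        }
        if ("debug".equalsIgnoreCase(option)) {
            // java.util.logging has no DEBUG level; FINE is the closest match.
            return Level.FINE;
        }
        throw new IllegalArgumentException("Unsupported log level: " + option);
    }

    // The message that would be written for each incoming request.
    static String format(String method, String requestUri) {
        return method + " " + requestUri;
    }

    static void logRequest(Level level, String method, String requestUri) {
        LOG.log(level, format(method, requestUri));
    }

    public static void main(String[] args) {
        Level level = parseLevel(args.length > 0 ? args[0] : "info");
        // With the default console handler, FINE messages are not printed
        // unless the handler level is lowered as well.
        logRequest(level, "PUT", "http://localhost:9998/tika/GUID/FILENAME");
    }
}
```

Logging the full request URI this way would let the extra path segments from the issue description show up in the server log for correlation.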

> passing parameters via URL no longer works (regression)
> ---
>
> Key: TIKA-1371
> URL: https://issues.apache.org/jira/browse/TIKA-1371
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.5
>Reporter: Rob Tulloh
>
> In Tika 1.1 and 1.2, it was possible to add some values to the URL that get 
> logged like this:
> http://localhost:9998/tika/GUID/FILENAME
> This was very useful for correlating between client and server in a 
> distributed compute environment. In 1.5 and in the nightly builds (for 1.6), 
> this feature no longer works. Not having this makes it very difficult to 
> troubleshoot problems with document processing in a distributed environment. 
> Please add back this feature so that operations and development teams can 
> more easily figure out which tika instance is processing which document and 
> what the result of the processing resulted in.





[jira] [Commented] (TIKA-1383) Simplify TikeServerCli endpoint setup code

2014-08-05 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086671#comment-14086671
 ] 

Sergey Beryozkin commented on TIKA-1383:


Hi Nick, yes, I messed it up a bit, removed the test by accident, realized it 
soon after I signed off :-). Thanks for fixing it 

> Simplify TikeServerCli endpoint setup code
> --
>
> Key: TIKA-1383
> URL: https://issues.apache.org/jira/browse/TIKA-1383
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.6
>
>






[jira] [Commented] (TIKA-1383) Simplify TikeServerCli endpoint setup code

2014-08-05 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086334#comment-14086334
 ] 

Sergey Beryozkin commented on TIKA-1383:


Sorry, I updated TikaWelcome to accept a list of resource providers as opposed 
to a factory bean - its class needs a bit of clean up to avoid some synch 
issues, but IMHO bypassing the factory is probably better going forward.
I've updated the the welcome test where HTML is tested - this was apparently 
coming from TikaWelcome itself (another reason to bypass the factory), I guess 
TikaWelcome should not report its own details.
I'm off now but I will look into fixing the issues if any later this evening or 
tomorrow morning
Cheers, Sergey

> Simplify TikeServerCli endpoint setup code
> --
>
> Key: TIKA-1383
> URL: https://issues.apache.org/jira/browse/TIKA-1383
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.6
>
>






[jira] [Commented] (TIKA-1383) Simplify TikeServerCli endpoint setup code

2014-08-05 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086249#comment-14086249
 ] 

Sergey Beryozkin commented on TIKA-1383:


Hi Nick, All,
I've removed some redundant code in the server registration code. If possible, 
please do a sanity check that it has not caused any side effects; all appears 
to be OK to me.
Sergey 

> Simplify TikeServerCli endpoint setup code
> --
>
> Key: TIKA-1383
> URL: https://issues.apache.org/jira/browse/TIKA-1383
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.6
>
>






[jira] [Resolved] (TIKA-1383) Simplify TikeServerCli endpoint setup code

2014-08-05 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-1383.


Resolution: Fixed

> Simplify TikeServerCli endpoint setup code
> --
>
> Key: TIKA-1383
> URL: https://issues.apache.org/jira/browse/TIKA-1383
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.6
>
>






[jira] [Created] (TIKA-1383) Simplify TikeServerCli endpoint setup code

2014-08-05 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-1383:
--

 Summary: Simplify TikeServerCli endpoint setup code
 Key: TIKA-1383
 URL: https://issues.apache.org/jira/browse/TIKA-1383
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Trivial
 Fix For: 1.6








[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071516#comment-14071516
 ] 

Sergey Beryozkin commented on TIKA-1371:


Can you please clarify what exactly does not work?
Is the absolute request URI not logged? Can you please type a sample request URI 
issued against Tika 1.5 and explain what you expect the server to do?

Thanks, Sergey

> passing parameters via URL no longer works (regression)
> ---
>
> Key: TIKA-1371
> URL: https://issues.apache.org/jira/browse/TIKA-1371
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.5
>Reporter: Rob Tulloh
>
> In Tika 1.1 and 1.2, it was possible to add some values to the URL that get 
> logged like this:
> http://localhost:9998/tika/GUID/FILENAME
> This was very useful for correlating between client and server in a 
> distributed compute environment. In 1.5 and in the nightly builds (for 1.6), 
> this feature no longer works. Not having this makes it very difficult to 
> troubleshoot problems with document processing in a distributed environment. 
> Please add back this feature so that operations and development teams can 
> more easily figure out which tika instance is processing which document and 
> what the result of the processing resulted in.





[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2014-07-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062606#comment-14062606
 ] 

Sergey Beryozkin commented on TIKA-1367:


Thanks for the proposal, I'm not sure though it would help. Consider we have a 
user not necessarily knowing what 'grep' is, for example someone working on 
Windows. Ideally as a user I'd like to have an easy way to solve this typical 
dependency issue: "My application will work with PDFs and OpenDocument docs 
only, how can I get all but the relevant dependencies excluded ?". I know some 
source and Maven based search can yield some info, but it would not be something 
every user can be expected to be able to do. 
For the record, here's what I see after grepping dependency:tree

{noformat}
[INFO] +- org.apache.tika:tika-core:jar:1.6-SNAPSHOT:compile
[INFO] +- org.gagravarr:vorbis-java-tika:jar:0.6:compile
[INFO] +- edu.ucar:netcdf:jar:4.2.20:compile
[INFO] |  +- edu.ucar:unidataCommon:jar:4.2.20:compile
[INFO] |  |  \- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.6.1:compile
[INFO] +- net.sourceforge.jmatio:jmatio:jar:1.0:compile
[INFO] +- org.apache.james:apache-mime4j-core:jar:0.7.2:compile
[INFO] +- org.apache.james:apache-mime4j-dom:jar:0.7.2:compile
[INFO] +- org.apache.commons:commons-compress:jar:1.8:compile
[INFO] |  \- org.tukaani:xz:jar:1.5:compile
[INFO] +- commons-codec:commons-codec:jar:1.5:compile
[INFO] +- org.apache.pdfbox:pdfbox:jar:1.8.6:compile
[INFO] |  +- org.apache.pdfbox:fontbox:jar:1.8.6:compile
[INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.6:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
[INFO] +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
[INFO] +- org.apache.poi:poi:jar:3.10-FINAL:compile
[INFO] +- org.apache.poi:poi-scratchpad:jar:3.10-FINAL:compile
[INFO] +- org.apache.poi:poi-ooxml:jar:3.10-FINAL:compile
[INFO] |  +- org.apache.poi:poi-ooxml-schemas:jar:3.10-FINAL:compile
[INFO] |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
[INFO] |  \- dom4j:dom4j:jar:1.6.1:compile
[INFO] +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
[INFO] +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] +- org.ow2.asm:asm-debug-all:jar:4.1:compile
[INFO] +- com.googlecode.mp4parser:isoparser:jar:1.0-RC-1:compile
[INFO] |  \- org.aspectj:aspectjrt:jar:1.6.11:compile
[INFO] +- com.drewnoakes:metadata-extractor:jar:2.6.2:compile
[INFO] |  +- com.adobe.xmp:xmpcore:jar:5.1.2:compile
[INFO] |  \- xerces:xercesImpl:jar:2.8.1:compile
[INFO] | \- xml-apis:xml-apis:jar:1.3.03:compile
[INFO] +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] +- rome:rome:jar:1.0:compile
[INFO] |  \- jdom:jdom:jar:1.0:compile
[INFO] +- org.gagravarr:vorbis-java-core:jar:0.6:compile
[INFO] +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] +- com.uwyn:jhighlight:jar:1.0:compile
[INFO] +- com.pff:java-libpst:jar:0.8.1:compile

{noformat}

It's difficult to know where to start excluding. As a user I have no idea what 
many of those dependencies are for, or whether some of them are needed by all 
Parser implementations. It's easy enough to spot what the PDF Parser will need 
(pdfbox), but trickier to see what else might be needed for PDF as well as 
for other types.
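
For illustration, excluding one group of parser libraries by hand looks roughly 
like the fragment below. This is only a sketch: it assumes a project that never 
parses Microsoft Office documents, and the version property is a placeholder. 
The three POI artifacts are the direct tika-parsers dependencies listed in the 
tree above; whether any other parser also relies on them is exactly the 
information that is not documented.

{noformat}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <!-- placeholder: set to the Tika version in use -->
  <version>${tika.version}</version>
  <exclusions>
    <!-- assumption: no MS Office parsing is needed, so Apache POI is dropped -->
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-scratchpad</artifactId>
    </exclusion>
    <!-- excluding poi-ooxml also removes what it pulls in transitively
         (poi-ooxml-schemas, xmlbeans, dom4j in the tree above) -->
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{noformat}

Every further parser family needs its own block of exclusions, and the list 
has to be revisited on each Tika upgrade.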

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sergey Beryozkin
> Fix For: 1.6
>
>
> tika-parsers module has many strong transitive parser dependencies. Maven 
> users of tika-parsers have to exclude all the transitive dependencies 
> manually. Documenting the list of the existing transitive dependencies and 
> keeping the list up to date will help developers exclude the libraries not 
> needed for a given project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers

2014-07-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098
 ] 

Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:14 PM:
-

#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: why don't we have a single 
Tika module only, so that users do not accidentally forget to include 
"tika-parsers", given that the source code would compile even without including 
tika-parsers ?

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of times that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users. Who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey



was (Author: sergey_beryozkin):
#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: what don't we have a single 
Tika module only so that users accidentally do not forget include 
"tika-parsers" given that the source code would compile even without including 
tika-parsers.

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of time that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users, who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey


> Improve the modularity of tika-parsers
> --
>
> Key: TIKA-1368
> URL: https://issues.apache.org/jira/browse/TIKA-1368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging, parser
>Affects Versions: 1.7
>Reporter: Sergey Beryozkin
>
> tika-parsers module has many strong transitive dependencies. This presents a 
> challenge to Maven tika-parsers users wishing to use only one or very few 
> Parser(s).
> The fact the new Parsers are regularly added makes the exclusion process very 
> brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 
> and having an exclusion list in place may 'leak' a new parser lib into its 
> runtime. 
> https://issues.apache.org/jira/browse/TIKA-1367
> can help on its own but a more complete solution would ideally be in place.
> Proposal:
> 1. Make tika-parsers transitive dependencies optional
> 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but 
> exclude 3rd-party dependencies
> Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, 
> users will be recommended to check the documentation and add the required 
> dependencies. 2 also works.
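
Option 1 could be sketched as follows, assuming that Maven's standard optional 
dependency mechanism is what is meant. pdfbox is used only as an example and 
the version properties are placeholders. An optional dependency is not passed 
on to projects that depend on tika-parsers, so each user declares the parser 
libraries actually needed.

{noformat}
<!-- in the tika-parsers pom (hypothetical change): mark the library optional -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>${pdfbox.version}</version>
  <optional>true</optional>
</dependency>

<!-- in the user's pom: tika-parsers plus the libraries the project needs -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>${tika.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>${pdfbox.version}</version>
</dependency>
{noformat}

With this layout a new parser library added in a later Tika release would not 
reach the user's runtime unless it is declared explicitly, which addresses the 
'leak' described above.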





[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers

2014-07-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098
 ] 

Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:14 PM:
-

#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: why don't we have a single 
Tika module only, so that users do not accidentally forget to include 
"tika-parsers", given that the source code would compile even without including 
tika-parsers ?

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of times that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users. Who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey



was (Author: sergey_beryozkin):
#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: what don't we have a single 
Tika module only so that users accidentally do not forget include 
"tika-parsers" given that the source code would compile even without including 
tika-parsers.

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of time that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution who will be exposed to our users, who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey


> Improve the modularity of tika-parsers
> --
>
> Key: TIKA-1368
> URL: https://issues.apache.org/jira/browse/TIKA-1368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging, parser
>Affects Versions: 1.7
>Reporter: Sergey Beryozkin
>
> tika-parsers module has many strong transitive dependencies. This presents a 
> challenge to Maven tika-parsers users wishing to use only one or very few 
> Parser(s).
> The fact the new Parsers are regularly added makes the exclusion process very 
> brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 
> and having an exclusion list in place may 'leak' a new parser lib into its 
> runtime. 
> https://issues.apache.org/jira/browse/TIKA-1367
> can help on its own but a more complete solution would ideally be in place.
> Proposal:
> 1. Make tika-parsers transitive dependencies optional
> 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but 
> exclude 3rd-party dependencies
> Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, 
> users will be recommended to check the documentation and add the required 
> dependencies. 2 also works.





[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers

2014-07-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098
 ] 

Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:15 PM:
-

#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: why don't we have a single 
Tika module only, so that users do not accidentally forget to include 
"tika-parsers", given that the source code would compile even without including 
tika-parsers ?

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of times that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users. Who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey



was (Author: sergey_beryozkin):
#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: what don't we have a single 
Tika module only so that users accidentally do not forget include 
"tika-parsers" given that the source code would compile even without including 
tika-parsers.

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of times that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users, who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey


> Improve the modularity of tika-parsers
> --
>
> Key: TIKA-1368
> URL: https://issues.apache.org/jira/browse/TIKA-1368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging, parser
>Affects Versions: 1.7
>Reporter: Sergey Beryozkin
>
> tika-parsers module has many strong transitive dependencies. This presents a 
> challenge to Maven tika-parsers users wishing to use only one or very few 
> Parser(s).
> The fact the new Parsers are regularly added makes the exclusion process very 
> brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 
> and having an exclusion list in place may 'leak' a new parser lib into its 
> runtime. 
> https://issues.apache.org/jira/browse/TIKA-1367
> can help on its own but a more complete solution would ideally be in place.
> Proposal:
> 1. Make tika-parsers transitive dependencies optional
> 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but 
> exclude 3rd-party dependencies
> Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, 
> users will be recommended to check the documentation and add the required 
> dependencies. 2 also works.





[jira] [Commented] (TIKA-1368) Improve the modularity of tika-parsers

2014-07-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098
 ] 

Sergey Beryozkin commented on TIKA-1368:


#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: why don't we have a single 
Tika module only, so that users do not accidentally forget to include 
"tika-parsers", given that the source code would compile even without including 
tika-parsers ?

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of times that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users. Who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey


> Improve the modularity of tika-parsers
> --
>
> Key: TIKA-1368
> URL: https://issues.apache.org/jira/browse/TIKA-1368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging, parser
>Affects Versions: 1.7
>Reporter: Sergey Beryozkin
>
> tika-parsers module has many strong transitive dependencies. This presents a 
> challenge to Maven tika-parsers users wishing to use only one or very few 
> Parser(s).
> The fact the new Parsers are regularly added makes the exclusion process very 
> brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 
> and having an exclusion list in place may 'leak' a new parser lib into its 
> runtime. 
> https://issues.apache.org/jira/browse/TIKA-1367
> can help on its own but a more complete solution would ideally be in place.
> Proposal:
> 1. Make tika-parsers transitive dependencies optional
> 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but 
> exclude 3rd-party dependencies
> Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, 
> users will be recommended to check the documentation and add the required 
> dependencies. 2 also works.





[jira] [Comment Edited] (TIKA-1368) Improve the modularity of tika-parsers

2014-07-15 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062098#comment-14062098
 ] 

Sergey Beryozkin edited comment on TIKA-1368 at 7/15/14 2:14 PM:
-

#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: why don't we have a single 
Tika module only, so that users do not accidentally forget to include 
"tika-parsers", given that the source code would compile even without including 
tika-parsers ?

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of times that users may 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution which will be exposed to our users. Who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey



was (Author: sergey_beryozkin):
#2 is for those users who know what they want, so I disagree with you 
qualifying #2 as non-viable, with #2 tika-parsers would stay as it is now.

Speaking of this runtime exception you are referring to. I'm sorry but it 
appears to be somewhat academic. I could've said: what don't we have a single 
Tika module only so that users accidentally do not forget include 
"tika-parsers" given that the source code would compile even without including 
tika-parsers.

What kind of application is it ? Is it expected to have some tests :-) ? I can 
only think of the completely generic Tika container, which is TikaServer. But 
TikaServer would be prepackaged. Can you offer a more realistic example please ?

It's not a huge issue. But I hope we will come up with a basic solution without 
getting locked into arguments :-). I've heard a number of time that users mat 
be affected. IMHO users who just would like to do a quick experiment can 
download the whole distro or use TikaServer. IMHO this is a rather narrow space 
where we have a Tika application which can accept anything without users paying 
any attention to the actual dependencies. On the other hand we will ship a 
simple Tika-based solution who will be exposed to our users, who would help 
those users who'd have to manually exclude many dependencies from tika-parsers ?


Thanks, Sergey


> Improve the modularity of tika-parsers
> --
>
> Key: TIKA-1368
> URL: https://issues.apache.org/jira/browse/TIKA-1368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging, parser
>Affects Versions: 1.7
>Reporter: Sergey Beryozkin
>
> tika-parsers module has many strong transitive dependencies. This presents a 
> challenge to Maven tika-parsers users wishing to use only one or very few 
> Parser(s).
> The fact the new Parsers are regularly added makes the exclusion process very 
> brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 
> and having an exclusion list in place may 'leak' a new parser lib into its 
> runtime. 
> https://issues.apache.org/jira/browse/TIKA-1367
> can help on its own but a more complete solution would ideally be in place.
> Proposal:
> 1. Make tika-parsers transitive dependencies optional
> 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but 
> exclude 3rd-party dependencies
> Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, 
> users will be recommended to check the documentation and add the required 
> dependencies. 2 also works.





[jira] [Updated] (TIKA-1368) Improve the modularity of tika-parsers

2014-07-15 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-1368:
---

Affects Version/s: 1.7

> Improve the modularity of tika-parsers
> --
>
> Key: TIKA-1368
> URL: https://issues.apache.org/jira/browse/TIKA-1368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging, parser
>Affects Versions: 1.7
>Reporter: Sergey Beryozkin
>
> tika-parsers module has many strong transitive dependencies. This presents a 
> challenge to Maven tika-parsers users wishing to use only one or very few 
> Parser(s).
> The fact the new Parsers are regularly added makes the exclusion process very 
> brittle. For example, an OSGI application switching from Tika 1.6 to Tika 1.7 
> and having an exclusion list in place may 'leak' a new parser lib into its 
> runtime. 
> https://issues.apache.org/jira/browse/TIKA-1367
> can help on its own but a more complete solution would ideally be in place.
> Proposal:
> 1. Make tika-parsers transitive dependencies optional
> 2. Introduce tika-parsers-optional pom that will depend on tika-parsers but 
> exclude 3rd-party dependencies
> Both 1 and 2 will depend on the resolution of TIKA-1367. IMHO 1 is cleaner, 
> users will be recommended to check the documentation and add the required 
> dependencies. 2 also works.




