[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179469#comment-17179469 ] Akash commented on TIKA-3170: - Tried extracting using pdfbox-app jar for both versions. Observ

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179656#comment-17179656 ] Akash commented on TIKA-3170: - 1 more observation. Extracted output remains same from tika app

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Tim Allison
If anyone has any time, please take a look here: https://github.com/apache/tika/tree/branch_2x/tika-parser-modules Does this basically look ok? I've put the integration tests in https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests ... that doesn't build yet. I've flipped B

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179666#comment-17179666 ] Akash commented on TIKA-3170: - [https://github.com/apache/tika/compare/35a2cd35129db3aae58fd65

[jira] [Commented] (TIKA-3148) Remove apache-cxf dependency from tika-parsers

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179675#comment-17179675 ] Tim Allison commented on TIKA-3148: --- [~sebastien.lep...@ymail.com], can you take a look

[jira] [Updated] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3170: Attachment: image-2020-08-18-20-23-16-159.png > PDF extraction space issue > -- > >

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179679#comment-17179679 ] Akash commented on TIKA-3170: - Difference because of !image-2020-08-18-20-23-16-159.png!

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Tim Allison
Thank you! >Somehow I did not find a couple of parsers, probably it is because of on-going work ... Yep. Exactly. I didn't want to put in the work in this direction if there were any showstoppers. >If we are going to make Tika more modern, maybe gradle can do a trick? My gradle isn't as strong

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Ken Krugler
Hi Tim, I looked at the HTML module, and seems logical/straightforward. Thanks for pushing on this. — Ken > On Aug 18, 2020, at 7:40 AM, Tim Allison wrote: > > If anyone has any time, please take a look here: > https://github.com/apache/tika/tree/branch_2x/tika-parser-modules > > Does this b

[jira] [Issue Comment Deleted] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-18 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3172: -- Comment: was deleted (was: Please try if setting and changing "sortByPosition" has any effect. T

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Oleg Tikhonov
Hi Tim, looks awesome. Somehow I did not find a couple of parsers, probably it is because of on-going work ... In addition, I was thinking about "getting rid of" maven. If we are going to make Tika more modern, maybe gradle can do a trick? Do we plan to add new Java "goodies" like lambdas, foreign

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-18 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179978#comment-17179978 ] Tilman Hausherr commented on TIKA-3172: --- I'm researching this a bit... what I found

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990 ] Akash commented on TIKA-3170: - Seems issue is already fixed as part of this commit - [https:/

[jira] [Comment Edited] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990 ] Akash edited comment on TIKA-3170 at 8/18/20, 6:07 PM: --- Seems issue

[jira] [Closed] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash closed TIKA-3170. --- Fix Version/s: 1.25 Resolution: Duplicate Duplicate of TIKA-3131 > PDF extraction space issue > --

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179993#comment-17179993 ] Akash commented on TIKA-3172: - Ok [~tilman]. If possible we can have that option. It will be a

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180014#comment-17180014 ] Tim Allison commented on TIKA-3173: --- Hi [~ndipiazza_gmail], OOM causes tika-server to go

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180015#comment-17180015 ] Tim Allison commented on TIKA-3173: --- We could also create a different parameter to allow

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180016#comment-17180016 ] Tim Allison commented on TIKA-3173: --- Is the behavior I've described above roughly what y

[jira] [Commented] (TIKA-3082) OpenAPI for tika-server

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180017#comment-17180017 ] Tim Allison commented on TIKA-3082: --- Thanks to [~lewismc] for taking the time to explain

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180018#comment-17180018 ] Tim Allison commented on TIKA-3173: --- Doesn't look like it...as I look above...hmmm

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-18 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180025#comment-17180025 ] Tilman Hausherr commented on TIKA-3172: --- AnnotationUtils.assignFieldParams() has a l

[jira] [Comment Edited] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180027#comment-17180027 ] Tim Allison edited comment on TIKA-3173 at 8/18/20, 7:24 PM: -

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180027#comment-17180027 ] Tim Allison commented on TIKA-3173: --- > Should the Watchdog have detected tika was dead i

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Nicholas DiPiazza (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180097#comment-17180097 ] Nicholas DiPiazza commented on TIKA-3173: - Between 54:36 and 57:00 the client side

[jira] [Comment Edited] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Nicholas DiPiazza (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180097#comment-17180097 ] Nicholas DiPiazza edited comment on TIKA-3173 at 8/18/20, 8:54 PM: -

[jira] [Comment Edited] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-18 Thread Nicholas DiPiazza (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180097#comment-17180097 ] Nicholas DiPiazza edited comment on TIKA-3173 at 8/18/20, 8:56 PM: -

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Bob Paulin
Hey Tim, Just started taking a look. The test-jar approach could work but I recall I ran into some issues with getting access to some of the test files inside the test-jars for some of the junits. For many tests this was simple but for some I think it would require larger functional changes to t

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-18 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180244#comment-17180244 ] Tilman Hausherr commented on TIKA-3172: --- Valid field names with the change: ocrStra